[RESOLVED] - D2 Designated Pool Grades jobs not processing for a significant time period
Incident Report for Kimono
Resolved
Issue:
Grades jobs were not processed for D2 designated pool clients from 2021-02-09T11:22 UTC until 2021-02-10T12:27 UTC, a window of about 25 hours. We are currently working through the backlog of work generated during this time, and have doubled the available workers to clear it as swiftly as possible.

Who is affected:
Any customer in the D2 Designated Pool that attempted to create a Grades Exchange during this window.

What happened:
At 11:22 UTC on February 9th, the D2 Canvas-Driver Grades app was restarted by our provider as part of their normal 24-hour restart process. When the application restarted, it received an error indicating that it was unable to start up successfully. The app attempted to restart again, but never came back up. We do not yet know the cause, and have reached out to our provider for further clarification. It is known that the provider was experiencing API issues at this time, but none appear related to our application's failure to restart. We will update this report as soon as we have more information about what caused the restart to hang.

Because this application never fully restarted, no grades work for the D2 designated pool was moved into the job queues for processing by our grades workers.

Once the app successfully restarted itself at 2021-02-10T12:27 UTC, it picked up all the accumulated jobs and began placing them in the job queues to start processing.

When we knew:
We have been investigating this since 8:45 MST this morning, when we saw an anomaly in the grades jobs in the D2 work queues.

What we could have done better:
We strive to have eyes and alerts all over our platform so that we can be as proactive as possible when these types of issues arise. We found that there is a place where jobs live before being placed onto any queue, and that watching it would have told us sooner that a backlog of work was accruing. We have now created an alert on this place and will be notified immediately if we start to see similar backlogs.
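
For illustration only, a check of this kind might look like the sketch below. It is not the alert we deployed: the `StagingSnapshot` type, the `backlog_alert` function, and both thresholds are hypothetical names and values chosen for the example. It assumes the place where jobs wait before queueing can report how many jobs are pending and when the oldest one was created.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StagingSnapshot:
    """Hypothetical view of the pre-queue holding area at one moment."""
    pending_jobs: int
    oldest_created_at: Optional[datetime]  # None when nothing is waiting


def backlog_alert(
    snapshot: StagingSnapshot,
    now: datetime,
    max_pending: int = 500,                       # assumed threshold
    max_age: timedelta = timedelta(minutes=30),   # assumed threshold
) -> Optional[str]:
    """Return a reason string if the backlog looks unhealthy, else None."""
    # Too many jobs waiting to be queued.
    if snapshot.pending_jobs > max_pending:
        return f"{snapshot.pending_jobs} jobs pending (limit {max_pending})"
    # Few jobs, but the oldest has been waiting too long. This is the case
    # that catches a stalled dispatcher during a quiet period.
    if snapshot.oldest_created_at is not None:
        age = now - snapshot.oldest_created_at
        if age > max_age:
            return f"oldest job has waited {age} (limit {max_age})"
    return None


if __name__ == "__main__":
    stalled = StagingSnapshot(
        pending_jobs=10,
        oldest_created_at=datetime(2021, 2, 9, 11, 22, tzinfo=timezone.utc),
    )
    check_time = datetime(2021, 2, 9, 12, 0, tzinfo=timezone.utc)
    print(backlog_alert(stalled, check_time))
```

Checking the age of the oldest waiting job, and not only the count, matters in a case like this one: a dispatcher that has stopped entirely can go unnoticed while volume is low if the alert only fires on a large count.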

Further, we have been planning to migrate the last of our applications off our legacy provider during a maintenance period this month. Our new platform provides us far more flexibility in scaling and predicting load than the legacy one. Given this, we have already moved the D2 Canvas-Driver Grades application to the new platform to help ensure this type of faulty restart does not occur again.

We apologize for the inconvenience this has caused those customers in the D2 pool. These types of issues are never acceptable to us and we continue to improve our platform and processes to mitigate these issues in the future.

Kimono Operations
Posted Feb 09, 2021 - 05:30 MST