Investigating Database Issue with One of Our Providers
Incident Report for Kimono
Postmortem

Author

Eric Adams - Director of Operations

Summary

From 20:02 UTC to 20:21 UTC (19 minutes), the US Kimono Platform was unavailable due to an issue with one of our databases.

Impact

17 scheduled collections in the 19:00 UTC scheduled hour failed to complete and were left in a non-terminal state. These had to be manually restarted by Kimono personnel; after restart, they completed successfully. The Kimono Dashboard and all Integrations were unavailable for the entirety of the outage. Due to the durable, asynchronous messaging architecture of the Kimono Platform, most integrations simply picked up where they left off and were not impacted beyond a 19-minute delay.
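
The recovery described above is the usual behavior of durable, at-least-once messaging: a consumer acknowledges a message only after successfully processing it, so anything still unacknowledged when an outage begins is simply redelivered once the consumer reconnects. The sketch below illustrates that pattern; it assumes a RabbitMQ-style broker accessed through the pika library, which is not necessarily the stack Kimono runs, and process_collection is a hypothetical handler.

    # Illustrative consumer sketch (assumed RabbitMQ/pika stack; queue name and handler are hypothetical).
    import pika

    def process_collection(body: bytes) -> None:
        """Hypothetical handler for one unit of integration work."""
        print(f"processing {len(body)} bytes")

    def main() -> None:
        connection = pika.BlockingConnection(pika.ConnectionParameters(host="broker.internal"))
        channel = connection.channel()
        # A durable queue survives broker restarts; messages persist until acknowledged.
        channel.queue_declare(queue="integration-work", durable=True)

        def on_message(ch, method, properties, body):
            process_collection(body)
            # Acknowledge only after success; unacked messages are redelivered after an outage.
            ch.basic_ack(delivery_tag=method.delivery_tag)

        channel.basic_consume(queue="integration-work", on_message_callback=on_message)
        channel.start_consuming()

    if __name__ == "__main__":
        main()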

Root Causes

The write-ahead logs for our high-availability failover database grew far above the allowed threshold, causing the database to enter a recovery mode to prevent any data loss.
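
One way to avoid being surprised by this kind of limit is to watch retained write-ahead log volume directly and alert well before the provider's threshold. The check below is a minimal monitoring sketch, assuming a PostgreSQL-style database with replication slots; the connection string and alert threshold are placeholders rather than our provider's actual values.

    # Minimal WAL-retention check (assumes PostgreSQL; threshold and DSN are placeholders).
    import psycopg2

    RETAINED_WAL_ALERT_BYTES = 8 * 1024 ** 3  # example alert threshold: 8 GiB

    def slots_over_threshold(dsn: str) -> list[tuple[str, int]]:
        """Return (slot_name, retained_wal_bytes) for slots retaining more WAL than the threshold."""
        query = """
            SELECT slot_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
            FROM pg_replication_slots
            WHERE restart_lsn IS NOT NULL
        """
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(query)
            return [(name, int(size)) for name, size in cur.fetchall()
                    if size is not None and int(size) > RETAINED_WAL_ALERT_BYTES]

    if __name__ == "__main__":
        for slot, size in slots_over_threshold("dbname=kimono host=db.internal"):
            print(f"ALERT: slot {slot} is retaining {size / 1024 ** 3:.1f} GiB of WAL")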

Resolution

Our provider increased the limits for the write-ahead logs of the affected database and restarted the database.

Detection

At 20:02 UTC, Kimono received an alert from our provider that one of our databases was no longer communicating with their monitoring tools. Immediately afterward, our own Kimono alerts notified us that our Dashboard and Integrations were unavailable.

Lessons Learned

What went wrong

  • We were caught unaware of a metric and limit that ultimately caused a primary database to become unavailable.
  • 17 collections were left incomplete and did not automatically recover after the database was back online.

What went well

  • We were alerted to the issue quickly, and our provider was already investigating the outage, so we were able to assess the problem and remedy the situation promptly.
  • No data loss
  • Most of the collections, save for the previously noted 17, automatically picked up where they left off and completed successfully.
  • Our StatusPage was updated to notify all customers of the outage.

What we will do better

  • We have been working with our provider for the past 24 hours to understand more about what happened and how we can mitigate it in the future.
  • We have already implemented several short-term mitigation strategies and have jointly monitored their impact with our provider. These strategies include improved pacing of write operations under peak load conditions (a rough sketch of this approach follows this list).
  • Our engineering team is working on longer-term strategies to address the root causes that contributed to the rapid growth of the write-ahead logs.
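
As a rough illustration of the write-pacing mitigation mentioned above, the sketch below caps sustained write throughput with a token bucket so that bursts at peak load cannot outrun the database's write-ahead logging. The rate and burst values are illustrative, not the numbers actually deployed, and write_row stands in for whatever write helper an ingestion worker uses.

    # Token-bucket write pacer (illustrative rates; write_row is a hypothetical helper).
    import time

    class WritePacer:
        """Caps sustained write throughput so WAL generation stays within limits."""

        def __init__(self, max_writes_per_sec: float, burst: int):
            self.rate = max_writes_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def acquire(self, n: int = 1) -> None:
            """Block until n write operations are allowed."""
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
                time.sleep((n - self.tokens) / self.rate)

    # Example: allow at most 500 writes/sec with bursts of up to 100.
    pacer = WritePacer(max_writes_per_sec=500, burst=100)
    # for row in batch:
    #     pacer.acquire()
    #     write_row(row)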

Timeline

Time (UTC) Description
20:02 Received notification from our provider that they had been unable to communicate with one of our databases.
20:02 Kimono alerts reported that our database was unable to accept communications.
20:03 Began working with the provider to remedy the situation.
20:10 The provider identified the issue and instructed the Kimono Ops team on next steps.
20:21 All Kimono processes were restored; the outage was over.

Posted Aug 17, 2018 - 10:14 MDT

Resolved
We have been monitoring for the last few hours. All Integrations, Dashboards, and Connectors have been restored to normal processing.
Posted Aug 15, 2018 - 19:15 MDT
Monitoring
We have identified the issue and are monitoring it now. All services should be back online.
Posted Aug 15, 2018 - 16:28 MDT
Investigating
We are currently investigating a database issue with one of our providers. All Ingestions are currently paused, and the Kimono dashboard is down. The SIF Zone Services appear to still be operational.
Posted Aug 15, 2018 - 16:14 MDT
This incident affected: Dashboard, Kimono Platform, Kimono Platform API and Integrations (Canvas Integration, Gauge Integration, Kimono Directories for Active Directory, Kimono Directories for G Suite, Kimono Grades).