Degraded Performance on Ingestions
Incident Report for Kimono
Postmortem

Author

Eric Adams - Director of Operations

Summary

A recent upgrade of our underlying infrastructure caused a configuration setting to not be applied, which in turn caused ingestions to process much more slowly than normal. A fix was applied late Monday night at around 23:30, and ingestions were monitored successfully throughout the night.

Impact

All ingestions processed more slowly than normal from Sunday night through Monday.

Root Causes

The upgrade of the underlying infrastructure caused an environment variable that determines how Kimono processes source_id and MatchKeys for ingestions to no longer be applied.
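
For context, this kind of failure typically looks like the application reading a tuning flag from the environment and silently falling back to a slower default when the variable is absent. The sketch below is illustrative only; the variable name KIMONO_MATCHKEY_FAST_LOOKUP and the function are hypothetical and not Kimono's actual implementation.

    import os

    # Hypothetical flag controlling how source_id / MatchKeys are resolved.
    # If an infrastructure upgrade drops the variable, os.getenv falls back to
    # the default and the code silently takes the slow path; nothing errors.
    FAST_MATCHKEY_LOOKUP = os.getenv("KIMONO_MATCHKEY_FAST_LOOKUP", "false").lower() == "true"

    def resolve_match_keys(records, index):
        if FAST_MATCHKEY_LOOKUP:
            # Normal behaviour: constant-time lookups against a prebuilt index.
            return [index.get(r["source_id"]) for r in records]
        # Silent fallback: a linear scan per record, which is dramatically
        # slower at ingestion volumes but produces identical results.
        return [next((v for k, v in index.items() if k == r["source_id"]), None)
                for r in records]

The notable property is that the degraded path is still functionally correct, so nothing fails loudly; the only symptom is throughput, which matches how this incident presented.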

Resolution

Kimono deployed a hotfix release with a new environment variable so that the intended configuration is applied on the upgraded infrastructure.
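
One common way to ship that kind of hotfix is to read the new variable name first and fall back to the old one, so the same release behaves correctly on both the old and upgraded infrastructure. This is a sketch of the approach under assumed variable names, not Kimono's actual code.

    import os

    def matchkey_fast_lookup_enabled():
        # Hypothetical: prefer the variable name used on the upgraded
        # infrastructure, then fall back to the old name so the release
        # behaves the same wherever it is deployed.
        value = os.getenv("KIMONO_MATCHKEY_FAST_LOOKUP_V2",
                          os.getenv("KIMONO_MATCHKEY_FAST_LOOKUP", "false"))
        return value.lower() == "true"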

Lessons Learned

What went wrong

  • We identified the slowness early Monday morning and began investigating.
  • It took most of the day to identify the cause of the slowness.
  • Many collections ended in FAILED or CANCELLED states during this period.

What went well

  • The Kimono Engineering and Operations teams worked closely together to identify and remediate the issue.

What we will do better

  • We have been working with our infrastructure provider for the past 24 hours to understand why their upgrade caused a previously working environment variable to stop being applied, and how we can mitigate similar issues in the future. One practical safeguard is validating required configuration at startup, as sketched below.
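
As an illustration of that safeguard, a service can verify its required environment variables at startup and refuse to start (or at least log loudly) when one is missing, rather than silently falling back to a slower code path. This is a minimal sketch under assumed variable names, not Kimono's actual configuration.

    import os
    import sys

    # Hypothetical list of variables the ingestion service depends on.
    REQUIRED_ENV_VARS = [
        "KIMONO_MATCHKEY_FAST_LOOKUP",
        "KIMONO_INGESTION_DB_URL",
    ]

    def validate_environment():
        missing = [name for name in REQUIRED_ENV_VARS if not os.getenv(name)]
        if missing:
            # Failing fast turns a silent performance regression into an
            # immediate, visible deployment error.
            sys.exit("missing required environment variables: " + ", ".join(missing))

    if __name__ == "__main__":
        validate_environment()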

Timeline

Time (UTC) - Description
Monday, October 8th, 06:00 - It was identified that many collections that had started the previous night were not processing at the normal rate. The Kimono Engineering team began investigating.
Monday, October 8th, 09:00 - Processing issues were escalated from the Operations team to the Engineering team to begin working on a resolution.
Monday, October 8th, 18:00 - Kimono Engineering identified the potential issue and began working on a hotfix to remediate the slowness.
Monday, October 8th, 22:00 - The hotfix was deployed and tested in our QA environment and proven to have the necessary changes.
Monday, October 8th, 23:15 - The hotfix was successfully deployed to the Production environment.
Posted Oct 09, 2018 - 13:33 MDT

Resolved
Ingestions are continuing to process as expected. We will post a post-mortem later today.
Posted Oct 09, 2018 - 06:04 MDT
Monitoring
Ingestions have returned to normal processing and we are currently monitoring their progress.
Posted Oct 08, 2018 - 22:37 MDT
Identified
We believe we have identified the issue with the ingestion slowness and are working on a mitigation strategy. We will update as new information is available.
Posted Oct 08, 2018 - 20:58 MDT
Update
We are continuing to investigate the slowness with the ingestions. We will continue to update as new information is made available.
Posted Oct 08, 2018 - 13:01 MDT
Investigating
We are currently investigating degraded performance on our ingestions. We will provide updates as soon as they are available.
Posted Oct 08, 2018 - 11:04 MDT