Post Mortem Quick Guide
Last year I ran an incident post mortem based on Atlassian's article. To keep the juiciest part of it as my personal copy, I've put the table here:
| Field | Instructions | Example |
|---|---|---|
| Incident summary | Summarize the incident in a few sentences. Include what the severity was, why, and how long the impact lasted. | Between [time range of incident, e.g. 14:30 and 15:00] on [date], [what customers experienced]. The event was detected by [monitoring system]. This [severity level] incident affected [scope of customers]. |
| Leadup | Describe the circumstances that led to this incident, for example, prior changes that introduced latent bugs. | At [time] on [date], … |
| Fault | Describe what didn't work as expected. Attach screenshots of relevant graphs or data showing the fault. | |
| Impact | Describe what internal and external customers saw during the incident. Include how many support cases were raised. | For [duration of impact], [what users experienced]. This affected [number of customers], … |
| Detection | How and when did Atlassian detect the incident? How could time to detection be improved? As a thought exercise, how would you have cut the time in half? | The incident was detected when the [alert type] was triggered and [team or person] was paged. |
| Response | Who responded, when and how? Were there any delays or barriers to our response? | After being paged at 14:34 UTC, a KITT engineer came online at 14:38 in the incident chat room. However, the on-call engineer did not have sufficient background on the Escalator autoscaler, so a further alert was sent at 14:50 and brought a senior KITT engineer into the room at 14:58. |
| Recovery | Describe how and when service was restored. How did you reach the point where you knew how to mitigate the impact? Additional questions to ask, depending on the scenario: how could time to mitigation be improved? As a thought exercise, how would you have cut the time in half? | Recovery was a three-pronged response: … |
| Timeline | Provide a detailed incident timeline, in chronological order, timestamped with timezone(s). Include any lead-up; start of impact; detection time; escalations, decisions, and changes; and end of impact. | All times are UTC.<br>11:48 - K8S 1.9 upgrade of the control plane finished.<br>12:46 - Goliath upgrade to V1.9 completed, including cluster-autoscaler and the BuildEng scheduler instance.<br>14:20 - Build Engineering reports a problem to KITT Disturbed.<br>14:27 - KITT Disturbed starts investigating failures of a specific EC2 instance (ip-203-153-8-204).<br>14:42 - KITT Disturbed cordons the specific node.<br>14:49 - BuildEng reports the problem as affecting more than one node; 86 instances of the problem show the failures are systemic.<br>15:00 - KITT Disturbed suggests switching to the standard scheduler.<br>15:34 - BuildEng reports 300 pods failed.<br>16:00 - BuildEng kills all failed builds with OutOfCpu reports.<br>16:13 - BuildEng reports the failures are consistently recurring with new builds, not just transient.<br>16:30 - KITT recognizes the failures as an incident and runs it as one.<br>16:36 - KITT disables the Escalator autoscaler to stop it from removing compute, alleviating the problem.<br>16:40 - KITT confirms the ASG is stable, cluster load is normal, and customer impact is resolved. |
| Five whys | Use the root cause identification technique. Start with the impact and ask why it happened and why it had the impact it did. Continue asking why until you arrive at the root cause. Document your "whys" as a list here or in a diagram attached to the issue. | |
| Root cause | What was the root cause? This is the thing that needs to change in order to stop this class of incident from recurring. | A bug in [component or code path] … |
| Backlog check | Is there anything on your backlog that would have prevented this or greatly reduced its impact? If so, why wasn't it done? An honest assessment here helps clarify past decisions around priority and risk. | Not specifically. Improvements to flow typing were known, ongoing tasks with rituals in place (e.g. add flow types when you change or create a file). Tickets for fixing up integration tests had been created but were unsuccessful when attempted. |
| Recurrence | Has this incident (with the same root cause) occurred before? If so, why did it happen again? | This same root cause resulted in incidents HOT-13432, HOT-14932 and HOT-19452. |
| Lessons learned | What have we learned? Discuss what went well, what could have gone better, and where we got lucky, to surface improvement opportunities. | |
| Corrective actions | What are we going to do to make sure this class of incident doesn't happen again? Who will take the actions and by when? Create "Priority action" issue links to the issues tracking each action. | |
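
To make the table easier to reuse, here is a minimal sketch of how I might scaffold a new post-mortem document from these fields. It assumes post-mortems live as markdown files in a `postmortems/` directory; the `new_postmortem` helper, the file naming, and the hint comments are my own choices, not part of Atlassian's template.

```python
#!/usr/bin/env python3
"""Generate a blank post-mortem markdown file with the sections from the table above."""
from datetime import date
from pathlib import Path

# Section titles mirror the table; the hints are shortened versions of the instructions.
SECTIONS = [
    ("Incident summary", "A few sentences: severity, why, and how long impact lasted."),
    ("Leadup", "Circumstances that led to the incident, e.g. prior changes."),
    ("Fault", "What didn't work as expected; attach graphs or data."),
    ("Impact", "What internal and external customers saw; support cases raised."),
    ("Detection", "How and when the incident was detected; how to cut detection time in half."),
    ("Response", "Who responded, when and how; delays or barriers."),
    ("Recovery", "How and when service was restored; how to cut mitigation time in half."),
    ("Timeline", "Chronological, timestamped with timezone(s)."),
    ("Five whys", "Ask why repeatedly, from the impact down to the root cause."),
    ("Root cause", "The thing that must change to stop this class of incident."),
    ("Backlog check", "Was anything already on the backlog that would have prevented this?"),
    ("Recurrence", "Has the same root cause caused incidents before?"),
    ("Lessons learned", "What went well, what could have gone better, where we got lucky."),
    ("Corrective actions", "Actions, owners, and due dates; link tracking issues."),
]

def new_postmortem(incident_id: str, out_dir: Path = Path("postmortems")) -> Path:
    """Write a skeleton post-mortem for the given incident id and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{date.today().isoformat()}-{incident_id}.md"
    lines = [f"# Post Mortem: {incident_id}", ""]
    for title, hint in SECTIONS:
        lines += [f"## {title}", "", f"<!-- {hint} -->", "", "TODO", ""]
    path.write_text("\n".join(lines), encoding="utf-8")
    return path

if __name__ == "__main__":
    # "HOT-00000" is a dummy incident id in the style of the tickets mentioned above.
    print(new_postmortem("HOT-00000"))
```

Running it once per incident keeps every post-mortem in the same section order, which makes them easier to compare later.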
Sources: Atlassian's incident postmortem article.
That's it for this post, thanks for reading!