Post Mortem Quick Guide
Last year I ran an incident post mortem based on Atlassian's article. To keep the juiciest part of it as my personal copy, I've put the table here:
| Field | Instructions | Example |
|---|---|---|
| Incident summary | Summarize the incident in a few sentences. Include what the severity was, why, and how long the impact lasted. | Between [time range of incident, e.g. 14:30 and 15:00] on [date], [what customers experienced]. The event was detected by [monitoring system]. This [severity level] incident affected [scope of customers]. |
| Leadup | Describe the circumstances that led to this incident, for example, prior changes that introduced latent bugs. | At [time] on [date], … |
| Fault | Describe what didn't work as expected. Attach screenshots of relevant graphs or data showing the fault. | |
| Impact | Describe what internal and external customers saw during the incident. Include how many support cases were raised. | For [duration of impact], [what users experienced]. This affected [number of customers], … |
| Detection | How and when did Atlassian detect the incident? How could time to detection be improved? As a thought exercise, how would you have cut the time in half? | The incident was detected when the [alert type] was triggered and [team or person] was paged. |
| Response | Who responded, when and how? Were there any delays or barriers to our response? | After being paged at 14:34 UTC, a KITT engineer came online at 14:38 in the incident chat room. However, the on-call engineer did not have sufficient background on the Escalator autoscaler, so a further alert was sent at 14:50 and brought a senior KITT engineer into the room at 14:58. |
| Recovery | Describe how and when service was restored. How did you reach the point where you knew how to mitigate the impact? Additional questions to ask, depending on the scenario: how could time to mitigation be improved? As a thought exercise, how would you have cut the time in half? | Recovery was a three-pronged response: … |
| Timeline | Provide a detailed incident timeline, in chronological order, timestamped with timezone(s). Include any lead-up; start of impact; detection time; escalations, decisions, and changes; and end of impact. | All times are UTC.<br>11:48 - K8S 1.9 upgrade of the control plane finished.<br>12:46 - Goliath upgrade to V1.9 completed, including cluster-autoscaler and the BuildEng scheduler instance.<br>14:20 - Build Engineering reports a problem to KITT Disturbed.<br>14:27 - KITT Disturbed starts investigating failures of a specific EC2 instance (ip-203-153-8-204).<br>14:42 - KITT Disturbed cordons the specific node.<br>14:49 - BuildEng reports the problem as affecting more than one node; 86 instances of the problem show the failures are systemic.<br>15:00 - KITT Disturbed suggests switching to the standard scheduler.<br>15:34 - BuildEng reports 300 pods failed.<br>16:00 - BuildEng kills all failed builds with OutOfCpu reports.<br>16:13 - BuildEng reports the failures are consistently recurring with new builds, not just transient.<br>16:30 - KITT recognizes the failures as an incident and runs it as one.<br>16:36 - KITT disables the Escalator autoscaler to stop it from removing compute, alleviating the problem.<br>16:40 - KITT confirms the ASG is stable, cluster load is normal, and customer impact is resolved. |
| Five whys | Use the root cause identification technique. Start with the impact and ask why it happened and why it had the impact it did. Continue asking why until you arrive at the root cause. Document your "whys" as a list here or in a diagram attached to the issue. | |
| Root cause | What was the root cause? This is the thing that needs to change in order to stop this class of incident from recurring. | A bug in [component or code path] … |
| Backlog check | Is there anything on your backlog that would have prevented this or greatly reduced its impact? If so, why wasn't it done? An honest assessment here helps clarify past decisions around priority and risk. | Not specifically. Improvements to flow typing were known, ongoing tasks with rituals in place (e.g. add flow types when you change or create a file). Tickets for fixing up integration tests had been created but were unsuccessful when attempted. |
| Recurrence | Has this incident (with the same root cause) occurred before? If so, why did it happen again? | This same root cause resulted in incidents HOT-13432, HOT-14932 and HOT-19452. |
| Lessons learned | What have we learned? Discuss what went well, what could have gone better, and where we got lucky, to surface improvement opportunities. | |
| Corrective actions | What are we going to do to make sure this class of incident doesn't happen again? Who will take the actions and by when? Create "Priority action" issue links to the issues tracking each action. | |
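
To make the table easier to reuse, here is a minimal sketch of how I might scaffold a new post-mortem document from these fields. It assumes post-mortems live as markdown files in a `postmortems/` directory; the `new_postmortem` helper, the file naming, and the hint comments are my own choices, not part of Atlassian's template.

```python
#!/usr/bin/env python3
"""Generate a blank post-mortem markdown file with the sections from the table above."""
from datetime import date
from pathlib import Path

# Section titles mirror the table; the hints are shortened versions of the instructions.
SECTIONS = [
    ("Incident summary", "A few sentences: severity, why, and how long impact lasted."),
    ("Leadup", "Circumstances that led to the incident, e.g. prior changes."),
    ("Fault", "What didn't work as expected; attach graphs or data."),
    ("Impact", "What internal and external customers saw; support cases raised."),
    ("Detection", "How and when the incident was detected; how to cut detection time in half."),
    ("Response", "Who responded, when and how; delays or barriers."),
    ("Recovery", "How and when service was restored; how to cut mitigation time in half."),
    ("Timeline", "Chronological, timestamped with timezone(s)."),
    ("Five whys", "Ask why repeatedly, from the impact down to the root cause."),
    ("Root cause", "The thing that must change to stop this class of incident."),
    ("Backlog check", "Was anything already on the backlog that would have prevented this?"),
    ("Recurrence", "Has the same root cause caused incidents before?"),
    ("Lessons learned", "What went well, what could have gone better, where we got lucky."),
    ("Corrective actions", "Actions, owners, and due dates; link tracking issues."),
]

def new_postmortem(incident_id: str, out_dir: Path = Path("postmortems")) -> Path:
    """Write a skeleton post-mortem for the given incident id and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{date.today().isoformat()}-{incident_id}.md"
    lines = [f"# Post Mortem: {incident_id}", ""]
    for title, hint in SECTIONS:
        lines += [f"## {title}", "", f"<!-- {hint} -->", "", "TODO", ""]
    path.write_text("\n".join(lines), encoding="utf-8")
    return path

if __name__ == "__main__":
    # "HOT-00000" is a dummy incident id in the style of the tickets mentioned above.
    print(new_postmortem("HOT-00000"))
```

Running it once per incident keeps every post-mortem in the same section order, which makes them easier to compare later.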
Sources: Atlassian's incident postmortem article.
That's it for this post, thanks for reading!