Last year I ran an incident postmortem based on Atlassian’s article. To keep the juiciest part of it as my personal copy, I’ve put the table here:

The table’s columns are Field, Instructions, and Example; below, each field name is followed by its instructions and then an example.

Incident summary

Summarize the incident in a few sentences. Include what the severity was, why, and how long the impact lasted.

Between <TIME RANGE OF INCIDENT, e.g. 14:30 and 15:00> on <DATE>, <NUMBER> customers experienced <EVENT SYMPTOMS>. The event was triggered by a deployment at <TIME OF DEPLOYMENT>. The deployment contained a code change for <DESCRIPTION OF THE CHANGE>. The bug in this deployment caused <DESCRIPTION OF THE PROBLEM>.

The event was detected by <MONITORING SYSTEM>. We mitigated the event by <ACTIONS TAKEN TO MITIGATE>.

This <SEVERITY LEVEL> incident affected X% of customers.

<NUMBER OF SUPPORT TICKETS> were raised in relation to this incident.

Leadup

Describe the circumstances that led to this incident, for example, prior changes that introduced latent bugs.

At <TIME> on <DATE> (<AMOUNT OF TIME BEFORE THE INCIDENT>), a change was introduced to <PRODUCT OR SERVICE> in order to <REASON FOR THE CHANGE>. The change caused <EFFECT OF THE CHANGE>.

Fault

Describe what didn't work as expected. Attach screenshots of relevant graphs or data showing the fault.

<ERROR TYPE> responses were incorrectly sent to X% of requests over the course of <TIME PERIOD>.

Impact

Describe what internal and external customers saw during the incident. Include how many support cases were raised.

For <DURATION> between <TIME RANGE> on <DATE>, <DESCRIPTION OF THE INCIDENT> was experienced.

This affected <NUMBER> customers (X% of all <PRODUCT OR SERVICE> customers), who encountered <DESCRIPTION OF SYMPTOMS>.

<NUMBER OF SUPPORT TICKETS> were raised.

Detection

How and when did Atlassian detect the incident?

How could time to detection be improved? As a thought exercise, how would you have cut the time in half?

The incident was detected when the <TYPE OF ALERT> was triggered and <TEAM> were paged. They then had to page <SECOND TEAM> because they didn't own the service writing to the disk, delaying the response by <AMOUNT OF TIME>.

<ALERTING OR MONITORING IMPROVEMENT> will be set up by <TEAM> so that <EXPECTED IMPROVEMENT TO TIME TO DETECTION>.

Response

Who responded, when and how? Were there any delays or barriers to our response?

After being paged at 14:34 UTC, the on-call KITT engineer came online at 14:38 in the incident chat room. However, the on-call engineer did not have sufficient background on the Escalator autoscaler, so a further alert was sent at 14:50 and brought a senior KITT engineer into the room at 14:58.

Recovery

Describe how and when service was restored. How did you reach the point where you knew how to mitigate the impact?

Additional questions to ask, depending on the scenario: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?

Recovery was a three-pronged response (a rough code sketch follows the list):

  • Increasing the size of the BuildEng EC2 ASG to increase the number of nodes available to service the workload and reduce the likelihood of scheduling on oversubscribed nodes

  • Disabling the Escalator autoscaler to prevent the cluster from aggressively scaling down

  • Reverting the Build Engineering scheduler to the previous version.
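
The actual response was run by the on-call engineers (presumably through the AWS console/CLI and kubectl), but purely as a sketch of what those three steps look like in code, here is a rough version using boto3 and the official Kubernetes Python client. Every resource name here (ASG, deployments, namespaces, image tag) is a placeholder I invented, not the real BuildEng setup.

```python
# Hedged sketch of the three recovery steps; all resource names are placeholders.
import boto3
from kubernetes import client, config

# 1. Grow the EC2 Auto Scaling Group so more nodes are available for the workload
#    and scheduling on oversubscribed nodes becomes less likely.
asg = boto3.client("autoscaling", region_name="us-east-1")
asg.set_desired_capacity(
    AutoScalingGroupName="buildeng-workers",  # placeholder ASG name
    DesiredCapacity=30,                       # placeholder target size (must be <= MaxSize)
    HonorCooldown=False,
)

# 2. "Disable" the Escalator autoscaler by scaling its deployment to zero replicas,
#    so it stops aggressively removing nodes while the cluster recovers.
config.load_kube_config()
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="escalator",         # placeholder deployment name
    namespace="kube-system",  # placeholder namespace
    body={"spec": {"replicas": 0}},
)

# 3. Revert the BuildEng scheduler by pointing its deployment back at the previous image
#    (same effect as `kubectl rollout undo deployment/buildeng-scheduler`).
apps.patch_namespaced_deployment(
    name="buildeng-scheduler",  # placeholder deployment name
    namespace="buildeng",       # placeholder namespace
    body={
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "scheduler",  # containers are merged by name in the patch
                            "image": "registry.example.com/buildeng-scheduler:previous-good",
                        }
                    ]
                }
            }
        }
    },
)
```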

Timeline

Provide a detailed incident timeline, in chronological order, timestamped with timezone(s). 

Include any lead-up; start of impact; detection time; escalations, decisions, and changes; and end of impact.

All times are UTC.

11:48 - K8S 1.9 upgrade of control plane finished 
12:46 - Goliath upgrade to V1.9 completed, including cluster-autoscaler and the BuildEng scheduler instance 
14:20 - Build Engineering reports a problem to the KITT Disturbed
14:27 - KITT Disturbed starts investigating failures of a specific EC2 instance (ip-203-153-8-204) 
14:42 - KITT Disturbed cordons the specific node 
14:49 - BuildEng reports the problem as affecting more than just one node. 86 instances of the problem show the failures are more systemic 
15:00 - KITT Disturbed suggests switching to the standard scheduler 
15:34 - BuildEng reports 300 pods failed 
16:00 - BuildEng kills all failed builds with OutOfCpu reports 
16:13 - BuildEng reports the failures are consistently recurring with new builds and are not just transient. 
16:30 - KITT recognizes the failures as an incident and starts running them as one. 
16:36 - KITT disables the Escalator autoscaler to stop it from removing compute, in order to alleviate the problem.
16:40 - KITT confirms the ASG is stable, cluster load is normal, and customer impact is resolved.

Five whys

Use the root cause identification technique.

Start with the impact and ask why it happened and why it had the impact it did. Continue asking why until you arrive at the root cause.

Document your "whys" as a list here or in a diagram attached to the issue.

  1. The service went down because the database was locked

  2. Because there were too many databases writes

  3. Because a change was made to the service and the increase in writes was not expected

  4. Because we don't have a development process set up for when we should load test changes

  5. Because we've never done load testing and are hitting new levels of scale

Root cause

What was the root cause? This is the thing that needs to change in order to stop this class of incident from recurring.

A bug in <SERVICE> connection pool handling led to leaked connections under failure conditions, combined with a lack of visibility into connection state.
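
The example above is the template's generic wording, but the failure mode is a classic one: a connection is acquired and only returned on the happy path, so every error leaks one until the pool is exhausted. A minimal, hypothetical Python illustration of the bug and the fix (not the actual service's code):

```python
# Hypothetical illustration of a connection-pool leak; not the real service code.
from queue import Queue, Empty


class Pool:
    def __init__(self, size: int):
        self._free = Queue()
        for _ in range(size):
            self._free.put(object())  # stand-in for a real connection

    def acquire(self, timeout: float = 1.0):
        try:
            return self._free.get(timeout=timeout)
        except Empty:
            raise RuntimeError("pool exhausted")  # what callers see once enough leaks pile up

    def release(self, conn) -> None:
        self._free.put(conn)


def handle_request_buggy(pool: Pool, do_work) -> None:
    conn = pool.acquire()
    do_work(conn)           # if this raises, the connection is never released
    pool.release(conn)


def handle_request_fixed(pool: Pool, do_work) -> None:
    conn = pool.acquire()
    try:
        do_work(conn)
    finally:
        pool.release(conn)  # always return the connection, even on failure
```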

Backlog check

Is there anything on your backlog that would have prevented this or greatly reduced its impact? If so, why wasn't it done?

An honest assessment here helps clarify past decisions around priority and risk.

Not specifically. Improvements to flow typing were known, ongoing tasks that had rituals in place (e.g. add flow types when you change or create a file). Tickets for fixing up the integration tests had been filed, but attempts at them hadn't been successful.

Recurrence

Has this incident (with the same root cause) occurred before? If so, why did it happen again?

This same root cause resulted in incidents HOT-13432, HOT-14932 and HOT-19452.

Lessons learned

What have we learned?

Discuss what went well, what could have gone better, and where we got lucky, in order to find improvement opportunities.

  1. Need a unit test to verify that the rate-limiter for work has been properly maintained (a rough test sketch follows this list)

  2. Bulk operation workloads which are atypical of normal operation should be reviewed

  3. Bulk operations should start slowly and be monitored, increasing only when service metrics appear nominal
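
On lesson 1, this is roughly the kind of test I'd want in place. It's a hedged sketch against a hypothetical token-bucket RateLimiter; the class and its interface are my invention, not the actual BuildEng rate limiter:

```python
# Hedged sketch: a tiny token-bucket limiter plus the unit tests that would catch
# its removal or regression. Names and interface are assumptions.
import unittest


class RateLimiter:
    """Allows at most `capacity` operations until refill() is called."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tokens = capacity

    def allow(self) -> bool:
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

    def refill(self) -> None:
        self.tokens = self.capacity


class RateLimiterTest(unittest.TestCase):
    def test_rejects_work_beyond_capacity(self):
        limiter = RateLimiter(capacity=5)
        allowed = sum(limiter.allow() for _ in range(20))
        self.assertEqual(allowed, 5)  # only the first 5 jobs are admitted

    def test_refill_restores_capacity(self):
        limiter = RateLimiter(capacity=1)
        self.assertTrue(limiter.allow())
        self.assertFalse(limiter.allow())
        limiter.refill()
        self.assertTrue(limiter.allow())


if __name__ == "__main__":
    unittest.main()
```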

Corrective actions

What are we going to do to make sure this class of incident doesn't happen again? Who will take the actions and by when? 

Create "Priority action" issue links to issues tracking each action. 

  1. Manual auto-scaling rate limit put in place temporarily to limit failures

  2. Unit test and re-introduction of job rate limiting

  3. Introduction of a secondary mechanism to collect distributed rate information across the cluster to guide scaling effects

  4. Large migrations need to be coordinated since AWS ES does not autoscale.

  5. Verify Stride search is still classified as Tier-2

  6. File a ticket against pf-directory-service to partially fail instead of fully failing when the xpsearch-chat-searcher fails.

  7. CloudWatch alert to identify high IO on the Elasticsearch cluster (a sketch of such an alarm follows this list)
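
On item 7, such an alarm could be created with boto3 along these lines. This is a sketch under assumptions: the domain name, account id, threshold, SNS topic, and the choice of WriteIOPS from the AWS/ES metric namespace are placeholders, not values from the incident.

```python
# Hedged sketch of a CloudWatch alarm for high IO on an Elasticsearch domain.
# Domain name, account id, threshold, and SNS topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="es-high-write-iops",
    Namespace="AWS/ES",                  # Amazon Elasticsearch Service metrics
    MetricName="WriteIOPS",
    Dimensions=[
        {"Name": "DomainName", "Value": "search-prod"},  # placeholder domain
        {"Name": "ClientId", "Value": "123456789012"},   # placeholder account id
    ],
    Statistic="Average",
    Period=300,                          # 5-minute datapoints
    EvaluationPeriods=3,                 # alarm after 15 minutes of sustained high IO
    Threshold=1000.0,                    # placeholder IOPS threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:es-io-alerts"],  # placeholder topic
)
```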


That's it for this post, thanks for reading!