AI Production Incident Workflow
Work a live production incident in the right order — triage and stabilize first, then find the cause, then write the summary and postmortem — so the fire is out before the writeup begins.
The problem
An incident is not just a bug: it's a bug with a clock and an audience. Diving straight into the root cause is the wrong first move when customers are affected. Triage and stabilization come first, diagnosis second, the writeup last. Out of order, you get either a hasty fix that makes things worse or a thorough investigation while the outage drags on. This workflow keeps the order honest (contain it, understand it, then communicate it) and leaves you with a postmortem instead of a fading memory.
Recommended workflow
Each step uses an existing NewPrompt tool, pre-filled by a matching resource. Open the resource to read it, or jump straight into the tool with the inputs ready.
- Triage and stabilize
First, scope the blast radius and find the fastest safe mitigation — roll back, flag off, fail over — before chasing the underlying cause. Stop the bleeding, then investigate.
Goal: The incident contained and impact scoped, before deeper diagnosis.
Open this step in Debugging Prompt Generator
Resource: Debug a Production Incident
- Understand the failing path
With the pressure off, get a clear read of the code path that failed, so the diagnosis is grounded in what the system actually does rather than what the dashboards imply.
Goal: A grounded understanding of how the failure happened.
Open this step in Code Explanation Prompt
- Write the incident summary
Capture what happened, when, who was affected, and how it was contained — in a structured form stakeholders can read without a war-room replay.
Goal: A clear incident summary for the people who weren't in the room.
Open this step in Structured Summary Prompt
Resource: Incident Report Summary Prompt
- Produce the postmortem
Turn the summary into a durable postmortem and changelog entry — cause, timeline, fix, and the follow-ups that stop a repeat.
Goal: A postmortem and changelog entry, not a memory that fades by Friday.
Open this step in Markdown Output Builder
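As an illustration of the triage step, choosing the "fastest safe mitigation" can be thought of as filtering out the risky options and taking the quickest of what remains. The sketch below is only that: the `Mitigation` type, the `fastest_safe` helper, and the time estimates are hypothetical and not part of any NewPrompt tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mitigation:
    """One candidate way to stop the bleeding (hypothetical structure)."""
    name: str
    minutes_to_apply: int  # rough estimate made during triage
    safe: bool             # True if applying it cannot make the outage worse


def fastest_safe(options: list[Mitigation]) -> Optional[Mitigation]:
    """Return the quickest option marked safe, or None if nothing qualifies."""
    safe_options = [m for m in options if m.safe]
    return min(safe_options, key=lambda m: m.minutes_to_apply, default=None)


# Made-up candidates mirroring the three mitigations named in the step.
options = [
    Mitigation("roll back the last deploy", minutes_to_apply=10, safe=True),
    Mitigation("turn the feature flag off", minutes_to_apply=2, safe=True),
    Mitigation("fail over to the standby", minutes_to_apply=5, safe=False),
]

choice = fastest_safe(options)
print(choice.name if choice else "no safe mitigation, escalate")
```

Safety is applied as a filter before speed is compared, so an option that is quick but risky never wins over a slower safe one.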
Expected outcome
The incident is contained, its cause is understood, and you walk away with a stakeholder summary and a written postmortem, instead of a fixed symptom and a story that gets fuzzier each retelling. The proper code fix then runs through the AI Debugging Workflow.
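One way to picture the two written outputs is as a single record holding the fields the workflow names (what happened, who was affected, how it was contained, cause, timeline, fix, follow-ups), rendered to Markdown once the incident is over. This is a minimal sketch under that assumption. The `Incident` class, the `render_postmortem` function, and the sample values are invented for illustration and do not reflect the actual output of the Structured Summary Prompt or Markdown Output Builder.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical record of the fields the summary and postmortem need."""
    title: str
    what_happened: str
    affected: str
    containment: str
    cause: str
    fix: str
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (time, event)
    follow_ups: list[str] = field(default_factory=list)


def render_postmortem(inc: Incident) -> str:
    """Render the record as a Markdown postmortem."""
    lines = [
        f"# Postmortem: {inc.title}",
        "",
        "## Summary",
        f"- What happened: {inc.what_happened}",
        f"- Who was affected: {inc.affected}",
        f"- How it was contained: {inc.containment}",
        "",
        "## Cause",
        inc.cause,
        "",
        "## Timeline",
    ]
    lines += [f"- {when}: {event}" for when, event in inc.timeline]
    lines += ["", "## Fix", inc.fix, "", "## Follow-ups"]
    lines += [f"- [ ] {item}" for item in inc.follow_ups]
    return "\n".join(lines)


# Entirely made-up sample incident, for illustration only.
incident = Incident(
    title="Checkout errors",
    what_happened="Checkout requests returned errors after a deploy.",
    affected="Customers attempting to pay",
    containment="Rolled back the deploy",
    cause="A config change pointed checkout at the wrong database.",
    fix="Corrected the config and added a startup check.",
    timeline=[("14:02", "Alert fired"), ("14:09", "Rollback started")],
    follow_ups=["Add a config validation test"],
)

print(render_postmortem(incident))
```

Keeping the timeline as timestamped entries recorded during the incident means the postmortem is assembled from notes taken at the time rather than reconstructed from memory.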
Best for
- Responding to a live, customer-facing outage
- Coordinating an incident across people who weren't all there
- Producing a postmortem after the fire is out
Not for
- Fixing a routine bug with no time pressure — use the AI Debugging Workflow
- A performance tweak with no incident — that's ordinary debugging or refactoring
FAQ
How is this different from the AI Debugging Workflow?
Debugging is for fixing a bug properly when you have time. This is for a live incident: the first job is to stabilize and communicate under pressure, and the output is a contained incident plus a postmortem. The deep, tested fix afterward is the AI Debugging Workflow's job.
Why write the postmortem inside the workflow?
Because the details are accurate only right after the incident, and they decay fast. Capturing the summary and postmortem while the timeline is fresh is what turns an outage into something the team actually learns from.
Does this replace the actual fix?
No. It contains and documents the incident. The durable, behavior-tested fix runs through the AI Debugging Workflow once the fire is out.