
AI Production Incident Workflow

Work a live production incident in the right order — triage and stabilize first, then find the cause, then write the summary and postmortem — so the fire is out before the writeup begins.

The problem

An incident is not a bug — it's a bug with a clock and an audience. Diving straight into the root cause is the wrong first move when customers are affected: triage and stabilization come first, diagnosis second, the writeup last. Out of order, you get either a hasty fix that makes things worse or a thorough investigation while the outage drags on. This workflow keeps the order honest — contain it, understand it, then communicate it — and leaves you with a postmortem instead of a fading memory.

Recommended workflow

Each step uses an existing NewPrompt tool, pre-filled by a matching resource. Open the resource to read it, or jump straight into the tool with the inputs ready.

  1. Triage and stabilize

    First, scope the blast radius and find the fastest safe mitigation — roll back, flag off, fail over — before chasing the underlying cause. Stop the bleeding, then investigate.

    Goal: The incident contained and impact scoped, before deeper diagnosis.

    Open this step in Debugging Prompt Generator
  2. Understand the failing path

    With the pressure off, get a clear read of the code path that failed, so the diagnosis is grounded in what the system actually does rather than what the dashboards imply.

    Goal: A grounded understanding of how the failure happened.

    Open this step in Code Explanation Prompt
  3. Write the incident summary

    Capture what happened, when, who was affected, and how it was contained — in a structured form stakeholders can read without a war-room replay.

    Goal: A clear incident summary for the people who weren't in the room.

    Open this step in Structured Summary Prompt
  4. Produce the postmortem

    Turn the summary into a durable postmortem and changelog entry — cause, timeline, fix, and the follow-ups that stop a repeat.

    Goal: A postmortem and changelog entry, not a memory that fades by Friday.

    Open this step in Markdown Output Builder
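
Step 1's rule, "find the fastest safe mitigation", can be sketched as a small selection function. This is only an illustration: the `Mitigation` type, its fields, and the definition of "safe" as reversible and previously exercised are assumptions made for the example, not part of any NewPrompt tool, and the time estimates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    name: str
    minutes_to_apply: int  # rough responder estimate (hypothetical values below)
    reversible: bool       # can it be undone if it makes things worse?
    tested: bool           # has this path been exercised before?


def is_safe(m: Mitigation) -> bool:
    # Assumption for this sketch: "safe" means reversible and previously exercised.
    return m.reversible and m.tested


def fastest_safe(options):
    """Return the quickest mitigation that passes is_safe, or None if none does."""
    safe = [m for m in options if is_safe(m)]
    return min(safe, key=lambda m: m.minutes_to_apply) if safe else None


options = [
    Mitigation("hotfix forward", 5, reversible=False, tested=False),
    Mitigation("flag off", 2, reversible=True, tested=True),
    Mitigation("roll back", 10, reversible=True, tested=True),
    Mitigation("fail over", 15, reversible=True, tested=True),
]

choice = fastest_safe(options)
print(choice.name if choice else "no safe mitigation; escalate")
```

The point of the filter-then-minimize shape is that speed is only compared among options already judged safe, so an untested hotfix is excluded even when it looks quick.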

Expected outcome

The incident is contained, its cause is understood, and you walk away with a stakeholder summary and a written postmortem, instead of a fixed symptom and a story that gets fuzzier each retelling. The proper code fix then runs through the AI Debugging Workflow.
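
One way to keep the summary and postmortem from drifting apart is to log timeline entries into a single record during the incident and render the document from it afterward. The sketch below assumes a hypothetical `Incident` record and a `render_postmortem` helper; the section names follow step 4 (cause, timeline, fix, follow-ups), and the sample incident is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    title: str
    impact: str
    mitigation: str
    cause: str = "Under investigation"
    timeline: list = field(default_factory=list)    # (timestamp, note) pairs
    follow_ups: list = field(default_factory=list)

    def log(self, note, at=None):
        """Record a timeline entry; defaults to the current UTC time."""
        self.timeline.append((at or datetime.now(timezone.utc), note))


def render_postmortem(inc):
    """Render the record as a Markdown postmortem skeleton."""
    lines = [
        f"# Postmortem: {inc.title}",
        "", "## Impact", inc.impact,
        "", "## Cause", inc.cause,
        "", "## Fix", inc.mitigation,
        "", "## Timeline (UTC)",
    ]
    # Sort by timestamp so entries logged out of order still read chronologically.
    for at, note in sorted(inc.timeline, key=lambda entry: entry[0]):
        lines.append(f"- {at:%Y-%m-%d %H:%M} {note}")
    lines += ["", "## Follow-ups"]
    lines += [f"- [ ] {item}" for item in inc.follow_ups]
    return "\n".join(lines)


# Made-up example data, with fixed timestamps so the output is repeatable.
inc = Incident(
    title="Checkout errors",
    impact="Some customers could not complete checkout.",
    mitigation="Rolled back the latest release.",
)
inc.log("Rollback complete", at=datetime(2024, 1, 1, 10, 20, tzinfo=timezone.utc))
inc.log("Alert fired", at=datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc))
inc.follow_ups.append("Add a canary stage")

report = render_postmortem(inc)
print(report)
```

Calling `inc.log(...)` as events happen means the timeline is captured while it is fresh; the cause can stay "Under investigation" until the diagnosis is done.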

Best for

  • Responding to a live, customer-facing outage
  • Coordinating an incident across people who weren't all there
  • Producing a postmortem after the fire is out

Not for

  • Fixing a routine bug with no time pressure — use the AI Debugging Workflow
  • A performance tweak with no incident — that's ordinary debugging or refactoring

FAQ

How is this different from the AI Debugging Workflow?

Debugging is for fixing a bug properly when you have time. This is for a live incident: the first job is to stabilize and communicate under pressure, and the output is a contained incident plus a postmortem. The deep, tested fix afterward is the debugging workflow's job.

Why write the postmortem inside the workflow?

Because the details are most accurate right after the incident and fade quickly. Capturing the summary and postmortem while the timeline is fresh is what turns an outage into something the team actually learns from.

Does this replace the actual fix?

No. It contains and documents the incident. The durable, behavior-tested fix runs through the AI Debugging Workflow once the fire is out.

Tip: Each step's resource opens its tool pre-filled — start at step one and carry the output forward.
