Debug a Production Incident

Stabilize, then diagnose: impact first, rollback options before intervention, timeline from the deploy history — forensic discipline under fire.

Open in Debugging Prompt Generator

Overview

Production incidents punish improvisation twice: once in the outage, and once in the confused retelling. This setup runs the incident strategy in forensic mode against an ongoing checkout outage. Impact assessment and containment are explicitly ordered before root cause, rollback options are reviewed before any intervention, the timeline is established from deployment history and the symptom window, and monitoring gaps are recorded as findings. Recovery is verified in monitoring, not inferred from the absence of complaints. The incident checklist covers the usual suspects: deploys, infrastructure changes, and dependency outages.
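To make the timeline step concrete, here is a minimal sketch of listing the changes that landed shortly before the symptom start. It assumes deployment history is available as timestamped records. The `Change` record, the 60-minute lookback, the date, and the sample entries are all hypothetical; only the 14:05 UTC symptom start comes from the template.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Change:
    """One entry from deployment or infrastructure history (hypothetical shape)."""
    at: datetime
    kind: str          # e.g. "deploy", "config", "dns"
    description: str


def changes_before_symptoms(changes, symptom_start, lookback=timedelta(minutes=60)):
    """Return changes inside [symptom_start - lookback, symptom_start], newest first."""
    window_start = symptom_start - lookback
    in_window = [c for c in changes if window_start <= c.at <= symptom_start]
    return sorted(in_window, key=lambda c: c.at, reverse=True)


# Illustrative data only: the date and the change entries are invented.
symptom_start = datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc)
history = [
    Change(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc), "deploy", "catalog service release"),
    Change(datetime(2024, 1, 1, 13, 50, tzinfo=timezone.utc), "config", "payments client timeout change"),
    Change(datetime(2024, 1, 1, 14, 1, tzinfo=timezone.utc), "deploy", "checkout service release"),
]

suspects = changes_before_symptoms(history, symptom_start)
for change in suspects:
    minutes = int((symptom_start - change.at).total_seconds() // 60)
    print(f"{change.at:%H:%M} UTC  {change.kind:<7} {change.description}  ({minutes} min before symptoms)")
```

A change landing inside the window is only a candidate. Under the evidence rules it stays a hypothesis until a validation step confirms or eliminates it.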

How to use this resource

Impact before cause

Who is affected, how badly, and is it ongoing? The contract refuses to theorize before the blast radius is known.

Check rollback first

The fastest fix is often reverting, so the contract requires knowing the rollback options before trying anything cleverer.

Verify recovery in monitoring

Error rates back at baseline on the dashboards; silence from customers is not a recovery signal.

Why This Works

Stabilize-first ordering matches how incidents are actually survived
Deploy-window checklist finds the cause in the place it usually is
Monitoring-verified recovery prevents the second outage announcement

Best for

On-call engineers facing an active incident
Teams without a formal incident command process
Outages where "what changed?" has six answers

Not for

The blameless post-mortem afterward — that's the Root Cause Analysis setup, run at leisure
Pre-deploy gating — that's the Code Review Prompt Generator's production-readiness review

Use cases

Working an ongoing outage with structure instead of panic
Ordering containment before diagnosis explicitly
Building the timeline the post-mortem will need

FAQ

Why does this incident prompt force impact assessment before root cause instead of diagnosing first?

The SYSTEM CONTEXT orders it deliberately: 'Assess customer impact first' and 'Review rollback options before any intervention' both precede diagnosis, on the logic that 'a controlled degradation beats an uncontrolled outage.' So the prompt scopes severity and reversibility of the checkout outage before theorizing about the payments-gateway timeouts. You still decide whether to roll back or hold.

How should I label a claim like 'the payments pool exhaustion caused the failures' in the output?

Under the EVIDENCE RULES it is a HYPOTHESIS: a testable causal claim that ships with its test. The log line '14:06:02 WARN connection pool exhausted pool=payments size=50' is a FACT, but 'the pool caused the 40% failures' stays a HYPOTHESIS until validated. The generated prompt enforces these labels; the assistant you run it in fills them in.
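The labeling rule can be sketched as a small data structure. This is an illustration under stated assumptions, not part of the generated prompt: the `Label` and `Statement` names are hypothetical, and the example test wording is invented. It shows a FACT refusing to exist without cited evidence and a HYPOTHESIS refusing to exist without a test.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Label(Enum):
    FACT = "FACT"
    ASSUMPTION = "ASSUMPTION"
    HYPOTHESIS = "HYPOTHESIS"


@dataclass(frozen=True)
class Statement:
    label: Label
    text: str
    evidence: Optional[str] = None  # quoted log line or other source; required for a FACT
    test: Optional[str] = None      # how to confirm or eliminate; required for a HYPOTHESIS

    def __post_init__(self):
        if self.label is Label.FACT and not self.evidence:
            raise ValueError("a FACT must cite its evidence")
        if self.label is Label.HYPOTHESIS and not self.test:
            raise ValueError("a HYPOTHESIS must ship with its test")


# The log line is quoted from the template's KNOWN FACTS.
pool_exhausted = Statement(
    Label.FACT,
    "The payments connection pool was exhausted at 14:06:02.",
    evidence="14:06:02 WARN connection pool exhausted pool=payments size=50",
)

# The causal claim stays a HYPOTHESIS; the test wording here is a hypothetical example.
pool_caused_failures = Statement(
    Label.HYPOTHESIS,
    "Payments pool exhaustion is causing the 40% checkout failures.",
    test="compare pool saturation against the checkout error rate over the incident window",
)
```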

When the checkout error rate drops back down, can I call the incident resolved?

Not from customer silence alone. The POST-FIX VERIFICATION section wants error rates back at baseline (under 1%) on the dashboards, checks for regressions in the paths the fix touched, and asks you to state an observation window before declaring resolution. The prompt structures that check; your monitoring confirms recovery, since NewPrompt doesn't watch your dashboards.
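As a rough sketch of what "state an observation window" can mean in practice, the check below declares recovery only when every sample in the window is under baseline. The 1% baseline comes from the template; the `recovered` helper, the 15-sample window, and the sample values are hypothetical, and pulling real error rates out of a monitoring system is left to whatever tooling is already in place.

```python
def recovered(error_rates, baseline=0.01, window=15):
    """Return True only if the last `window` samples all sit under `baseline`.

    `error_rates` is a chronological list of error-rate samples (0.40 means 40%).
    Too few samples counts as "not yet recovered": missing data is not evidence.
    """
    if len(error_rates) < window:
        return False
    return all(rate < baseline for rate in error_rates[-window:])


# Invented samples: the outage, the fix taking effect, then ten quiet samples.
samples = [0.40] * 5 + [0.12, 0.03] + [0.004] * 10

too_early = recovered(samples)                 # window still includes outage samples
after_wait = recovered(samples + [0.003] * 5)  # a full window under baseline
print(too_early, after_wait)
```

The window length is a judgment call; as the template notes, intermittent problems need longer windows.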

Customize This Resource

Opens this setup in Debugging Prompt Generator. Generate to get the full investigation contract — then adjust the problem type, mode, and environment.

Prompt Template

Copy it as-is, or use Open in Debugging Prompt Generator to load it pre-filled and customize it with your own context.

DEBUGGING OBJECTIVE
Investigate the ongoing checkout outage and get to a safe recovery.
Problem type: production incident — stabilize first; impact and containment come before root cause.
Determine what is actually happening. Containment or rollback may proceed first; do not jump to a permanent fix before identifying the most likely root cause.

SYSTEM CONTEXT
Environment: PRODUCTION — investigate with production discipline:
- Assess customer impact first: scope, severity, and whether it is ongoing.
- Review rollback options before any intervention — the fastest fix may be reverting.
- Review the deployment history for the symptom window.
- Review observability: dashboards, alerts, and logs for the affected window — note what monitoring missed.
- After any fix, verify recovery in monitoring — not just the absence of complaints.
Reported severity: Critical.
Investigation strategy for this problem type:
- Assess impact and contain before diagnosing — a controlled degradation beats an uncontrolled outage.
- Establish the timeline: what changed, when symptoms started, how they spread.
- Plan recovery and prevention as part of the investigation, not as afterthoughts.

KNOWN FACTS
Symptoms: Checkout error rate jumped from 0.2% to 40% at 14:05 UTC and is still elevated.
Expected behavior: Checkout error rate under 1%.
Actual behavior: 40% of checkout attempts fail with gateway timeouts.
Reproduction steps: none provided (a fact — see the framework below).
Logs / errors:
```
14:05:12 ERROR gateway timeout upstream=payments latency=30000ms
14:05:13 ERROR gateway timeout upstream=payments latency=30000ms
14:06:02 WARN connection pool exhausted pool=payments size=50
```

INVESTIGATION FRAMEWORK
Structure the investigation in exactly these stages:
1. SYMPTOMS — restate the observable symptoms, separating observations from interpretations.
2. KNOWN FACTS — list what is directly evidenced, each fact with its source.
3. INFORMATION NEEDED TO REPRODUCE — no reproduction steps were provided: list exactly what is needed to reproduce this problem before diagnosing further. Do not pretend a reproduction exists.
4. POSSIBLE CAUSES — candidate causes, drawn from the checklist below and from the evidence.
5. ROOT CAUSE ANALYSIS — the most likely cause and the alternatives, per the root cause rules.
6. EVIDENCE REVIEW — which evidence is solid, which is weak, which is missing, and what contradicts what.
7. MISSING INFORMATION — what you need that you do not have. Ask for it; do not guess it.
8. VALIDATION PLAN — the tests or observations that will confirm or eliminate the leading cause.
9. FIX RECOMMENDATION — per the fix requirements below.
10. POST-FIX VERIFICATION — per the verification requirements below.

DEBUGGING CHECKLIST
Items to check for this problem type (candidate causes, plus impact, rollback, and monitoring context):
1. Deployments or releases in the incident window
2. Infrastructure changes: scaling, configuration, certificates, DNS
3. Customer impact scope: who is affected, how many, how badly
4. Rollback options and the risks of each
5. Monitoring gaps that delayed detection
6. Upstream or downstream dependency outages

ROOT CAUSE RULES
- Identify the MOST LIKELY root cause — and at least two ALTERNATIVE causes.
- For each candidate cause: list the evidence supporting it AND the evidence contradicting it.
- Define a validation step for each candidate: what observation would confirm or eliminate it.
- The symptom site and the cause site may be different places — finding where it hurts is not finding why.

EVIDENCE RULES
Label every statement as exactly one of:
- FACT: directly observed. Example: "The application returned HTTP 500."
- ASSUMPTION: believed without direct evidence. Example: "The database is probably down." Assumptions must be verified or discarded — never silently promoted to facts.
- HYPOTHESIS: a testable causal claim. Example: "A failed database connection may be causing the 500 response." Every hypothesis comes with its test.
- Every conclusion must cite the specific evidence that supports it — quote the log line, the stack frame, the reproduction result.
- Unknowns remain unknown: write "insufficient evidence" rather than a plausible guess.
- No unsupported certainty: every claim carries its confidence level and its evidence.
- Treat the absence of expected evidence as a finding in itself.
- Assumptions are forbidden in conclusions — they may appear only in MISSING INFORMATION, as questions to resolve.

VALIDATION REQUIREMENTS
- Before proposing any fix, state how the diagnosis was validated — or what validation is still pending.
- The gold standard: a reproduction that turns the failure on and off by toggling the suspected cause.
- If the root cause cannot be validated with the available information, say so and list what is needed.

FIX REQUIREMENTS
- Any fix proposal must include: the likely cause, the supporting evidence, the proposed change, and how to verify it worked.
- Frame it as "here is why this is likely the fix" — never an unexplained patch.
- Prefer the smallest fix that addresses the root cause; label symptom-only fixes as temporary mitigations.

POST-FIX VERIFICATION
- Define what to observe after the fix: the original reproduction passing, error rates at baseline, the affected metric recovered.
- Check for regressions in the paths the fix touched.
- State how long to observe before declaring the problem resolved — intermittent problems need longer windows.
