We Replaced Our Incident Runbook With an AI Postmortem — Here's What We Learned

Last October, our Stripe webhook processor silently backed up. Not a hard failure — the queue was accepting events, the API was responding, all our status checks were green. But webhook delivery had stalled. For three hours, subscription lifecycle events were piling up unprocessed: renewals, cancellations, trial expirations.

We caught it via a support ticket. A customer noticed their account hadn't upgraded after a successful payment. By the time we triaged, we were looking at 847 queued webhooks, roughly $14,000 in affected renewals, and a growing thread in Slack with six engineers speculating about what had happened.

We found the root cause in 22 minutes and had a fix deployed 22 minutes after that. We spent the next four hours writing the postmortem.

The postmortem problem

Incident postmortems are genuinely valuable. They force structured reflection, surface systemic issues, and create an organizational memory that prevents repeat failures. The problem is that writing them is painful enough that teams cut corners, delay them, or skip them entirely.

The pain is almost entirely in timeline reconstruction. You're pulling from Slack threads, scattered log queries, PagerDuty timelines, and the memory of engineers who were half-asleep during an on-call page at 2am. The actual analysis — what failed, why, what to fix — takes 20 minutes. The data gathering takes four hours.

So we built automated postmortems into Apixor. Here's what the output actually looks like for a real incident.

What an AI-generated postmortem looks like

AI Postmortem — Stripe Webhook Backlog · June 9, 2026

INCIDENT SUMMARY
Stripe webhook delivery stalled for 3h 12m due to a consumer
process deadlock introduced in deploy d4f8a9c. 847 events queued.
Estimated revenue impact: $14,200 in delayed renewal processing.

TIMELINE
09:14  Deploy d4f8a9c pushed to production (webhook consumer)
09:17  Webhook queue depth begins increasing (anomaly threshold: +3σ)
09:19  Apixor alert: Stripe webhook delivery rate dropped 94%
09:19  Alert routed to #incidents Slack channel
12:31  Customer support ticket opened (first external signal)
12:34  Engineer pages on-call; investigation begins
12:56  Root cause identified: consumer lock contention on Redis key
13:18  Fix deployed, queue drain begins
14:02  Queue fully processed, all affected renewals resolved

ROOT CAUSE
A race condition in the new idempotency key implementation caused
consumer workers to deadlock when processing concurrent renewal
events for the same customer. The lock was never released, causing
all subsequent consumers to block indefinitely.

IMPACT
- 847 webhook events delayed (0 permanently lost)
- 312 subscription renewals delayed by 3h+
- 14 trial expirations not processed on schedule
- Estimated revenue at risk: $14,200 (all recovered post-drain)

ACTION ITEMS
[ ] Add lock timeout with exponential backoff to idempotency handler
[ ] Add webhook queue depth to existing anomaly detection dashboard
[ ] Create runbook for consumer deadlock recovery
[ ] Review similar patterns in SendGrid and HubSpot consumers

What it got right

The timeline reconstruction was the biggest win. Apixor had the full event sequence automatically because it was continuously monitoring the integration. The alert fired at 09:19; we just didn't act on it until the support ticket came in at 12:31. That data was already captured, timestamped, and correlated with the deployment event.
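To make the idea of a correlated timeline concrete, here is a minimal sketch of merging events from separate sources into one ordered sequence. It is not Apixor's implementation; the `Event` type, the `build_timeline` and `render` helpers, and the source labels are hypothetical names chosen for illustration, and the sample entries are copied from the timeline above.

```python
from dataclasses import dataclass
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str  # hypothetical label for where the event came from
    text: str


def build_timeline(*sources):
    """Merge per-source event lists (each already sorted by time) into one list."""
    return list(merge(*sources, key=lambda e: e.at))


def render(timeline):
    """Format events the way the postmortem timeline prints them."""
    return "\n".join(f"{e.at:%H:%M}  {e.text}" for e in timeline)


def t(hhmm):
    # Time of day only; strptime fills in a default date, which is fine
    # for a sketch that stays within a single day.
    return datetime.strptime(hhmm, "%H:%M")


deploys = [Event(t("09:14"), "deploy", "Deploy d4f8a9c pushed to production (webhook consumer)")]
alerts = [
    Event(t("09:17"), "monitor", "Webhook queue depth begins increasing"),
    Event(t("09:19"), "monitor", "Alert: Stripe webhook delivery rate dropped 94%"),
]
support = [Event(t("12:31"), "support", "Customer support ticket opened")]

timeline = build_timeline(deploys, alerts, support)
print(render(timeline))
```

The hard part in practice is not the merge but having every source emit timestamped events in the first place, which is exactly what goes missing when the timeline is reconstructed by hand from Slack and memory.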

The impact calculation was also accurate. Because we had revenue mapping configured for Stripe, the system knew which webhook events corresponded to subscription lifecycle events and could estimate revenue at risk from queue depth and event type distribution.
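As a rough illustration of estimating revenue at risk from the mix of queued event types, the sketch below multiplies event counts by an average value per type. The event counts come from the postmortem above, but the per-event dollar values, the type labels, and the `revenue_at_risk` function are hypothetical, so the result is not expected to reproduce the $14,200 figure.

```python
from collections import Counter

# Hypothetical average revenue tied to one event of each type, in dollars.
# Real values would come from whatever revenue mapping is configured.
AVG_VALUE_USD = {
    "renewal": 40.0,
    "trial_expiration": 25.0,
}


def revenue_at_risk(queued_event_types, avg_value_usd):
    """Sum count * average value over the queued events; unmapped types count as $0."""
    counts = Counter(queued_event_types)
    return sum(n * avg_value_usd.get(kind, 0.0) for kind, n in counts.items())


# 847 queued events: 312 renewals, 14 trial expirations, and 521 others.
queued = ["renewal"] * 312 + ["trial_expiration"] * 14 + ["other"] * 521

estimate = revenue_at_risk(queued, AVG_VALUE_USD)
print(f"Estimated revenue at risk: ${estimate:,.0f}")
```

An estimate like this is only as good as the per-type averages behind it, which is why the revenue mapping has to be configured before the incident rather than during it.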

The action items needed some editing, but they were surprisingly good starting points. Catching that the same idempotency pattern might be in other consumers — that was genuinely useful.
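The first action item, a lock timeout with exponential backoff, might look something like the sketch below. This is not our actual fix: `ExpiringLockStore` is a hypothetical in-memory stand-in for the shared lock store (in production that role is played by Redis), and `acquire_with_backoff`, `FakeClock`, and every parameter name are assumptions made for illustration. The sketch is single-threaded and omits jitter.

```python
import time


class ExpiringLockStore:
    """Hypothetical in-memory lock store where every lock carries an expiry."""

    def __init__(self, clock=time.monotonic):
        self._locks = {}  # key -> (owner, expires_at)
        self._clock = clock

    def try_acquire(self, key, owner, ttl):
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return False  # someone else holds an unexpired lock
        self._locks[key] = (owner, now + ttl)
        return True

    def release(self, key, owner):
        held = self._locks.get(key)
        if held is not None and held[0] == owner:
            del self._locks[key]


def acquire_with_backoff(store, key, owner, ttl=30.0, attempts=5,
                         base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Try to take the lock, doubling the wait after each failed attempt.

    Returns False instead of blocking forever when all attempts fail.
    """
    delay = base_delay
    for attempt in range(attempts):
        if store.try_acquire(key, owner, ttl):
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
store = ExpiringLockStore(clock)

# worker-a takes the lock and then "crashes" without releasing it.
store.try_acquire("customer:123", "worker-a", ttl=1.0)

# worker-b backs off until the abandoned lock expires, then proceeds.
got_lock = acquire_with_backoff(store, "customer:123", "worker-b", ttl=1.0,
                                base_delay=0.25, sleep=clock.sleep)
print(got_lock, clock.now)
```

The two properties that matter are that an abandoned lock eventually expires and that a waiting consumer gives up after a bounded number of attempts, so one stuck worker cannot block every consumer behind it indefinitely.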

What still needs a human

Root cause attribution. The AI correctly identified the symptom (lock contention), but it had no access to the architectural context: why we chose that idempotency approach, what the original Redis key design was trying to solve, and what the tradeoff was. That is institutional knowledge that lives in Slack threads from six months ago and in an engineer's head.

The AI's root cause section was accurate but thin. A human engineer added two paragraphs about the original design intent that made the postmortem genuinely useful for the next engineer who runs into a similar issue. That part can't be automated yet.

The postmortem that used to take four hours now takes 25 minutes. Most of that 25 minutes is one engineer reading the AI draft, editing the root cause section, and approving the action items. The timeline, impact, and sequence are already done.

The workflow now

1. Anomaly detected → plain-English Slack summary within 2 minutes of threshold breach ("Stripe error rate 18x above baseline, estimated $4k/hr revenue impact")
2. On-call engineer gets full incident context in the Slack message, with no need to open a dashboard first
3. When the incident resolves, AI postmortem draft is posted to #incidents automatically
4. On-call engineer spends ~20 minutes on root cause and edits, approves the draft
5. Postmortem is archived; action items are tracked
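The first step of this workflow, a threshold breach turned into a plain-English summary, can be sketched roughly as follows. The 3σ rule mirrors the anomaly threshold shown in the postmortem timeline, but the function names, the baseline samples, and the message template are assumptions for illustration rather than Apixor's actual detection logic.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, sigmas=3.0):
    """True when `current` is more than `sigmas` standard deviations above the baseline mean."""
    return current > mean(baseline) + sigmas * stdev(baseline)


def slack_summary(integration, metric, current, baseline, impact_per_hour_usd):
    """Build a one-line summary in the style of the example message above."""
    ratio = current / mean(baseline)
    return (f"{integration} {metric} {ratio:.0f}x above baseline, "
            f"estimated ${impact_per_hour_usd / 1000:.0f}k/hr revenue impact")


# Hypothetical baseline error-rate samples (percent) and a current reading.
baseline = [1.0, 1.0, 1.25, 0.75, 1.0]
current = 18.0

if is_anomalous(baseline, current):
    message = slack_summary("Stripe", "error rate", current, baseline, 4000)
    print(message)
```

A one-sided check is used here because only increases in error rate or queue depth are interesting; a real detector would also need to handle seasonality and a baseline with too few samples.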

The thing nobody tells you about postmortems

The value of a postmortem isn't the document. It's the forcing function — the structured moment where your team has to articulate what failed and why. That part still requires human judgment. What AI can do is eliminate all the busywork that makes teams avoid or delay that moment.

If your team is skipping postmortems because they take too long, the problem isn't discipline, it's tooling. The four-hour timeline reconstruction was a tax on learning. Removing it took our postmortem rate from "whenever we get around to it" to 100% of incidents.

That consistency is worth more than any single postmortem document.