Skip to content
a@o:~$

cat /blog/forensic-method.md

· 8 min · forensics · incident-response · methodology · debugging

The forensic method: six phases for production incidents

Production debugging fails when it runs on intuition: the loudest hypothesis wins and the evidence evaporates. This is the six-phase method I use instead — data before theories, and no hypothesis survives without surviving an attempt to kill it.

Most production debugging is conducted like a séance. Someone senior names a suspect in the first five minutes, everyone starts looking for evidence that confirms it, and the artifacts that could have proven them wrong get destroyed by the first well-intentioned restart. When the incident closes, nothing is written down, so the same séance reconvenes three months later.

After enough of those, I formalized what I actually do when it works. Six phases, two non-negotiable principles. I have used this on enterprise production incidents, on personal projects, and on a build failure caused by a single capital letter. The method is the same at every scale; only the stakes change.

the two principles

  • Data first.No theories until the evidence is collected. The order matters because theories contaminate collection: once you believe it's the cache, you only look at the cache.
  • Refutation required. You do not gather support for a hypothesis; you design the experiment that would kill it. A hypothesis is only allowed to survive by surviving. Confirmation feels faster and is how teams burn a day on the wrong suspect.

the six phases

phase 01

preserve the scene

what will stop being true in an hour?

Before anything else, capture what rots: exact error text, timestamps, log windows, process state, versions deployed, recent changes, who is affected and since when. Restarting a service is destroying a crime scene; sometimes you must (mitigation outranks diagnosis when users are bleeding), but do it knowingly and screenshot first. The instinct this phase fights: the reflexive restart that "fixes" it and guarantees a rerun next week.

exit criteria: volatile evidence captured; nothing destructive has run
phase 02

define the symptom

what exactly is failing, for whom, since when?

Convert "it's broken" into a statement precise enough to be wrong: expected X, observed Y, scope Z, first seen T. Half of all incidents change shape in this phase — the report says "the API is down" and the symptom turns out to be "one endpoint times out for accounts with more than 10k rows." If you cannot state when it last worked, finding that out becomes the first task.

exit criteria: a falsifiable one-sentence symptom statement
phase 03

collect before theorizing

what does the system say happened?

Now the data: logs, query plans, traces, diffs of everything that changed near T-zero (deploys, config, data, dependencies, infrastructure). Build a timeline where every entry cites its source. Facts only — "response time tripled at 14:02" qualifies, "the new release broke it" does not, because that is a theory wearing a fact's clothes.

exit criteria: a timeline of facts, each with a source
phase 04

the hypothesis ledger

what could explain this, and what would kill each candidate?

Enumerate every explanation consistent with the timeline, including the unflattering ones. For each, design the cheapest experiment that would disprove it, then run them cheapest-first. The pristine-baseline test belongs here: reproduce on a clean environment before suspecting your own code. Record kills in the ledger — a refuted hypothesis is paid-for knowledge, and the ledger is what stops the team from re-investigating it at 2am.

exit criteria: every open hypothesis has survived a designed refutation
phase 05

convict the cause

can you switch the failure on and off?

The surviving hypothesis still has to earn a conviction: reintroduce the cause and the failure must return; remove it and the failure must vanish. Where you can, get a second witness — a different tool observing the same fault tells you which parts of the story are real. In the capital-letter incident, webpack was the second witness that made Turbopack's vague invariant confess. If bidirectional proof is impossible (it happens: unreproducible data races, third-party black boxes), say so in the report instead of upgrading suspicion to fact.

exit criteria: cause demonstrated in both directions, or explicitly downgraded to 'best supported'
phase 06

intervene and write the autopsy

what prevents the rerun?

The fix should be as small as the conviction allows — large "while we're in here" fixes smuggle new suspects into the next incident. Add the guard that makes the regression loud (test, alert, invariant, lint rule), then write the autopsy: symptom, timeline, ledger with kills, conviction evidence, fix, guard. Twenty minutes of writing converts the incident from tribal memory into infrastructure. It is also, not incidentally, how this blog gets its material.

exit criteria: minimal fix shipped, regression guard in place, report written

why the order is the method

Nothing in the six phases is exotic; every senior engineer does each piece sometimes. The discipline is the sequence. Evidence before theories (1–3) keeps collection honest. Refutation before conviction (4–5) keeps the loudest voice in the room from deciding the outcome. Writing before closing (6) keeps the organization from paying for the same lesson twice.

Under pressure you do not rise to your insight; you fall back on your procedure. Make the procedure one that cannot fool itself.