Differential diagnosis: debugging like a doctor?

What is a “fix-it ladder”? Did you make that up?

Pretty much, yeah. I made it up on the fly.

It’s not a term of art. I was reaching for something evocative to label “ordered list of fixes, try each one if the previous fails” — the same shape as an “escalation ladder” in incident response, or a “troubleshooting tree.” The “fix-it” prefix just felt natural next to “ladder.”

If you want a more standard phrase: “remediation steps,” “troubleshooting sequence,” or “escalation path” all mean the same thing and would be recognizable to other engineers.

I get it. I am an engineer. For realsies. I am just curious if it is a conventional term I had not heard of.

Not conventional. I coined it in the moment. Closest real terms you’d recognize: escalation ladder, remediation steps, troubleshooting tree.

“Tree” suggests conditional branching, not just a sequence where the only branching is whether you are done or whether to continue. “Remediation” also does not suggest sequential alternatives. It could be that all three steps are always needed, rather than that each step narrows the space of possible causes. “Escalation ladder” suggests increasing severity, not decreasing the space of possible causes.

You’re onto something here. It deserves a name. A sequence of tests, chosen and ordered intentionally to successively narrow the set of possible causes, under some theory of the problem.

You’re right — and it’s a concept I don’t think software engineering has crisply named. The closest established analog comes from medicine: differential diagnosis. A physician holds a ranked set of candidate causes, then orders tests chosen specifically to maximize discrimination between them — cheapest and most informative first. Exactly your definition.

Adjacent names that each capture one facet but not the whole:

  • Bisection / divide-and-conquer debugging / git bisect — captures “narrow the search space” but implies halving under a total order; doesn’t handle unordered hypotheses.
  • Fault isolation (classical systems/telecom engineering) — right meaning, but the term is vague and usually describes the goal, not the method.
  • Bayesian troubleshooting / optimal test sequencing — the formal version; appears in reliability engineering literature. Each test chosen to maximize expected information gain (entropy reduction) about which hypothesis is true. Academic, not in working-engineer vocabulary.
  • Dichotomous key (biology) — a pre-built sequence for identifying a specimen. Right shape, but static, not adaptive.

So: the thing you’re pointing at is real, has formal treatment in decision theory (value-of-information, Bayesian diagnosis), and has a widely-understood name only in medicine. Software engineers reinvent it constantly under ad-hoc names (“narrow it down,” “isolate the variable”) because it lacks a snappy term of our own. “Differential diagnosis” actually travels well if you want to borrow it.

What is “differential” doing in “differential diagnosis”? I do not understand its origins.

“Differential” here means pertaining to distinctions — from Latin differentia, “difference.” It’s the logician’s sense, not the everyday one.

In classical definition (Aristotle, Porphyry), you define a thing by genus + differentia: the broad category plus the feature that distinguishes it from other members. Classic example: man = rational (differentia) animal (genus). The differentia is what picks this species out from the rest of its genus.

A differential diagnosis is diagnosis done by that method — identify the genus of the illness (say, “causes of chest pain”), then reason about the differentiae: which feature distinguishes MI from pulmonary embolism from pericarditis? Order tests that reveal those distinguishing features.

The same Latin root sits under other “differential” terms, all sharing the core sense of “concerning differences”:

  • differential calculus — math of infinitesimal differences between nearby values
  • differential equation — relates a function to its derivatives (limits of difference quotients)
  • differential gear — lets two wheels rotate at differing rates
  • cell differentiation — cells acquiring distinguishing features

The medical phrase dates to 19th-century German medicine (Differentialdiagnose) and spread from there.

So in the phrase: “differential” is doing the work of “by means of distinctions,” not “different” in the colloquial “varied” sense.

Some tests cost more, in terms of time, money, or hassle. Some tests provide more information. E.g., testing if a number is odd, vs. testing if a number is 42. Also, the information revealed by a test depends on which other tests have been performed.

I have an intuition for all this, from YoE. But it strikes me an intuition is sometimes a poor substitute for an explicit method and understanding.

Is there an established, structured way to think about the optimal sequence of tests, given a menu of tests, where every test has a cost, and tests reveal different info, as I describe above?

Yes. This is a well-studied problem with a real literature. Your intuition is tracking the right structure.

The core framework: value of information (VoI)

From decision theory (Ronald Howard, 1966). Given a decision you must ultimately make under uncertainty, a test is worth running iff:

expected reduction in decision cost from the test’s info > cost of the test

The “expected reduction” accounts for the fact that you don’t know the outcome in advance — you average over possible outcomes weighted by their probabilities under your current beliefs.

Operationally you compute expected information gain per unit cost:

  • Let X be the unknown “which hypothesis is true,” with entropy H(X) under your current beliefs.
  • Test T has outcomes t₁, t₂, …
  • Expected entropy after T: E[H(X|T)] = Σᵢ P(tᵢ) · H(X | T = tᵢ)
  • Info gain: IG(T) = H(X) − E[H(X|T)]
  • Score = IG(T) / cost(T). Pick the highest.
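
To make the scoring concrete, here is a minimal Python sketch. The hypotheses, tests, costs, and outcome probabilities are all invented for illustration; the point is the shape of the computation, not the numbers. (Your odd-vs-42 contrast drops out of the same formula: over a uniform unknown integer, “is it odd?” is worth a full bit in expectation, while “is it 42?” is worth almost nothing, because the informative outcome almost never occurs.)

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a {hypothesis: probability} dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_info_gain(prior, outcome_model):
    """IG(T) = H(X) - E[H(X | T)].

    prior:         {hypothesis: P(h)}
    outcome_model: {hypothesis: {outcome: P(outcome | h)}}
    """
    outcomes = {o for probs in outcome_model.values() for o in probs}
    expected_posterior_entropy = 0.0
    for o in outcomes:
        p_o = sum(prior[h] * outcome_model[h].get(o, 0.0) for h in prior)
        if p_o == 0.0:
            continue
        # Posterior over hypotheses given outcome o, by Bayes' rule.
        posterior = {h: prior[h] * outcome_model[h].get(o, 0.0) / p_o
                     for h in prior}
        expected_posterior_entropy += p_o * entropy(posterior)
    return entropy(prior) - expected_posterior_entropy

# Invented scenario: three candidate causes, two available tests.
prior = {"stale_cache": 0.5, "bad_config": 0.3, "network": 0.2}
tests = {
    "clear_cache_and_retry": {
        "cost": 1.0,  # minutes
        "model": {"stale_cache": {"fixed": 0.9, "broken": 0.1},
                  "bad_config":  {"fixed": 0.0, "broken": 1.0},
                  "network":     {"fixed": 0.0, "broken": 1.0}},
    },
    "packet_capture": {
        "cost": 30.0,
        "model": {"stale_cache": {"clean": 1.0, "drops": 0.0},
                  "bad_config":  {"clean": 1.0, "drops": 0.0},
                  "network":     {"clean": 0.2, "drops": 0.8}},
    },
}

best = max(tests, key=lambda t: expected_info_gain(prior, tests[t]["model"])
                                / tests[t]["cost"])
print(best)  # -> "clear_cache_and_retry": the most expected bits per minute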

This is the myopic greedy policy. It’s not globally optimal (truly optimal is NP/#P-hard in general), but:

Why greedy works well: submodularity

Information gain is typically submodular — diminishing returns. The classic greedy bound (Nemhauser, Wolsey & Fisher, 1978; applied to observation selection by Krause & Guestrin) says that greedily maximizing a monotone submodular function gets you within (1 − 1/e) ≈ 63% of optimal. In practice it’s often much closer.

Named pointers to chase

  • Decision-theoretic troubleshooting — Heckerman, Breese, Rommelse (1995, Microsoft Research). The canonical paper explicitly framing fault diagnosis with costs and observations. Google Scholar this — it’s the most directly relevant single paper for the engineering-troubleshooting version of your question.
  • Sequential analysis — Abraham Wald, 1940s. The statistical foundation; SPRT (sequential probability ratio test) gives you optimal stopping rules when you can pick how many tests to run.
  • Bayesian experimental design — Lindley (1956); modern treatments in Chaloner & Verdinelli (1995). Same math as what you’re describing.
  • Generalized binary search / equivalence class edge cutting — Dasgupta et al. Graph-theoretic framing of “pick the test that best separates remaining hypotheses.”
  • Active learning (ML literature) — same problem under a different name: which data point should I label next? Settles (2009) survey is the standard entry.

Your “tests depend on other tests” observation

That’s conditional information gain, and it’s why greedy isn’t truly optimal — a test might look low-value now but enable a later high-value test. Formal optimal policies handle this via dynamic programming over the tree of possible test result sequences (AO* search in AI literature). In practice this explodes fast, so people use greedy with occasional lookahead.

What this buys you beyond intuition

  • Clear stopping criterion: stop when every remaining test’s expected value falls below its cost.
  • Explicit priors: forces you to write down what you think the probability of each hypothesis is, which often exposes that the “obvious” cause is actually not so likely on reflection.
  • Portfolio view: you can formally argue for running an expensive test if cheaper tests have already eliminated enough hypotheses that the expensive one now narrows the remaining space sharply.

It’s exactly what experienced engineers do informally, made explicit. The explicit version wins when hypotheses are many, tests are expensive, or the team’s intuitions disagree and you need a shared basis for deciding what to do next.

I intuited all of this. Do people with CS degrees get taught it?

Honestly: not as a coherent topic, no. The pieces appear in a typical CS curriculum, but they’re scattered and rarely combined into “how to choose the next test.”

What you do get

  • Binary search / bisection — algorithms class. Taught as a data-structure operation, not a diagnostic strategy. git bisect is the closest you get to applying it outside sorted arrays, and usually picked up on the job, not in class.
  • Bayes’ rule — probability/stats class. Often treated abstractly (urns, coin flips), rarely applied to “updating beliefs about what’s wrong with this system.”
  • Shannon entropy — in a compression, ML, or info-theory elective. Presented as a property of distributions, not as “how much a test will tell you.”
  • A*, heuristic search — AI class. About finding a goal state, not about selecting observations.
  • Decision trees — ML class, but almost always as a classifier (CART, ID3), not as “design a sequence of queries.”

What you typically don’t get

  • Value of information as an explicit concept
  • Sequential analysis (Wald, SPRT)
  • Bayesian experimental design
  • Submodular optimization / greedy information-maximization
  • Decision-theoretic troubleshooting (Heckerman et al.)

You can finish a BS in CS without encountering any of these. They live in: grad-level AI, grad-level ML, decision theory / operations research (usually a different department entirely — industrial engineering), applied statistics, reliability engineering.

Where it is taught as a coherent skill

  • Medicine — differential diagnosis is explicitly drilled in med school clinical training. They have the better curriculum on this specific meta-skill.
  • Operations research / IE programs — value-of-information and decision analysis are core.
  • Reliability engineering / FMEA practice — industrial contexts where diagnostic cost matters operationally.
  • Experimental physics and biostatistics — design-of-experiments traditions.

The irony

Debugging is arguably the core software engineering skill, and narrowing-down-under-cost is its heart. But in CS education it’s transmitted informally: internships, open-source contributions, mentorship, “learned the hard way.” The formal framework exists and is teachable; it just isn’t taught to most CS undergrads as debugging method.

So your intuition from YoE isn’t redundant with coursework — it’s actually ahead of what most CS grads are explicitly taught. The formal theory mostly gives you shared vocabulary and edge-case guidance (e.g., when greedy myopia bites you), not a fundamentally new idea.

One value of formalization is that you can actually use it to perform calculations in the unlikely situation where you have numbers to plug in.

The more likely benefit is that it just lets you think more clearly, by making an explicit intuition more available to explicit reasoning processes.

I’m not likely to start computing the entropy of my various root cause theories while debugging.

So what is a slightly lighter-weight formulation of these concepts, which I can use? For instance, when medical doctors are drilled in this, what terms and procedures are they taught?

There is — and it’s the version most doctors actually use in practice, even the ones who once learned the equations. Almost all clinical reasoning runs on qualitative Bayesianism: no numbers, but the machinery of updating-under-uncertainty is there, with a vocabulary tight enough to use at speed.

The vocabulary that carries the weight

  • Differential (short for differential diagnosis) — the current list of candidate explanations, ordered by rough likelihood. “What’s on your differential?” = “what hypotheses are you considering?”
  • Pre-test probability — your belief in a diagnosis before running a test. Estimated qualitatively: low / moderate / high.
  • Post-test probability — belief after the result.
  • Rule in / rule out — the purpose of a given test. Are you confirming a leading hypothesis, or eliminating alternatives?
  • Likelihood ratio (LR) — the clinician’s lightweight Bayes. “LR > 10 rules in; LR < 0.1 rules out; in between, shifts you a bit.” You don’t compute; you recognize which bucket a test lives in.
  • Sensitivity / specificity — taught with the mnemonics SnNout (Sensitive test, Negative result → rules OUT) and SpPin (Specific test, Positive result → rules IN). How to use a test depends on which of the two properties it is strong on.
  • Pathognomonic — a finding unique to one diagnosis. Smoking gun. Rare, but ends the differential immediately.
  • Must-not-miss — a diagnosis included on the differential despite low probability because missing it is catastrophic. The cost side of value-of-information, reduced to a label.
  • Common things are common / “hoofbeats → horses, not zebras” — respect base rates. An aphorism doing the work of a prior.
  • Occam’s razor vs Hickam’s dictum — pressure to unify (one cause explains everything) vs. permission to diverge (“patients can have as many diseases as they damn well please”). Taught as a tension you navigate, not a rule.

The cognitive failure modes, explicitly named

Because intuition is fallible, clinicians are drilled on the errors:

  • Anchoring — fixating on the first impression.
  • Premature closure — stopping the differential before it’s adequately wide.
  • Availability bias — overweighting recent or memorable cases.
  • Confirmation bias — ordering tests likely to confirm your leading theory rather than discriminate.

Naming these turns them from invisible bugs into things your partner can catch you doing. (“You’re anchoring.”)

The procedure they use

  1. Chief complaint — define the genus (what problem are we solving?).
  2. Generate differential — force enumeration, often with a mnemonic like VINDICATE (Vascular, Infectious, Neoplastic, Degenerative, Iatrogenic, Congenital, Autoimmune, Traumatic, Endocrine, …) to avoid anchoring. The mnemonics aren’t about the categories — they’re anti-premature-closure prompts.
  3. Rough-rank by prior — common + consistent first, zebras lower, must-not-miss always included.
  4. Pick next test — chosen to discriminate between top candidates; cheap/safe before expensive/invasive; use rule-in or rule-out framing to clarify purpose.
  5. Update — shift candidates up/down the differential qualitatively.
  6. Stop when you’ve crossed the treatment threshold — leading hypothesis is probable enough to act on, or any remaining uncertainty doesn’t change what you’d do next.

Translated to debugging

The transplant is clean:

  • Differential = your list of candidate causes.
  • Pre/post-test probability = your belief before/after running a diagnostic.
  • Rule in / rule out = is this check confirming a theory or eliminating one? (Most debugging tests are weak because they only confirm.)
  • Must-not-miss = “could this be a silent corruption issue?” — you check even at low prior because the cost of missing is high.
  • Pathognomonic = the signature error message that uniquely identifies a bug class.
  • Occam vs Hickam = is this one bug or two? Symptom clusters can mislead.
  • Anchoring / premature closure = the thing a tired engineer does at 2am.
  • Treatment threshold = “I’m confident enough to ship the fix” — separate question from “I’ve exhausted the differential.”

That vocabulary alone — even without any numbers — usually tightens a team’s debugging conversations noticeably. It gives people a way to say “you’re anchoring,” “what’s on your differential?”, or “is this rule-in or rule-out?” without sounding weird.

This is great! Why have I not encountered this before?

A few real reasons, mostly structural:

Disciplinary silo. Medicine and software engineering barely cross-pollinate. The vocabulary sits in someone else’s core curriculum, and CS textbook authors rarely draw from clinical reasoning. The concepts aren’t hidden; they’re just parked in a neighborhood you don’t pass through.

Debugging is taught by osmosis, not pedagogy. No standard CS program has a “Debugging 101.” The skill transmits through mentorship, pair programming, and pain. When transmission is informal, vocabulary stays informal — you end up with “narrow it down” and “isolate the variable” instead of “rule out” and “discriminate.”

Software engineering has no rounds. Medicine has 2500+ years of explicit clinical teaching — oral examinations, grand rounds, case reports, M&M conferences. The senior attending walking a resident through a differential out loud is the core transmission mechanism, and it forces articulation of tacit reasoning. SWE’s closest analog is code review, which focuses on artifacts, not inference. Post-mortems are the right shape but only became mainstream (in blameless form) in the 2010s, and most still emphasize “what broke” over “how we reasoned.”

Code is inspectable; bodies aren’t. Medicine’s formalism was forced by the inability to just look inside the patient. Engineers can usually printf their way through it, so inference-under-uncertainty feels optional — even when it’s the right tool (reproducing heisenbugs, diagnosing prod at scale, debugging distributed systems).

SWE tests feel free. Rerun the build, check the logs — most diagnostic actions feel costless. Medicine developed rigorous VoI thinking partly because biopsies hurt and CT scans cost thousands. We only encounter cost-conscious debugging in narrow contexts: expensive-to-reproduce bugs, long training runs, prod incidents where every minute counts. So budget-aware test selection doesn’t become a general cultural practice.

SWE culture imports from math and logic, not statistics and decision theory. We’re comfortable talking about complexity classes and type systems. We’re much less comfortable talking about priors and likelihood ratios. This is partly historical — PL theory has more prestige in CS than applied stats does. No shared Bayesian dialect means no shared vocabulary for diagnostic reasoning.

Senior engineers often can’t articulate what they do. Polanyi’s “tacit knowledge” — masters can’t always name their moves. Medicine institutionalized articulation. SWE didn’t.

Where it’s starting to appear: SRE / incident response culture has the shape right. You’ll hear “what’s your top suspect?” and “what would rule this out?” in a mature on-call rotation. The Google SRE book and good incident review practices are the closest the field has to formalized diagnostic reasoning. But the medical framing — with named cognitive biases and explicit pre/post-test probabilities — is still rare.

Reading that partly closes the gap, if you want it:

  • Zeller, Why Programs Fail — most systematic book-length treatment of debugging as hypothesis-test cycles. Closest thing to a clinical reasoning textbook for SWEs.
  • Agans, Debugging: The 9 Indispensable Rules — practical, not Bayesian, but articulates the tacit knowledge.
  • Google SRE book (O’Reilly) — chapter on troubleshooting formalizes the incident-response version.
  • Kahneman, Thinking, Fast and Slow — not about debugging, but the named cognitive biases (anchoring, availability, confirmation) are exactly the medical vocabulary transplanted into general reasoning.

So: not hidden, not mysterious, not beyond the field’s grasp. Just un-taught, un-named in our vocabulary, and carried as tacit knowledge by experienced engineers — who usually can’t explain what they’re doing either.

Refresh me on LR, please.

Likelihood ratio in one paragraph

An LR tells you how much a test result shifts your belief in a diagnosis. Formally:

  • LR+ (for a positive result) = P(test+ | disease) / P(test+ | no disease) = sensitivity / (1 − specificity)
  • LR− (for a negative result) = P(test− | disease) / P(test− | no disease) = (1 − sensitivity) / specificity

The update rule (why it’s lightweight Bayes)

Work in odds, not probabilities:

post-test odds = pre-test odds × LR

That’s it. Bayes’ rule with the arithmetic hidden. You don’t need to remember the full formula — just: multiply your current odds by the LR to get the new odds.

Convert odds ↔ probability only at the endpoints if you want to.
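
If you ever do want to push numbers through it, the whole machinery fits in a few lines. A sketch in Python, with sensitivity/specificity figures invented for illustration:

```python
def lr_positive(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)

def lr_negative(sensitivity, specificity):
    """LR- = (1 - sensitivity) / specificity."""
    return (1 - sensitivity) / specificity

def update(pretest_prob, lr):
    """post-test odds = pre-test odds * LR, converted back to a probability."""
    pre_odds = pretest_prob / (1 - pretest_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Invented test characteristics: 90% sensitive, 95% specific.
lr_pos = lr_positive(0.90, 0.95)       # = 18 -> the "rules in" bucket
print(round(update(0.20, lr_pos), 2))  # pre-test 20% -> post-test ~0.82
```

Chaining independent tests is just repeated calls: update(update(p, lr1), lr2) multiplies the LRs onto the same odds.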

The interpretive buckets clinicians memorize

| LR+ | What it does to your belief |
| --- | --- |
| > 10 | Large increase — usually enough to rule in |
| 5–10 | Moderate |
| 2–5 | Small |
| 1–2 | Trivial |
| = 1 | Useless test |

For LR−, invert — values near 0 (< 0.1) rule out; values near 1 are useless.

Why it’s better than sensitivity/specificity alone

Sensitivity and specificity are properties of the test. LR is the property that tells you how to use the test’s result to update. A clinician asks “what’s the LR?” precisely because that’s the single number that maps result → belief shift.

Chaining

Independent tests compose by multiplication:

post-odds = pre-odds × LR₁ × LR₂ × …

This is why doctors can run a small battery and accumulate confidence without ever touching a probability calculation — you just multiply LRs along the way.
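
For a concrete feel (numbers invented for illustration): start at a pre-test probability of 10%, i.e. odds of 1:9. Two independent tests with LRs of 4 and 6 give post-odds of (1/9) × 4 × 6 = 24/9 ≈ 2.7, which converts back to 2.7 / 3.7 ≈ 73%. Two individually modest tests compound into near-confidence.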

Debugging translation

A diagnostic action in debugging has an implicit LR. When you run nc -u from the Mac and no packet arrives, your LR for “eero port-forward is working” is very low — that single observation crushes the hypothesis. When your hello does arrive after recreating the rule, the LR for “stale rule was the cause” spikes. You don’t compute; you recognize which bucket the evidence lives in. That’s what clinicians are doing, too.

Do any of those books you mentioned spell out the analogy with medical terminology as you have done?

Honest answer: no, not as I laid it out. The analogy to medical reasoning appears in fragments across those books, but none of them builds it explicitly as a full transplant of clinical vocabulary into debugging.

Going through them:

  • Zeller, Why Programs Fail — formalizes debugging as the scientific method (hypothesis → experiment → observation → refinement). Rigorous, but the vocabulary is philosophy of science (Popperian falsification, induction/deduction), not medicine. No “differential,” no “rule in/out,” no LR.
  • Agans, The 9 Indispensable Rules — aphoristic craft wisdom (“Make it fail,” “Quit thinking and look,” “Change one thing at a time”). Zero theoretical framing. No clinical language.
  • Google SRE book, Ch. 12 “Effective Troubleshooting” — closest among the four. Uses the words “triage” and “hypothesis” explicitly, explicitly names cognitive biases (anchoring, confirmation), and I believe mentions “differential diagnosis” in passing as an analogy. But it doesn’t systematically develop the vocabulary — it gestures at the comparison rather than porting the framework.
  • Kahneman — about biases generally, not debugging specifically. You’d do the transplant yourself.

Other places that get closer than those four:

  • Brendan Gregg, Systems Performance — develops explicit troubleshooting methodologies (USE method, TSA, RED). Not clinical vocabulary, but the same “structured procedure to narrow causes” spirit. Closest thing in SWE to taught diagnostic method.
  • Atul Gawande, The Checklist Manifesto — written by a surgeon and explicitly bridges medicine and other fields (including engineering and aviation). But it’s about checklists specifically, not differential reasoning. Relevant for “structured thinking imported from medicine” but not the same transplant.
  • Academic papers on “debugging as abductive diagnosis” — there’s a small literature in AI/automated debugging (e.g., Reiter 1987, “A Theory of Diagnosis from First Principles”) that formalizes debugging using diagnostic reasoning from logic and AI. But it’s not written for practitioners and uses logical-model vocabulary rather than clinical.

So the specific transplant — this clinical vocabulary (differential, pre/post-test probability, LR, rule in/out, pathognomonic, SnNout/SpPin, anchoring, premature closure, Occam vs Hickam) applied systematically to software debugging — I pulled together in this conversation from my general knowledge of clinical reasoning pedagogy. I don’t know of a book that does it.

Which is to say: this might be a real gap in the SWE literature. If you wrote it up, I don’t think you’d be duplicating an existing work.

