CONCEPT Cited by 1 source
Zebra-not-horse heuristic
Definition
The zebra-not-horse heuristic is an oncall investigator's bias check for situations where all the common causes have been ruled out. It inverts the clinical-diagnostics aphorism "When you hear hoofbeats, think horses, not zebras" (the default teaching that common explanations should be tested before rare ones) to name the complementary failure mode: an investigator who keeps cycling through the common hypotheses never finds the actual cause when that cause is rare.
The wiki-anchoring framing, from Zalando's Search & Browse team:
"Why wasn't this detected earlier? Because we were looking for a horse. You know that old saying about the horse and the zebra? When you hear hoofbeats, think of horses, not zebras. Because horses are common, and zebras are rare. But in our case, it happened to be a zebra. We were looking for common causes of Elasticsearch performance issues: high read load, high write load, misconfigurations, infrastructure issues. We were not expecting a self-inflicted DoS attack from an internal application. So keep in mind: sometimes, when you hear hoofbeats, it might just be a zebra."
— Source: sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search
When to invoke the zebra check
The heuristic is not "consider rare causes first." It is: when the common hypotheses have been eliminated by evidence and the symptom persists, stop re-testing the common hypotheses and actively enumerate rare ones.
Signs an investigation has entered zebra territory:
- The first-line playbook is exhausted. Recent deploys: none. Traffic spike: no. Infrastructure fault: no. Write-path change: no. GC pause: no.
- The symptom is persistent or recurrent, not transient.
- The common-cause hypotheses don't explain the available evidence — not just "haven't been proven," but actively contradicted by what metrics / logs show.
- The team is cycling — re-running the same top-5 checks, re-reading the same dashboards, not producing new evidence.
- Cross-dimensional anomaly is present — e.g. one cluster affected, peer clusters fine; one market group affected, others healthy. This tells you the common cause (which would affect everyone) is unlikely.
Zalando's five-bullet zebra catalog
The 2025-12-16 post-mortem publishes the five reasons the zebra hid — this is a reusable template for writing up the gap after a zebra has been found:
(1) The queries were legitimate in terms of syntax and structure.
(2) The service sending them was an internal application, legit, and not new.
(3) The load was very low in terms of volume — 20–100 req/s vs baseline thousands.
(4) Slow queries were monitored but not analysed in depth.
(5) Slow queries had no identifier linking them back to the calling service.
Each bullet identifies a specific sensor-or-attribution gap that made the zebra undetectable. The retrospective's prescriptive value is that each gap becomes a follow-up engineering item — the zebra-gap catalog becomes the backlog.
Contrast with "horses, not zebras"
The original aphorism (attributed to Theodore Woodward, ~1940s Maryland medical school) is defensive against the "fascinating rare diagnosis" trap — residents over-weighting zebras at the cost of missing the common diagnosis sitting in front of them. The zebra-not-horse heuristic does not contradict this — it applies at a different phase of the investigation:
| Phase | Heuristic | What it warns against |
|---|---|---|
| Initial hypothesis generation | Horses first | Don't skip common causes to show off rare-disease knowledge |
| Post-playbook-exhaustion | Zebras too | Don't re-test eliminated common causes; start enumerating rare ones |
Mature oncall culture internalises both — horses first, zebras when the horses are ruled out.
Load-bearing tools for zebra hunts
Once the investigator accepts the zebra framing, the tools that help are the ones that cut the problem on a non-obvious axis:
- Trace-altitude exploration (rather than metric-altitude) — Zalando's root-cause pivot came from a Lightstep notebook that spotted a per-caller fan-out anomaly invisible in cluster-aggregate dashboards.
- Per-client / per-caller cross-sections — see patterns/per-client-slow-query-dashboard and patterns/actor-tagged-query-observability. Aggregate metrics hide zebras; per-caller cross-sections surface them.
- Reading the code (patterns/read-the-code-for-partial-failure-bugs) — when metrics don't explain the failure, the source code of the caller (or the server) often does.
- Compare-healthy-vs-sick partitions — when one cluster is affected and peers are fine, the difference between them contains the answer. Zalando's market-split mitigation simultaneously served as a diagnostic wedge: splitting the markets proved the problem was localised to one market's traffic, not the infrastructure.
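A per-caller cross-section of the kind described above, as an added example (not code from Zalando's post-mortem or from this wiki's pattern pages). It assumes every slow-log record carries a caller tag; the caller names, latencies and thresholds are constructed for illustration.

```python
from collections import defaultdict

# Constructed sample: (caller, took_ms) per query, as a slow log would record
# them if each query carried a caller tag (an assumption: the gap in item 5
# of the catalog is precisely that this identifier was missing).
SAMPLE = (
    [("catalog-api", 40)] * 900
    + [("search-frontend", 55)] * 600
    + [("internal-batch", 4200)] * 30  # low volume, very expensive
)

def aggregate_mean_ms(records):
    """Cluster-wide mean latency: the metric-altitude view."""
    return sum(ms for _, ms in records) / len(records)

def per_caller(records, slow_ms=1000):
    """Count, total time and slow-query count for each caller."""
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0, "slow": 0})
    for caller, took_ms in records:
        s = stats[caller]
        s["count"] += 1
        s["total_ms"] += took_ms
        s["slow"] += int(took_ms >= slow_ms)
    return dict(stats)

def zebra_candidates(records, max_req_share=0.05, min_time_share=0.30):
    """Callers with a small share of requests but a large share of query time."""
    stats = per_caller(records)
    total_n = sum(s["count"] for s in stats.values())
    total_ms = sum(s["total_ms"] for s in stats.values())
    hits = []
    for caller, s in stats.items():
        req_share = s["count"] / total_n
        time_share = s["total_ms"] / total_ms
        if req_share <= max_req_share and time_share >= min_time_share:
            hits.append((caller, round(req_share, 3), round(time_share, 3)))
    return sorted(hits, key=lambda h: -h[2])

if __name__ == "__main__":
    print(f"aggregate mean: {aggregate_mean_ms(SAMPLE):.0f} ms")
    for caller, req, time in zebra_candidates(SAMPLE):
        print(f"{caller}: {req:.1%} of requests, {time:.1%} of query time")
```

Ranking by share of query time rather than request count is what surfaces the low-volume caller; a request-rate dashboard would rank it last.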
Seen in
- sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance. Zalando's Search & Browse team closes their post-mortem with the explicit zebra framing as the core operational lesson. The incident is the worked example: five first-line hypotheses eliminated, hours of cluster thrash before the team accepted they were looking at a zebra, and a Lightstep notebook — a trace-altitude tool used outside the metric-altitude playbook — produced the identification.
Related
- concepts/self-inflicted-dos — the specific zebra this heuristic pattern-matched in the canonical anchor
- concepts/postmortem-as-data-goldmine — the broader Zalando institutional-memory ethos the zebra framing lives inside
- concepts/surface-attribution-error — a distinct LLM-debugging pitfall about plausible-but-wrong attributions
- patterns/read-the-code-for-partial-failure-bugs — a companion zebra-hunt tool