Self-inflicted DoS¶
Definition¶
A self-inflicted denial of service is an outage where the saturating traffic originates from an internal, trusted client — not from an external attacker — and is typically valid, well-formed, and syntactically indistinguishable from legitimate traffic. The client's intent is benign; the outage is produced by a mismatch between query cost (CPU / memory / I/O per request) and the monitoring regime, which is almost always gated on request volume rather than cost.
The signature is:
- A trusted service inside the perimeter starts issuing queries at a rate that is "nothing" compared to normal inbound traffic (20–100 req/s against a cluster handling thousands).
- The queries are syntactically valid and semantically legitimate — a rate-limiter or WAF can't distinguish them from normal traffic.
- The queries' per-request cost is pathologically high (a high-cardinality aggregation, a scatter-gather with no selective filter, a recursive computation, an unbounded LIKE '%...%' scan).
- The coordinator / thread-pool / buffer-pool fills up; tail latency spikes; normal traffic is starved.
- Classic infrastructure dashboards (overall QPS, overall error rate, CPU, memory) say "the cluster is just busy" — they do not indicate a specific caller.
The canonical wiki anchor is sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search:
"It was later discovered that the root cause of the issue was a self-inflicted Denial of Service (DoS) attack. As a result of a maintenance workload coupled with a bug in the processing logic of the application, the internal client application was sending a small, but sufficient number of parallel overwhelming faceting queries to the Elasticsearch cluster."
Why the volume × cost mismatch is the diagnostic pivot¶
Rate limiters and volume-based alerts project all traffic onto a single scalar — requests per second. But the load a cluster actually bears is cost × count, and the cost dimension is wildly variable:
| Query shape | Relative cost | Typical volume-based alarm behaviour |
|---|---|---|
| Primary-key point lookup | 1× | Caught by any alert (high volume tolerated) |
| Indexed range scan (selective) | ~10× | Caught if tail latency instrumented |
| Full-text bool query (cached) | ~10–100× | Typically fine |
| Faceting aggregation on a ~1M-cardinality field (e.g. brand, size) | ~100× | Cached at coordinator, tolerable |
| Faceting aggregation on a ~100M-cardinality field (e.g. SKU) | ~10,000×+ | Invisible to volume alerts because only a few per second fit before the cluster saturates |
At the top end of this table, a caller at 20–100 req/s is "1–3% of normal cluster inbound" by volume, yet can simultaneously consume 100% of the coordinator CPU budget. Volume-altitude monitoring will not register the caller because it is dwarfed by millions of legitimate cheap queries.
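To make the mismatch concrete, here is a minimal arithmetic sketch using the table's illustrative multipliers (the specific numbers are assumptions for illustration, not measurements from the incident):

```python
# Illustrative numbers only: 10,000 cheap point lookups/s at 1x unit cost
# vs. one internal caller sending 50 req/s of ~10,000x-cost faceting queries.
cheap_qps, cheap_cost = 10_000, 1
facet_qps, facet_cost = 50, 10_000

volume_share = facet_qps / (facet_qps + cheap_qps)
cost_share = (facet_qps * facet_cost) / (facet_qps * facet_cost + cheap_qps * cheap_cost)

print(f"volume share: {volume_share:.1%}")  # ~0.5% -- invisible to a QPS alert
print(f"cost share:   {cost_share:.1%}")    # ~98.0% -- effectively the whole CPU budget
```

Any alarm keyed on the first number stays green while the second number takes the cluster down.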
Load-bearing preconditions for this failure mode¶
- A shared backend serving many internal callers (search cluster, database, coordinator service) where queries are generic and caller identity is not load-bearing in the authorisation model.
- No per-query cost model at the client boundary — the client library or gateway accepts arbitrary queries without flagging high-cardinality aggregations, unbounded scans, or missing selective predicates (a minimal sketch of such a check follows this list).
- No per-client slow-query attribution. The slow-query log exists but does not record which caller sent the slow query, so recurrent pathological callers are invisible. Zalando's explicit remediation was propagating an X-Opaque-Id header at the Elasticsearch request boundary.
- An internal caller that is both trusted and automated — the bug path must be triggerable without a human in the loop, because humans correcting themselves in real time don't produce multi-hour incidents.
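A minimal sketch of the missing client-boundary cost model. Everything here is an assumption for illustration — the field-cardinality map, the threshold, and the placement in the query-builder layer; the Zalando post names the gap but not an implementation.

```python
# Hypothetical pre-flight cost check in the query-builder layer, run before
# a query body is sent to the shared cluster. Cardinalities are placeholders.
HIGH_CARDINALITY_FIELDS = {"sku": 100_000_000}  # field -> approx. distinct values
MAX_AGG_CARDINALITY = 1_000_000                 # illustrative threshold

def check_query_cost(body: dict) -> list[str]:
    """Return warnings for pathologically expensive Elasticsearch query shapes."""
    warnings = []
    # Faceting (terms) aggregations on fields known to be enormous.
    for name, agg in body.get("aggs", {}).items():
        field = agg.get("terms", {}).get("field")
        if HIGH_CARDINALITY_FIELDS.get(field, 0) > MAX_AGG_CARDINALITY:
            warnings.append(f"aggregation '{name}' facets on high-cardinality field '{field}'")
    # Leading-wildcard scans, the LIKE '%...%' analogue.
    for field, pattern in body.get("query", {}).get("wildcard", {}).items():
        value = pattern["value"] if isinstance(pattern, dict) else pattern
        if value.startswith("*"):
            warnings.append(f"unbounded leading-wildcard scan on '{field}'")
    return warnings

# A query that is syntactically valid, semantically legitimate, and pathological:
print(check_query_cost({"aggs": {"by_sku": {"terms": {"field": "sku"}}}}))
# -> ["aggregation 'by_sku' facets on high-cardinality field 'sku'"]
```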
Why horses-not-zebras debugging misses this¶
When cluster CPU spikes, the playbook-ordered hypotheses are:
- Recent deploy regression on the cluster → Zalando: no recent deploys.
- Write load spike → Zalando: write load normal.
- External traffic spike → Zalando: inbound QPS normal.
- Infrastructure fault (node failure, AZ degradation) → Zalando: other clusters fine.
- Misconfiguration / GC pause / JVM issue → Zalando: no.
All five are horses — the common causes. Self-inflicted DoS is the zebra: rarer, not in the first-line playbook, and not explained by the metrics the first-line playbook reads. It is the canonical zebra lesson.
Remediation levers¶
| Lever | Canonical example | Where it lives |
|---|---|---|
| Per-client cost attribution | concepts/x-opaque-id-client-attribution + patterns/per-client-slow-query-dashboard | Client HTTP header + slow-query-log pipeline |
| Application-side query limits | patterns/application-side-query-limit-with-dynamic-threshold | Query-builder layer (before hitting the shared backend) |
| Cluster-wide aggregation guardrails | search.max_buckets / patterns/cluster-wide-aggregation-guardrail | Elasticsearch cluster setting |
| Per-client workload isolation | Per-tier thread pools / patterns/tier-tagged-query-isolation / patterns/route-tagged-query-isolation | Coordinator-layer scheduling |
| Market / cell split to contain blast radius | patterns/split-cluster-by-market-for-load-isolation / concepts/market-group-country-isolation | Cluster topology |
| Trace-altitude per-caller anomaly detection | systems/lightstep notebook-exploration workflow | APM / tracing backend |
The Zalando post specifically names "rate limiting based on the type of the client traffic. Not all clients should be equal" as the follow-up direction.
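Two of these levers are directly expressible against Elasticsearch itself. A minimal sketch assuming the elasticsearch-py 8.x client; the caller name, index, fields, and threshold value are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Per-client cost attribution: the X-Opaque-Id header set here is echoed into
# the slow log and the tasks API, so slow queries become attributable to a caller.
es.options(opaque_id="catalog-maintenance-job").search(
    index="products",
    size=0,
    aggs={"by_brand": {"terms": {"field": "brand"}}},
)

# Cluster-wide aggregation guardrail: search.max_buckets caps how many
# aggregation buckets a single request may create (value is illustrative).
es.cluster.put_settings(persistent={"search.max_buckets": 10_000})
```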
Seen in¶
- sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance. An internal Zalando application, triggered by an automated maintenance workload plus a processing-logic bug, sent 20–100 req/s of high-cardinality terms aggregations on the SKU field to an Elasticsearch cluster handling thousands of req/s of normal traffic. The cluster starved, coordinator CPU pinned, and filters broke across two of the largest markets; mitigated by a 5-lever app-side load-shed plus a structural market split via a node-allocation-based cluster split. Root cause identified via a Lightstep trace-exploration notebook that spotted the caller running at 50× its normal volume.
Contrast with external DoS¶
| Dimension | External DoS | Self-inflicted DoS |
|---|---|---|
| Source | Hostile / unknown | Trusted internal service |
| Query validity | Often malformed / exploratory | Syntactically valid + semantically legitimate |
| Volume | Usually high | Often very low vs baseline |
| Per-query cost | Variable | Pathologically high |
| Defence | WAF / rate limit / blocklist | App-side cost limit + per-caller attribution |
| Detection | Volume-based alarms | Per-caller cost-weighted alarms |
External DoS defences (WAFs, IP rate limits, anomaly-detection on request headers) do not catch self-inflicted DoS because the attacker is inside the authenticated perimeter and their requests look normal.
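The last row of the table is the detection side. A minimal sketch of a per-caller cost-weighted alarm, assuming slow-log or trace events have already been parsed into (caller, took_ms) pairs; the thresholds are illustrative:

```python
from collections import defaultdict

COST_SHARE_ALERT = 0.50      # one caller owns >50% of total query time...
VOLUME_SHARE_CEILING = 0.05  # ...while sending <5% of requests: the signature

def pathological_callers(events: list[tuple[str, float]]) -> list[str]:
    """events: (caller id from X-Opaque-Id, took_ms) per completed query."""
    count, cost = defaultdict(int), defaultdict(float)
    for caller, took_ms in events:
        count[caller] += 1
        cost[caller] += took_ms
    total_count, total_cost = sum(count.values()), sum(cost.values())
    if not total_cost:
        return []
    return [
        c for c in cost
        if cost[c] / total_cost > COST_SHARE_ALERT
        and count[c] / total_count < VOLUME_SHARE_CEILING
    ]
```

A volume-based alarm projects onto count alone; this one alerts on the dimension the cluster actually saturates on.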
Related¶
- concepts/high-cardinality-aggregation-overload — the specific per-query mechanism at Elasticsearch
- concepts/x-opaque-id-client-attribution — the observability primitive that closes the attribution gap
- concepts/zebra-not-horse-heuristic — the debugging mental model that surfaces this failure mode
- concepts/blast-radius — what structural isolation bounds
- concepts/tail-latency-spike-during-queueing — the user-visible symptom
- concepts/load-shedding-at-ingestion — the generalised parent concept
- patterns/application-side-query-limit-with-dynamic-threshold
- patterns/per-client-slow-query-dashboard
- patterns/cluster-wide-aggregation-guardrail
- patterns/split-cluster-by-market-for-load-isolation
- systems/elasticsearch