CONCEPT

Self-inflicted DoS

Definition

A self-inflicted denial of service is an outage where the saturating traffic originates from an internal, trusted client — not from an external attacker — and is typically valid, well-formed, and syntactically indistinguishable from legitimate traffic. The client's intent is benign; the outage is produced by a mismatch between query cost (CPU / memory / I/O per request) and the monitoring regime, which is almost always gated on request volume rather than cost.

The signature is:

  1. A trusted service inside the perimeter starts issuing queries at a rate that is "nothing" compared to normal inbound traffic (20–100 req/s against a cluster handling thousands).
  2. The queries are syntactically valid and semantically legitimate — a rate-limiter or WAF can't distinguish them from normal traffic.
  3. The queries' per-request cost is pathologically high (a high-cardinality aggregation, a scatter-gather with no selective filter, a recursive computation, an unbounded LIKE '%...%' scan).
  4. The coordinator / thread-pool / buffer-pool fills up; tail latency spikes; normal traffic is starved.
  5. Classic infrastructure dashboards (overall QPS, overall error rate, CPU, memory) say "the cluster is just busy" — they do not indicate a specific caller.

The canonical wiki anchor is sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search:

"It was later discovered that the root cause of the issue was a self-inflicted Denial of Service (DoS) attack. As a result of a maintenance workload coupled with a bug in the processing logic of the application, the internal client application was sending a small, but sufficient number of parallel overwhelming faceting queries to the Elasticsearch cluster."

Why the volume × cost mismatch is the diagnostic pivot

Rate limiters and volume-based alerts project all traffic onto a single scalar — requests per second. But the load a cluster actually bears is cost × count, and the cost dimension is wildly variable:

| Query shape | Relative cost | Typical volume-based alarm behaviour |
|---|---|---|
| Primary-key point lookup | 1× (baseline) | Caught by any alert (high volume tolerated) |
| Indexed range scan (selective) | ~10× | Caught if tail latency instrumented |
| Full-text bool query (cached) | ~10–100× | Typically fine |
| Faceting aggregation on a cardinality-~M field (e.g. brand, size) | ~100× | Cached at coordinator, tolerable |
| Faceting aggregation on a cardinality-~100M field (e.g. SKU) | ~10,000×+ | Invisible to volume alerts, because only a few per second fit before the cluster saturates |

At the top end of this table, a caller at 20–100 req/s accounts for only 1–3% of normal cluster inbound by volume, yet can consume 100% of the coordinator CPU budget. Volume-altitude monitoring does not register the caller because it is dwarfed by millions of cheap, legitimate queries.
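
A back-of-the-envelope sketch makes the mismatch concrete; the callers, rates, and cost weights below are illustrative, chosen to match the top and bottom rows of the table:

```python
# Hypothetical traffic mix illustrating why volume alerts miss the expensive caller.
# "cost" is relative per-request work (CPU / memory / I/O), normalised to a point lookup = 1.
traffic = [
    {"caller": "storefront",      "rps": 3000, "cost": 1},       # cheap cached queries
    {"caller": "autocomplete",    "rps": 1500, "cost": 10},      # selective range scans
    {"caller": "maintenance-job", "rps": 50,   "cost": 10_000},  # SKU-level faceting aggregation
]

total_rps  = sum(t["rps"] for t in traffic)
total_work = sum(t["rps"] * t["cost"] for t in traffic)

for t in traffic:
    share_of_volume = t["rps"] / total_rps
    share_of_work   = t["rps"] * t["cost"] / total_work
    print(f"{t['caller']:<16} {share_of_volume:6.1%} of volume   {share_of_work:6.1%} of work")

# maintenance-job ends up at ~1% of volume but ~96% of work: invisible to a QPS
# alarm, dominant in the cost-weighted view.
```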

Load-bearing preconditions for this failure mode

  1. A shared backend serving many internal callers (search cluster, database, coordinator service) where queries are generic and caller identity is not load-bearing in the authorisation model.
  2. No per-query cost model at the client boundary — the client library or gateway accepts arbitrary queries without flagging high-cardinality aggregations, unbounded scans, or missing selective predicates.
  3. No per-client slow-query attribution. The slow-query log exists but does not record which caller sent the slow query, so recurrent pathological callers are invisible. Zalando's explicit remediation was propagating an X-Opaque-Id header at the Elasticsearch request boundary (see the sketch after this list).
  4. An internal caller that is both trusted and automated — the bug path must be triggerable without a human in the loop, because humans correcting themselves in real time don't produce multi-hour incidents.
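
A minimal sketch of that attribution, assuming elasticsearch-py 8.x (the endpoint, index, and caller name are illustrative): Elasticsearch copies the X-Opaque-Id request header into its slow logs and task listings, so each slow-query entry identifies its sender.

```python
from elasticsearch import Elasticsearch

# One client per calling service, tagging every request with the caller's identity.
# Elasticsearch propagates the X-Opaque-Id header into its slow logs and task APIs,
# so a per-client slow-query dashboard can group entries by caller.
es = Elasticsearch(
    "https://search-cluster.internal:9200",              # illustrative endpoint
    headers={"X-Opaque-Id": "catalog-maintenance-job"},  # caller identity
)

# The tag rides along with every request issued through this client instance.
resp = es.search(
    index="products",                                    # illustrative index
    query={"match": {"brand": "acme"}},
    size=10,
)
print(resp["hits"]["total"])
```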

Why horses-not-zebras debugging misses this

When cluster CPU spikes, the playbook-ordered hypotheses are:

  1. Recent deploy regression on the cluster → Zalando: no recent deploys.
  2. Write load spike → Zalando: write load normal.
  3. External traffic spike → Zalando: inbound QPS normal.
  4. Infrastructure fault (node failure, AZ degradation) → Zalando: other clusters fine.
  5. Misconfiguration / GC pause / JVM issue → Zalando: no.

All five are horses — the common causes. Self-inflicted DoS is the zebra: rarer, not in the first-line playbook, not explained by the metrics the first-line playbook reads. Canonical zebra lesson.

Remediation levers

| Lever | Canonical example | Where it lives |
|---|---|---|
| Per-client cost attribution | concepts/x-opaque-id-client-attribution + patterns/per-client-slow-query-dashboard | Client HTTP header + slow-query-log pipeline |
| Application-side query limits | patterns/application-side-query-limit-with-dynamic-threshold | Query-builder layer (before hitting the shared backend) |
| Cluster-wide aggregation guardrails | search.max_buckets / patterns/cluster-wide-aggregation-guardrail | Elasticsearch cluster setting |
| Per-client workload isolation | Per-tier thread pools / patterns/tier-tagged-query-isolation / patterns/route-tagged-query-isolation | Coordinator-layer scheduling |
| Market / cell split to contain blast radius | patterns/split-cluster-by-market-for-load-isolation / concepts/market-group-country-isolation | Cluster topology |
| Trace-altitude per-caller anomaly detection | systems/lightstep notebook-exploration workflow | APM / tracing backend |
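
The application-side query limit lever might look roughly like this in the query-builder layer; the field list, bucket threshold, and QueryRejected exception are illustrative, not from the post:

```python
# Guardrail applied in the query-builder layer, before a request leaves the client.
# Field names, the bucket threshold, and QueryRejected are illustrative.
HIGH_CARDINALITY_FIELDS = {"sku", "order_id"}   # fields known to explode aggregations
MAX_AGG_BUCKETS = 1_000                         # dynamic in the real pattern; static here


class QueryRejected(Exception):
    pass


def check_aggregations(aggs: dict) -> None:
    """Walk an Elasticsearch 'aggs' tree and reject pathological terms aggregations."""
    for name, agg in aggs.items():
        terms = agg.get("terms")
        if terms is not None:
            field = terms.get("field", "")
            size = terms.get("size", 10)        # Elasticsearch's default terms size
            if field in HIGH_CARDINALITY_FIELDS:
                raise QueryRejected(f"terms agg '{name}' on high-cardinality field '{field}'")
            if size > MAX_AGG_BUCKETS:
                raise QueryRejected(f"terms agg '{name}' requests {size} buckets (max {MAX_AGG_BUCKETS})")
        # Recurse into sub-aggregations.
        check_aggregations(agg.get("aggs", {}))


# Example: the faceting query shape from the incident is stopped client-side.
query_body = {
    "query": {"term": {"market": "de"}},
    "aggs": {"per_sku": {"terms": {"field": "sku", "size": 100_000}}},
}
try:
    check_aggregations(query_body["aggs"])
except QueryRejected as err:
    print(f"rejected before reaching the cluster: {err}")
```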

The Zalando post specifically names "rate limiting based on the type of the client traffic. Not all clients should be equal" as the follow-up direction.
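
One way to read that follow-up, as a hedged sketch (tier names and budgets are invented for illustration): charge each request against a per-tier budget denominated in cost units rather than request count, so an expensive batch caller is shed long before a cheap customer-facing one.

```python
import time

# Per-tier token buckets refilled in cost units per second, not requests per second.
# Tier names and budget numbers are illustrative.
TIER_BUDGET_PER_SECOND = {
    "customer-facing": 50_000,   # generous: cheap, latency-sensitive queries
    "batch":           2_000,    # tight: maintenance jobs can wait or be shed
}


class CostBucket:
    def __init__(self, refill_per_second: float):
        self.refill = refill_per_second
        self.level = refill_per_second
        self.last = time.monotonic()

    def try_spend(self, cost: float) -> bool:
        # Refill proportionally to elapsed time, capped at one second of budget.
        now = time.monotonic()
        self.level = min(self.refill, self.level + (now - self.last) * self.refill)
        self.last = now
        if self.level >= cost:
            self.level -= cost
            return True
        return False


buckets = {tier: CostBucket(budget) for tier, budget in TIER_BUDGET_PER_SECOND.items()}

# A cheap storefront query passes; the expensive faceting query from the batch tier is shed.
print(buckets["customer-facing"].try_spend(1))       # True
print(buckets["batch"].try_spend(10_000))            # False: exceeds the batch budget
```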

Seen in

  • sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance. Internal Zalando application, triggered by an automated maintenance workload plus a processing-logic bug, sent 20–100 req/s of high-cardinality terms aggregations on the SKU field to an Elasticsearch cluster handling thousands of req/s of normal traffic. Cluster starved, coordinator CPU pinned, filters broken across two of the largest markets; mitigated by a 5-lever app-side load-shed + a structural market-split via node-allocation-based cluster split. Root cause identified via a Lightstep trace-exploration notebook that spotted the caller running at 50× its normal volume.

Contrast with external DoS

| Dimension | External DoS | Self-inflicted DoS |
|---|---|---|
| Source | Hostile / unknown | Trusted internal service |
| Query validity | Often malformed / exploratory | Syntactically valid + semantically legitimate |
| Volume | Usually high | Often very low vs baseline |
| Per-query cost | Variable | Pathologically high |
| Defence | WAF / rate limit / blocklist | App-side cost limit + per-caller attribution |
| Detection | Volume-based alarms | Per-caller cost-weighted alarms |

External DoS defences (WAFs, IP rate limits, anomaly-detection on request headers) do not catch self-inflicted DoS because the attacker is inside the authenticated perimeter and their requests look normal.
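
A sketch of the per-caller cost-weighted alarm in the last row (window sizes, the 10× factor, and the parsed log format are illustrative): weight each slow-log entry by its duration, group by the X-Opaque-Id caller tag, and compare each caller against its own trailing baseline rather than against cluster-wide volume.

```python
from collections import defaultdict

# Each entry is one slow-log line already parsed into (caller, took_ms).
# The log format and the 10x threshold are illustrative.
def cost_by_caller(slowlog_entries):
    """Sum query time per caller: the cost-weighted view a QPS alarm never sees."""
    totals = defaultdict(float)
    for caller, took_ms in slowlog_entries:
        totals[caller] += took_ms
    return totals


def anomalous_callers(current_window, baseline_window, factor=10.0):
    """Flag callers whose cost in the current window exceeds factor x their baseline."""
    current = cost_by_caller(current_window)
    baseline = cost_by_caller(baseline_window)
    return {
        caller: cost
        for caller, cost in current.items()
        if cost > factor * baseline.get(caller, 1.0)
    }


# Tiny worked example: the maintenance job's cost jumps ~50x while its request
# count stays negligible next to storefront traffic.
baseline = [("storefront", 5)] * 3000 + [("maintenance-job", 800)] * 2
current  = [("storefront", 5)] * 3000 + [("maintenance-job", 800)] * 100
print(anomalous_callers(current, baseline))   # {'maintenance-job': 80000.0}
```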
