
PATTERN Cited by 1 source

Application-side query limit with dynamic threshold

The pattern

Application-side query limit with dynamic threshold is an admission-control discipline where the query-builder / client-library layer — upstream of the shared backend — inspects each query's shape and rejects (or caps) those that would be unduly expensive, before the query ever reaches the backend. The thresholds are runtime-tunable (via config service, feature flag, or hot-reloadable rules file) so operators can tighten them during incidents and loosen them when capacity recovers.

The defining properties:

  1. App-side, not backend-side. The check runs in the caller's process, not in the shared backend's admission code. The backend never sees the pathological query at all, which avoids the failure mode where the backend's own admission control is itself too overloaded to act (a minimal sketch of this placement follows the list).
  2. Shape-based cost inspection, not volume. The decision criterion is "how expensive would this query be?" — field cardinality, aggregation bucket count, missing selective predicates, wildcard-leading LIKE — not QPS.
  3. Dynamic threshold. Operators can change the cap without a code deploy. This is load-bearing — fixed thresholds are either too loose (don't stop pathological traffic) or too tight (reject legitimate business queries), and the right number depends on cluster capacity at the moment.
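
A minimal sketch of that placement, assuming a Python service whose query-builder wraps the backend client; the names (guarded_submit, QueryRejectedError) and the shape-inspection callback are illustrative, not any particular library's API:

```python
from typing import Callable, Mapping


class QueryRejectedError(Exception):
    """Raised in the caller's process; the shared backend never sees the query."""


def guarded_submit(
    search_fn: Callable[[dict], dict],            # e.g. the backend client's search call
    inspect_fn: Callable[[dict, Mapping], list],  # shape-based cost checks (sketched below)
    thresholds: Mapping,                          # runtime-tunable, not compiled in
    query: dict,
) -> dict:
    """Admission check in the query-builder layer, upstream of the shared backend."""
    violations = inspect_fn(query, thresholds)
    if violations:
        # Cheap, local rejection: the caller gets a clean error, the backend does zero work.
        raise QueryRejectedError("; ".join(violations))
    return search_fn(query)
```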

The canonical wiki anchor is the Zalando Search & Browse team's 2025-12-16 follow-up list:

"We introduced application-side query limiting with dynamically adjustable thresholds, to prevent queries that would try to scan or aggregate too much data." (Source: sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search.)

What the inspection layer checks

Per-query cost predictors the app-side layer can evaluate before submitting the query:

| Predictor | Why it predicts cost | Example cap |
|---|---|---|
| Aggregation field cardinality | terms-on-unique-ID is the canonical pathology | max_cardinality < 10^6 for facet fields |
| Aggregation bucket count | Linear in memory + coordinator merge cost | max_buckets ≤ 10,000 per request |
| Result window (from + size) | Linear in coordinator memory | from + size ≤ 10,000 |
| Filter selectivity | Unfiltered scans hit all shards | Require at least one index-friendly filter |
| Wildcard-leading patterns | Cannot use the term index | Reject LIKE '%foo' or *foo* wildcard queries |
| Nested query depth | Compounds per-shard work | Cap nesting depth |
| Scroll size × TTL | Long-lived scroll contexts pin resources | Cap scroll TTL |

For each predictor, a static threshold catches egregious cases; the dynamic threshold catches the marginal cases that depend on current cluster health.
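
A hedged sketch of a few of these checks, assuming an Elasticsearch-style request body; the threshold names (max_result_window, max_buckets) and the traversal are illustrative:

```python
from typing import Mapping


def inspect_query(body: dict, thresholds: Mapping[str, int]) -> list:
    """Return a list of violations; an empty list means the query may be submitted."""
    violations = []

    # Result window: from + size is linear in coordinator memory.
    window = body.get("from", 0) + body.get("size", 10)
    if window > thresholds.get("max_result_window", 10_000):
        violations.append(f"result window {window} exceeds cap")

    # Aggregation bucket count: sum the requested terms sizes, recursively.
    def requested_buckets(aggs: dict) -> int:
        total = 0
        for agg in aggs.values():
            total += agg.get("terms", {}).get("size", 0)
            total += requested_buckets(agg.get("aggs", {}))
        return total

    buckets = requested_buckets(body.get("aggs", {}))
    if buckets > thresholds.get("max_buckets", 10_000):
        violations.append(f"{buckets} requested buckets exceeds cap")

    # Wildcard-leading patterns cannot use the term index.
    def has_leading_wildcard(node) -> bool:
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "wildcard":
                    for v in value.values():
                        pattern = v.get("value", "") if isinstance(v, dict) else v
                        if str(pattern).startswith(("*", "?")):
                            return True
                elif has_leading_wildcard(value):
                    return True
        elif isinstance(node, list):
            return any(has_leading_wildcard(item) for item in node)
        return False

    if has_leading_wildcard(body.get("query", {})):
        violations.append("leading-wildcard pattern")

    # Filter selectivity (simplified): require at least one index-friendly filter clause.
    if not body.get("query", {}).get("bool", {}).get("filter"):
        violations.append("no selective filter clause")

    return violations
```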

Relation to cluster-side guardrails

This pattern is complementary to, not a replacement for, cluster-side guardrails like Elasticsearch's search.max_buckets (see patterns/cluster-wide-aggregation-guardrail).

| Lever | Lives at | Protects | Cost of rejection |
|---|---|---|---|
| App-side limit | Query-builder layer | Backend from ever seeing the query | Low: client gets a clean error locally |
| search.max_buckets | ES cluster setting | Coordinator from unbounded bucket count | Medium: request already accepted, work done before rejection |
| Token-bucket slow-query limiter | Observability path | Monitoring pipeline from a slow-query storm | High: not an admission primitive; it's a telemetry rate-cap |

The defence-in-depth stance is to run both — app-side for early rejection of shape-pathological queries, cluster-side as the final guardrail for queries that slip through (including those from callers that bypass the app-side layer).
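
The cluster-side lever is itself adjustable at runtime: search.max_buckets is a dynamic cluster setting. A minimal example of tightening it during an incident, assuming the official elasticsearch-py client (8.x-style keyword arguments; host and value are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# search.max_buckets is a dynamic cluster setting, so the backend's final
# guardrail can also be tightened without a restart or redeploy.
es.cluster.put_settings(persistent={"search.max_buckets": 20_000})
```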

Why dynamic thresholds matter

Static thresholds fail in both directions:

  • Too loose: the threshold is set at "what we currently see in production", which is exactly the level a production incident has to exceed to be an incident. The threshold can't stop the next incident.
  • Too tight: legitimate business queries (analytics users, partner exports, end-of-quarter reporting) get rejected. The business blames the reliability team.

Dynamic thresholds resolve the tension:

  • Steady state: loose threshold — legitimate business queries pass, only pathological outliers rejected.
  • Degraded state: operator tightens the threshold during incident — rejects queries the cluster would normally absorb but currently cannot.
  • Recovery: operator loosens the threshold as capacity returns.

The mechanism that makes this operable is a hot-reloadable config path for the thresholds — feature flag service, etcd watch, ConfigMap + SIGHUP — so no deploy is required mid-incident.
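
A minimal sketch of one such hot-reload path, assuming thresholds live in a JSON file (for example a mounted ConfigMap) that is polled for changes; a feature-flag SDK or an etcd watch would replace the polling loop, and the file path is illustrative:

```python
import json
import os
import threading
import time


class ThresholdStore:
    """Serves the current thresholds; picks up file changes without a deploy."""

    def __init__(self, path: str, poll_seconds: float = 5.0):
        self._path = path
        self._lock = threading.Lock()
        self._mtime = 0.0
        self._thresholds: dict = {}
        self._reload_if_changed()
        threading.Thread(target=self._poll, args=(poll_seconds,), daemon=True).start()

    def current(self) -> dict:
        with self._lock:
            return dict(self._thresholds)

    def _reload_if_changed(self) -> None:
        mtime = os.path.getmtime(self._path)
        if mtime != self._mtime:
            with open(self._path) as f:
                new = json.load(f)
            with self._lock:
                self._thresholds, self._mtime = new, mtime

    def _poll(self, interval: float) -> None:
        while True:
            time.sleep(interval)
            try:
                self._reload_if_changed()
            except (OSError, ValueError):
                # On a bad read, keep serving the last good thresholds.
                pass


# Usage (path is illustrative):
#   store = ThresholdStore("/etc/query-limits/thresholds.json")
#   thresholds = store.current()
```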

Interaction with per-client attribution

The pattern pairs naturally with per-client slow-query dashboards via X-Opaque-Id:

  1. Dashboard identifies the pathological caller.
  2. Threshold is tightened for that caller — the dynamic threshold can be per-caller-class, not just global.
  3. Legitimate callers continue unaffected.

Without per-caller attribution, the operator has to choose between tightening globally (punishing innocent callers) or leaving the bad caller alone. With attribution, the dynamic threshold becomes a targeted weapon.
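
A sketch of that per-caller-class resolution, assuming the caller class is the prefix of the same X-Opaque-Id value used for slow-query attribution; the class names and the override layout are illustrative:

```python
GLOBAL_THRESHOLDS = {"max_buckets": 10_000, "max_result_window": 10_000}

# Tightened during an incident for the caller class the dashboard pointed at,
# leaving every other caller on the steady-state defaults.
PER_CALLER_OVERRIDES = {
    "analytics-batch": {"max_buckets": 1_000},
}


def thresholds_for(opaque_id: str) -> dict:
    """Resolve effective thresholds: global defaults plus caller-class overrides."""
    caller_class = opaque_id.split("/", 1)[0]  # e.g. "analytics-batch/job-42"
    merged = dict(GLOBAL_THRESHOLDS)
    merged.update(PER_CALLER_OVERRIDES.get(caller_class, {}))
    return merged
```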

Seen in

  • sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance. Follow-up engineering action after the 2025-12-16 self-inflicted DoS. The Zalando Search & Browse team added dynamically adjustable query-cost thresholds in the app-side query-builder layer specifically to prevent "queries that would try to scan or aggregate too much data." Paired with X-Opaque-Id client attribution and new per-client slow-query dashboards in a three-piece post-incident defence.