CONCEPT Cited by 1 source

DNS request amplification via retries

When DNS queries pass through multiple layers of resolvers (e.g. application → local per-host resolver → central cluster resolver → upstream authoritative), each layer applies its own timeout-and-retry logic. If the deepest upstream slows down or drops queries, every layer retries on its own schedule, and the query volume arriving at the upstream grows by roughly the product of the per-layer retry counts.

Mechanism

  • Clients are configured to retry each failed query 5×.
  • The on-host resolver (e.g. Unbound) keeps its own retry timeout based on smoothed RTT; when the upstream is slow, it retries independently of the client.
  • The central cluster resolver does the same toward its own upstream.
  • The net multiplier is approximately the product of the per-layer retry counts, not their sum.

Failure amplifies multiplicatively, not additively. Once the upstream starts failing, the traffic volume hitting it goes up, which is the exact opposite of the load shedding that a failure should trigger.
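The compounding can be made concrete with a small model. The sketch below is illustrative only: it treats the client's 5× as five attempts in total, and the resolver attempt counts (3 each) are hypothetical placeholders, not values from the source. It also assumes every upstream attempt fails independently with the same probability, which real resolvers with RTT-based timeouts only approximate.

```python
from math import prod


def worst_case_amplification(attempts_per_layer):
    """Upstream queries per original query when every attempt at every layer fails."""
    return prod(attempts_per_layer)


def additive_estimate(attempts_per_layer):
    """The (incorrect) additive intuition: one query plus each layer's extra attempts."""
    return 1 + sum(n - 1 for n in attempts_per_layer)


def expected_amplification(attempts_per_layer, p_fail):
    """Expected upstream queries per original query.

    attempts_per_layer is ordered from the layer nearest the upstream outward.
    p_fail is the assumed probability that a single upstream attempt fails.
    """
    expected = 1.0
    fail = p_fail  # failure probability of one attempt as seen by the current layer
    for n in attempts_per_layer:
        # A layer sends attempt k+1 only if its first k attempts all failed.
        expected *= sum(fail ** k for k in range(n))
        # The layer as a whole fails only if all n of its attempts fail.
        fail = fail ** n
    return expected


# Hypothetical attempt counts: cluster resolver, host resolver, client.
layers = [3, 3, 5]

print("worst case:", worst_case_amplification(layers))  # 3 * 3 * 5 = 45
print("additive guess:", additive_estimate(layers))     # 1 + 2 + 2 + 4 = 9
for p in (0.0, 0.5, 0.9, 1.0):
    print(f"p_fail={p}: expected x{expected_amplification(layers, p):.2f}")
```

Under this model the multiplier stays close to 1 while the upstream is healthy and approaches the full product as the failure probability approaches 1, which is why a degraded upstream sees its load rise rather than fall.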

Seen in

  • Stripe, "The secret life of DNS packets" (2024-12-12). Stripe measured an average ~7× amplification of the underlying query volume during saturation events: client retries (5×) compounded with local and cluster-level resolver retries against a slow VPC resolver. This amplification is what took the system from "VPC resolver is a bit slow" to "VPC resolver packet-rate cap is saturated and all DNS fails."

Mitigation

  • Distribute resolver load so the upstream isn't saturated in the first place. See patterns/distribute-dns-load-to-host-resolver.
  • Separate forwarding rules for fast-path (private) and slow-path (public) zones so the smoothed-RTT retry timeout for the fast path isn't poisoned by slow-path latency.
  • Tune retry counts and timeouts down. Default per-layer retry counts are often overly generous for a given workload.
  • Observe the amplification. Packet-counter-based rate metrics (see patterns/iptables-packet-counter-for-rate-metric) expose the outbound packet rate even when query-count metrics look normal.
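As a rough illustration of the last point, the amplification factor can be estimated from two samples of cumulative counters. This is a minimal sketch, not a description of any particular tooling: the `Sample` type, its field names, and the example numbers are hypothetical, and it assumes both counters are monotonically increasing and sampled at the same instant.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sample:
    timestamp: float       # seconds since some fixed origin
    upstream_packets: int  # cumulative packets sent to the upstream resolver
    client_queries: int    # cumulative queries received from clients


def amplification(prev: Sample, curr: Sample) -> Tuple[float, Optional[float]]:
    """Return (outbound packets per second, packets per client query) for the window."""
    dt = curr.timestamp - prev.timestamp
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    packets = curr.upstream_packets - prev.upstream_packets
    queries = curr.client_queries - prev.client_queries
    if packets < 0 or queries < 0:
        raise ValueError("counter went backwards (reset or wrap)")
    rate = packets / dt
    # With no client queries in the window the ratio is undefined.
    ratio = packets / queries if queries else None
    return rate, ratio


# Hypothetical samples taken 10 seconds apart.
before = Sample(timestamp=0.0, upstream_packets=1_000, client_queries=500)
after = Sample(timestamp=10.0, upstream_packets=8_000, client_queries=1_500)

rate, ratio = amplification(before, after)
print(f"outbound rate: {rate:.0f} pkt/s, amplification: {ratio:.1f}x")
```

Because cache hits never reach the upstream, a healthy resolver may show a ratio below 1; a ratio that climbs while the client query rate stays flat is the signature of retry amplification.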