PATTERN

Distribute DNS load to host resolver

When a central DNS-server cluster saturates a per-ENI resolver rate limit (canonical case: the AWS VPC resolver 1,024-pps-per-ENI cap), the fix is topological, not rate-negotiated: move the DNS-forwarding workload off the central cluster onto every application host's local resolver, so the workload is spread across N ENIs instead of concentrated on a few.

Mechanism

  1. Run Unbound (or equivalent) on every application host, in addition to any central cluster.
  2. Add per-zone forwarding rules on the local Unbound so the saturating zone goes directly to the upstream (e.g. VPC resolver), bypassing the central cluster for that zone.
  3. Keep the central cluster as the resolver for other zones that require service-discovery logic, split-horizon routing, or policy filtering.
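Step 2 above maps directly onto Unbound's `forward-zone` directive. A minimal sketch of the per-host config — the file path and the upstream address `10.0.0.2` (the VPC resolver lives at the VPC CIDR base + 2) are illustrative assumptions, not Stripe's actual values:

```conf
# /etc/unbound/unbound.conf.d/forward.conf — illustrative sketch.
# Send the saturating zone straight to the VPC resolver from every host,
# bypassing the central cluster for this zone only.
forward-zone:
    name: "in-addr.arpa."
    forward-addr: 10.0.0.2    # VPC resolver (CIDR base + 2); adjust per VPC
```

All other zones fall through to the host's default resolution path, so the central cluster keeps serving service-discovery and policy zones untouched.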

Optionally, split the saturating zone into fast-path and slow-path sub-zones so that each forwarding rule keeps its own smoothed-RTT retry state — the resolver's retry calculation for 10.in-addr.arpa. (private, fast) should not be poisoned by the tail latency of the broader .in-addr.arpa. (public, slow).
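The split can be expressed as two separate `forward-zone` stanzas; Unbound always matches the most specific zone name, so `10.in-addr.arpa.` wins over `in-addr.arpa.` for private addresses. Upstream addresses here are placeholders, not values from the source:

```conf
# Split reverse DNS so each rule tracks its own upstream latency.
forward-zone:
    name: "10.in-addr.arpa."   # private ranges: fast, answered via Route 53
    forward-addr: 10.0.0.2     # VPC resolver (placeholder address)

forward-zone:
    name: "in-addr.arpa."      # everything else: public, slower upstreams
    forward-addr: 8.8.8.8      # public recursive resolver (placeholder)
```

Because the two rules point at different upstream addresses, slow public lookups cannot inflate the retry timeout applied to the fast private path.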

Effective ceiling

effective_pps = per_eni_cap × host_count

For Stripe's case: 1,024 pps × thousands of Hadoop hosts ≈ multi-million pps effective ceiling vs. the ~10K-pps ceiling of a central cluster on ~10 ENIs.
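The arithmetic is worth making concrete. A back-of-the-envelope check, where the fleet size of 3,000 hosts is a hypothetical stand-in for "thousands of Hadoop hosts":

```python
# Ceiling formula: effective_pps = per_eni_cap * host_count
PER_ENI_CAP_PPS = 1024       # AWS VPC resolver per-ENI limit
CENTRAL_CLUSTER_ENIS = 10    # approximate ENI count of a central cluster
FLEET_HOSTS = 3000           # hypothetical fleet size

def effective_pps(per_eni_cap: int, eni_count: int) -> int:
    """Aggregate DNS query ceiling across all ENIs carrying the workload."""
    return per_eni_cap * eni_count

central = effective_pps(PER_ENI_CAP_PPS, CENTRAL_CLUSTER_ENIS)
distributed = effective_pps(PER_ENI_CAP_PPS, FLEET_HOSTS)
print(central, distributed)  # → 10240 3072000
```

Roughly 10K pps centralized versus 3M+ pps distributed — a ~300× ceiling increase without touching the per-ENI limit itself.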

Trade-offs

  • More processes to monitor. Every host now has a DNS resolver to observe; an incident on any one host is less severe but may go unnoticed.
  • Cache warm-up time. Local caches warm independently; fleet-wide cache warm-up is slower than warming a central cluster.
  • Observability split. DNS telemetry is now per-host rather than per-cluster; aggregation is the consumer's problem.
  • Configuration drift risk. Per-host configuration needs a fleet-wide rollout mechanism (Chef / Ansible / similar).

Seen in

  • Stripe — The secret life of DNS packets (2024-12-12). Canonical wiki instance. Stripe added forwarding rules on every Hadoop host's local Unbound for in-addr.arpa (reverse DNS) directly to the VPC resolver, split into 10.in-addr.arpa. (private, fast, Route 53) and the generic .in-addr.arpa. (public, slow, upstream nameservers). After rollout the hourly VPC-resolver saturation disappeared and SERVFAIL spikes went away.