PATTERN
Distribute DNS load to host resolver¶
When a central DNS-server cluster saturates a per-ENI resolver rate limit (canonical case: the AWS VPC resolver 1,024-pps-per-ENI cap), the fix is topological, not rate-negotiated: move the DNS-forwarding workload off the central cluster onto every application host's local resolver, so the workload is spread across N ENIs instead of concentrated on a few.
Mechanism¶
- Run Unbound (or equivalent) on every application host, in addition to any central cluster.
- Add per-zone forwarding rules on the local Unbound so the saturating zone goes directly to the upstream (e.g. VPC resolver), bypassing the central cluster for that zone.
- Keep the central cluster as the resolver for other zones that require service-discovery logic, split-horizon routing, or policy filtering.
- Optionally split the saturating zone into fast-path and slow-path sub-zones so the resolver's smoothed-RTT retry-timeout state stays independent per rule: a resolver's retry calculation for 10.in-addr.arpa. (private, fast) should not be poisoned by the tail latency of .in-addr.arpa. (public, slow).
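The per-zone rules above might look like the following Unbound config sketch. The file path is illustrative; 169.254.169.253 is the AWS VPC resolver's link-local alias, and the upstream nameserver IP is a placeholder from the documentation range, not taken from the source.

```
# /etc/unbound/unbound.conf.d/reverse-dns.conf  (illustrative path)

# Fast path: the private reverse zone goes straight to the VPC
# resolver, bypassing the central cluster for this zone only.
forward-zone:
    name: "10.in-addr.arpa."
    forward-addr: 169.254.169.253    # AWS VPC resolver (link-local alias)

# Slow path: the generic public reverse zone gets its own rule, so
# its retry-timeout state is tracked independently of the fast path.
forward-zone:
    name: "in-addr.arpa."
    forward-addr: 203.0.113.53       # placeholder upstream nameserver
```

Because the two `forward-zone` stanzas are separate rules, Unbound keeps separate smoothed-RTT state for each, which is exactly the isolation the sub-zone split is after.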
Effective ceiling¶
effective_pps = per_eni_cap × host_count
For Stripe's case: 1,024 pps × thousands of Hadoop hosts ≈ multi-million pps effective ceiling vs. the ~10K-pps ceiling of a central cluster on ~10 ENIs.
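The ceiling arithmetic can be made concrete with a few lines of Python. The per-ENI cap is the documented AWS limit; the host counts are illustrative (the source only says "~10 ENIs" centrally and "thousands" of Hadoop hosts).

```python
# Effective DNS ceiling after distributing load across ENIs.
PER_ENI_CAP_PPS = 1024  # AWS VPC resolver cap per ENI

def effective_pps(host_count: int) -> int:
    """Aggregate packets-per-second ceiling across host_count ENIs."""
    return PER_ENI_CAP_PPS * host_count

central = effective_pps(10)     # central cluster on ~10 ENIs
fleet = effective_pps(3000)     # illustrative "thousands of hosts"

print(f"central cluster ceiling: {central:,} pps")  # 10,240 pps
print(f"fleet-wide ceiling:      {fleet:,} pps")    # 3,072,000 pps
```

At 3,000 hosts the fleet ceiling is already ~3M pps, which matches the "multi-million pps vs. ~10K pps" comparison above.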
Trade-offs¶
- More processes to monitor. Every host now has a DNS resolver to observe; an incident on any one host is less severe but may go unnoticed.
- Cache warm-up time. Local caches warm independently; fleet-wide cache warm-up is slower than warming a central cluster.
- Observability split. DNS telemetry is now per-host rather than per-cluster; aggregation is the consumer's problem.
- Configuration drift risk. Per-host configuration needs a fleet-wide rollout mechanism (Chef / Ansible / similar).
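The drift risk in the last bullet is usually addressed by templating the per-host rules through the fleet's config-management tool. A minimal Ansible sketch, assuming the forward-zone file from the Mechanism section (paths and names are illustrative, not from the source):

```yaml
# Illustrative Ansible play fragment: push identical forwarding
# rules to every host and reload Unbound, so config cannot drift.
tasks:
  - name: Install per-zone forwarding rules
    ansible.builtin.copy:
      src: files/reverse-dns.conf
      dest: /etc/unbound/unbound.conf.d/reverse-dns.conf
      mode: "0644"
    notify: Reload unbound

handlers:
  - name: Reload unbound
    ansible.builtin.service:
      name: unbound
      state: reloaded
```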
Seen in¶
- Stripe — The secret life of DNS packets (2024-12-12).
Canonical wiki instance. Stripe added forwarding rules on every Hadoop host's local Unbound sending in-addr.arpa. (reverse DNS) directly to the VPC resolver, split into 10.in-addr.arpa. (private, fast, Route 53) and the generic .in-addr.arpa. (public, slow, upstream nameservers). After the rollout the hourly VPC-resolver saturation disappeared and the SERVFAIL spikes went away.