Stripe — The secret life of DNS packets: investigating complex networks¶
Summary¶
Stripe's 2024-12-12 incident-investigation retrospective on an hourly
spike of DNS SERVFAIL responses for a small percentage of internal
requests. Root cause: a Hadoop job that performed reverse DNS
lookups on public IP addresses (mostly in the 104.16.0.0/12 range,
Cloudflare's block) saturated the 1,024 packets/sec AWS VPC
resolver limit on the ENIs of Stripe's central DNS server cluster.
Retries by both the local on-host Unbound resolvers and the
cluster-level Unbound servers amplified the outbound packet rate by
an average of 7×. Fix: distribute the reverse-lookup load by forwarding
in-addr.arpa queries from each host's local Unbound directly to
the VPC resolver — the AWS limit is
per network interface, so splitting the load across N hosts
multiplies the effective ceiling by N.
The post canonicalises Stripe's DNS-infrastructure shape (Unbound on
every host plus a central cluster of Unbound servers forwarding to
systems/consul for service discovery, Route 53
domains, and the VPC resolver for everything else), the debugging
tool chain (Unbound statistics feeding Datadog, unbound-control
dump_requestlist, systems/tcpdump time-bucketed to 60 s pcap
files, systems/iptables packet counters on the OUTPUT chain
as a rate-measurement primitive), and the architectural fix
(push reverse-lookup forwarding to leaf hosts instead of
concentrating it at the central resolver cluster).
Key takeaways¶
- AWS VPC resolver enforces a hard 1,024 pps per-interface limit. Stripe's central DNS servers were sending 257,430 packets in a 60-second tcpdump window and receiving only 61,385 back (an average of ~1,023 pps on the reply side — pinned at the documented AWS limit). The limit is not advertised as a DNS-specific cap; it is a general VPC-resolver packet-rate cap applied per ENI. When a single host concentrates DNS traffic for a cluster, it becomes the choke point. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
- SERVFAIL is a generic error code — the request-list depth metric is what actually localises the problem. Unbound exposes a requestlist.* (internal to-do-list depth) metric; when it grows without a corresponding growth in DNS query rate, the bottleneck is upstream response latency, not local load. Stripe used this signal to rule out "too many queries" and pivot to "upstream is timing out." Canonical concepts/request-queue-depth-metric instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
- Retries amplify traffic ~7×. The client (Hadoop) retried 5× per failed reverse lookup. On top of that, the local per-host Unbound and the cluster Unbound each retry on timeout, with independent timeout calculations. Average amplification was 7× the base query volume — a canonical concepts/dns-request-amplification-via-retries datum: failure plus cached retry timeouts make a bad situation compounding rather than steady-state. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
- iptables OUTPUT chain as a lightweight packet-rate-metric primitive. Stripe added a rule matching destination=10.0.0.2 (the VPC resolver's static IP, always the VPC base address + 2) that jumps to an empty VPC_RESOLVER chain. A 1-second-sleep loop reads iptables -L VPC_RESOLVER 1 -x -n -v and reports the packet counter to the metrics pipeline. No packet capture, no kernel tracing — existing kernel counters repurposed as a rate-metric source. Canonical patterns/iptables-packet-counter-for-rate-metric instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
- Time-bucketed tcpdump to 60-second pcap files made spike analysis tractable. tcpdump -n -tt -i any -W 30 -G 60 -w '%FT%T.pcap' port 53 captures port-53 traffic into 30 rolling 60-second files named by ISO-8601 timestamp. This bounds per-file size, aligns the capture window with the observed spike cadence (hourly bursts lasting several minutes), and lets operators grep or replay a specific 60-second slice in Wireshark without loading gigabytes. Canonical patterns/time-bucketed-tcpdump-capture instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
- Fix is topological, not rate-limit tuning. Stripe couldn't
raise the AWS cap. They couldn't convince the Hadoop job to
stop reverse-resolving every IP in its network-activity logs.
The fix is architectural: the VPC-resolver limit is per-ENI,
so move the reverse-lookup load off the central-cluster ENIs
onto the thousands of Hadoop-host ENIs. Each host's local
Unbound forwards in-addr.arpa queries directly to the VPC resolver; the central cluster is no longer in the reverse-DNS path. Two forwarding rules distinguish private reverse zones (10.in-addr.arpa., fast, served via Route 53) from public ones (the rest of in-addr.arpa., slow, public authoritative nameservers), so Unbound's smoothed-RTT timeout calculations stay separated. Canonical patterns/distribute-dns-load-to-host-resolver instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
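The per-host forwarding fix described in the last bullet can be sketched as an Unbound forward-zone configuration. This is a minimal illustration, not Stripe's actual config: the file path and the 10.0.0.2 resolver address are assumptions (the address is just "VPC base + 2" and differs per VPC).

```
# /etc/unbound/unbound.conf.d/reverse-dns.conf — illustrative sketch of
# per-host reverse-lookup forwarding; 10.0.0.2 stands in for this VPC's
# resolver address (VPC base + 2).

# Private reverse zone: fast answers, served via Route 53 through the
# VPC resolver.
forward-zone:
    name: "10.in-addr.arpa."
    forward-addr: 10.0.0.2

# Public reverse zones: slow answers. Kept as a separate forward-zone so
# Unbound's smoothed-RTT timeout estimates for fast private lookups and
# slow public lookups stay independent.
forward-zone:
    name: "in-addr.arpa."
    forward-addr: 10.0.0.2
```

Because the two zones are separate forward-zones, a burst of slow public PTR lookups does not inflate the timeout statistics used for the fast private zone.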
Systems extracted¶
- systems/unbound — open-source recursive DNS resolver used at Stripe both on every host (local caching tier) and on a central cluster of DNS servers.
- systems/aws-vpc-resolver — AWS-provided DNS resolver reachable at the VPC base IP + 2 (e.g. 10.0.0.2 if the base is 10.0.0.0). Handles Route 53 domains and public Internet names. Rate-limited to 1,024 pps per ENI.
- systems/aws-route-53 — AWS DNS service hosting Stripe's configured private domains; queries for these names from inside the VPC are forwarded via the VPC resolver.
- systems/tcpdump — libpcap-based packet capture tool; Stripe used the -W 30 -G 60 time-bucketing flags to bound pcap file size.
- systems/iptables — Linux packet-filtering framework; Stripe repurposed its built-in per-rule packet counters on the OUTPUT chain as a DNS-rate-metric source.
- systems/dig — BIND's DNS debugging tool; Stripe used it to manually query the VPC resolver with representative reverse addresses to confirm slow responses.
- systems/consul — HashiCorp's service-discovery system; Stripe's central Unbound cluster forwards service-discovery queries to it.
- systems/hadoop — Apache Hadoop cluster; the specific job that triggered the incident analyses network-activity logs and reverse-resolves every IP encountered.
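The "VPC base IP + 2" convention noted above can be computed directly from a VPC CIDR with Python's standard-library ipaddress module; the CIDRs below are examples, not Stripe's actual networks.

```python
import ipaddress

def vpc_resolver_ip(vpc_cidr: str) -> str:
    """Return the VPC resolver address: the VPC network base address + 2."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)

print(vpc_resolver_ip("10.0.0.0/16"))    # 10.0.0.2
print(vpc_resolver_ip("172.31.0.0/16"))  # 172.31.0.2
```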
Concepts extracted¶
- concepts/dns-reverse-lookup-ptr — reverse DNS: given an IP, find the PTR-record hostname. Queries use the in-addr.arpa pseudo-domain. The public Internet's PTR infrastructure is much less reliable than forward DNS; many IPs have no PTR record, others time out.
- concepts/dns-resolver-caching — Unbound caches both successful answers and negative answers with TTL-bounded lifetimes; local caching on every host reduces cluster load.
- concepts/vpc-resolver-packet-rate-limit — the per-ENI 1,024-pps AWS limit canonicalised as a first-class architectural constraint.
- concepts/dns-request-amplification-via-retries — multi-layer retry (client → local resolver → cluster resolver) turns one failed query into ~7 outbound packets.
- concepts/dns-servfail-response — the generic DNS server-failure response code; useless for root-cause localisation on its own.
- concepts/request-queue-depth-metric — internal to-do-list depth as a leading-indicator metric when query rate is not the bottleneck.
Patterns extracted¶
- patterns/distribute-dns-load-to-host-resolver — move DNS load off a central cluster by forwarding per-zone queries from each host's local resolver.
- patterns/iptables-packet-counter-for-rate-metric — add a rule matching the target IP that jumps to an empty chain, poll the counter.
- patterns/time-bucketed-tcpdump-capture — -W N -G S -w '%FT%T.pcap' rolling-window capture.
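The iptables-packet-counter pattern above reduces to "read the rule's cumulative packet counter, sleep, read again, difference." A minimal sketch of that polling logic, assuming the column layout of a verbose iptables listing (pkts and bytes first, with -v -x); the sample rule line is fabricated for illustration, and in production read_counter_line would wrap the iptables invocation:

```python
import time

# Fabricated example of a rule line as printed by a command like
# `iptables -L VPC_RESOLVER 1 -x -n -v`; with -v -x the first two columns
# are the cumulative packet and byte counters.
SAMPLE_RULE_LINE = "  257430 16475520 VPC_RESOLVER udp -- * * 0.0.0.0/0 10.0.0.2"

def parse_packet_count(rule_line: str) -> int:
    """Extract the cumulative packet counter (first column of -v -x output)."""
    return int(rule_line.split()[0])

def packet_rates(read_counter_line, interval_s: float = 1.0):
    """Poll the counter and yield packets/sec by differencing successive reads.

    `read_counter_line` is any zero-argument callable returning the rule line,
    e.g. a subprocess wrapper around the iptables command above.
    """
    previous = parse_packet_count(read_counter_line())
    while True:
        time.sleep(interval_s)
        current = parse_packet_count(read_counter_line())
        yield (current - previous) / interval_s
        previous = current
```

No packet capture is involved: the kernel already maintains these counters for every rule, so the cost of the metric is one list command per polling interval.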
Operational numbers disclosed¶
- 1,024 pps — AWS VPC resolver limit per network interface.
- 257,430 outbound / 61,385 inbound packets in a single 60-second tcpdump window on the central DNS server during a spike.
- ~1,023 pps average reply rate (almost exactly the documented AWS cap, implying AWS drops the excess at the VPC level).
- ~7× average DNS request amplification from client + local-resolver + cluster-resolver retries.
- 5 client-side retries per failed reverse lookup.
- 90% of queries to the VPC resolver during a spike were reverse-DNS lookups for 104.16.0.0/12 (Cloudflare's block).
- ~hourly spike cadence, lasting several minutes each.
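The reply-rate figure above is simply the inbound packet count divided by the capture window; a one-line check using the numbers disclosed in the post:

```python
inbound_packets = 61_385   # replies seen in the 60-second tcpdump window
window_s = 60
reply_rate_pps = inbound_packets / window_s
print(round(reply_rate_pps))  # 1023 — right at the documented 1,024 pps cap
```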
Caveats¶
- Single-incident retrospective, not a fleet-wide survey; no disclosure of Stripe's total DNS query rate, cluster size, host count, or base-line VPC-resolver utilisation.
- The AWS VPC resolver's 1,024-pps limit is documented but not widely advertised; the exact behaviour at the limit (drop vs. queue vs. backpressure) is not spelled out by AWS.
- Hadoop job's need to reverse-resolve every IP is treated as a given; the post doesn't discuss disabling reverse DNS in the log-analysis pipeline or caching PTR lookups out-of-band, both of which could have solved the problem differently.
- Fix is forward-looking: the post doesn't disclose per-host reverse-lookup rate after rollout, nor whether the per-host Unbound instances ever hit the 1,024-pps ceiling themselves.
- Post is written by the DNS / Traffic infrastructure team; no post-mortem PDF, no incident timeline, no customer-impact quantification.
- The 104.16.0.0/12 range is called out as Cloudflare's, but the post doesn't explain why that block's PTR records specifically were slow (most Cloudflare IPs do have PTR records via cloudflare.com; the slowness is presumably authoritative-nameserver latency, not missing records).
Source¶
- Original: https://stripe.com/blog/secret-life-of-dns
- Raw markdown: raw/stripe/2024-12-12-the-secret-life-of-dns-packets-2019-a25d9a94.md
Related¶
- companies/stripe
- systems/unbound, systems/aws-vpc-resolver, systems/aws-route-53, systems/tcpdump, systems/iptables, systems/dig, systems/consul
- concepts/dns-reverse-lookup-ptr, concepts/dns-resolver-caching, concepts/vpc-resolver-packet-rate-limit, concepts/dns-request-amplification-via-retries, concepts/dns-servfail-response, concepts/request-queue-depth-metric
- patterns/distribute-dns-load-to-host-resolver, patterns/iptables-packet-counter-for-rate-metric, patterns/time-bucketed-tcpdump-capture
- concepts/observability — investigation workflow; extending per-query observability with per-target packet-rate metrics
- concepts/blast-radius — the choke-point-at-central-cluster failure-mode is the architectural shape being mitigated by distributing load