STRIPE 2024-12-12 Tier 1


Stripe — The secret life of DNS packets: investigating complex networks

Summary

Stripe's 2024-12-12 incident-investigation retrospective on an hourly spike of DNS SERVFAIL responses for a small percentage of internal requests. Root cause: a Hadoop job that performed reverse DNS lookups on public IP addresses (mostly in 104.16.0.0/12, Cloudflare's block) saturated the 1,024 packets/sec AWS VPC resolver limit on Stripe's DNS server cluster. Retries by both the local on-host systems/unbound resolvers and the cluster-level Unbound servers amplified the outbound packet rate by an average of 7×. Fix: distribute the reverse-lookup load by forwarding in-addr.arpa queries from each host's local Unbound directly to the VPC resolver — the AWS limit is per network interface, so splitting the load across N hosts multiplies the effective ceiling by N.

The post canonicalises Stripe's DNS-infrastructure shape (Unbound on every host plus a central cluster of Unbound servers that forwards service-discovery queries to systems/consul and sends both Route 53 domains and everything else to the VPC resolver), the debugging tool chain (Unbound statistics feeding Datadog, unbound-control dump_requestlist, systems/tcpdump time-bucketed into 60 s pcap files, systems/iptables packet counters on the OUTPUT chain as a rate-measurement primitive), and the architectural fix (push reverse-lookup forwarding to leaf hosts instead of concentrating it at the central resolver cluster).

Key takeaways

  • AWS VPC resolver enforces a hard 1,024 pps per interface limit. Stripe's central DNS servers were sending 257,430 packets in a 60-second tcpdump window and receiving only 61,385 back (an average of 1,023 pps on the reply side, right at the documented cap). The limit is not advertised as a DNS-specific cap; it is a general VPC-resolver packet-rate cap applied per ENI. When a single host concentrates DNS traffic for a cluster, it becomes the choke point. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
  • SERVFAIL is a generic error code — the request-list depth metric is what actually localises the problem. Unbound exposes the depth of its internal to-do list via its requestlist.* statistics (including requestlist.exceeded); when that depth grows without a corresponding growth in DNS query rate, the bottleneck is upstream response latency, not local load. Stripe used this signal to rule out "too many queries" and pivot to "upstream is timing out." Canonical concepts/request-queue-depth-metric instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
  • Retries amplify traffic ~7×. The client (Hadoop) retried 5× per failed reverse lookup. On top of that, the local per-host Unbound and the cluster Unbound each retry on timeout, with independent timeout calculations. Average amplification was 7× the base query volume — a canonical concepts/dns-request-amplification-via-retries instance: failures plus layered, independently timed retries make a bad situation compound rather than hold steady. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
  • iptables OUTPUT chain as a lightweight packet-rate metric primitive. Stripe added a rule matching destination=10.0.0.2 (the VPC resolver static IP, always VPC-base + 2) that jumps to an empty VPC_RESOLVER chain. A 1-second-sleep loop reads iptables -L VPC_RESOLVER 1 -x -n -v and reports the packet counter to the metrics pipeline. No packet capture, no kernel tracing — existing kernel counters repurposed as a rate-metric source. Canonical patterns/iptables-packet-counter-for-rate-metric instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
  • Time-bucketed tcpdump to 60-second pcap files made spike analysis tractable. tcpdump -n -tt -i any -W 30 -G 60 -w '%FT%T.pcap' port 53 captures port-53 traffic into 30 rolling 60-second files named by ISO-8601 timestamp. This bounds the per-file size, aligns the capture window with the observed spike cadence (hourly bursts lasting several minutes), and lets operators grep / replay a specific 60-second slice in Wireshark without loading gigabytes. Canonical patterns/time-bucketed-tcpdump-capture instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
  • Fix is topological, not rate-limit tuning. Stripe couldn't raise the AWS cap. They couldn't convince the Hadoop job to stop reverse-resolving every IP in its network-activity logs. The fix is architectural: the VPC-resolver limit is per-ENI, so move the reverse-lookup load off the central-cluster ENIs onto the thousands of Hadoop-host ENIs. Each host's local Unbound forwards in-addr.arpa directly to the VPC resolver; the central cluster is no longer in the reverse-DNS path. Two forwarding rules distinguish private (10.in-addr.arpa., fast, Route 53) from public (in-addr.arpa., slow, upstream nameservers), so Unbound's smoothed-RTT timeout calculations stay separated. Canonical patterns/distribute-dns-load-to-host-resolver instance. (Source: sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks)
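
The request-list signal from the second takeaway can be read off a running Unbound. An ops sketch, assuming a DNS server host with Unbound's remote-control interface enabled:

```shell
# Queue-depth statistics: a rising requestlist depth without a matching rise
# in query counts points at slow upstreams rather than local overload.
unbound-control stats_noreset | grep '^total.requestlist'

# Dump the pending entries themselves to see which names are stuck waiting.
unbound-control dump_requestlist
```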
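
The iptables counter primitive can be sketched as below. This is a hedged reconstruction, not Stripe's exact script: it requires root, assumes the VPC resolver sits at 10.0.0.2, and the match-all accounting rule inside the chain (a rule with no target counts packets and falls through) is an assumption about how the cited `iptables -L VPC_RESOLVER 1` read works; the statsd-format output line is hypothetical.

```shell
# Create an accounting chain and route resolver-bound packets through it.
iptables -N VPC_RESOLVER
iptables -A VPC_RESOLVER                      # match-all, no target: counts and falls through
iptables -I OUTPUT -d 10.0.0.2 -j VPC_RESOLVER

# Poll the rule's packet counter once per second and emit it as a metric.
while sleep 1; do
  pkts=$(iptables -L VPC_RESOLVER 1 -x -n -v | awk '{print $1}')
  echo "dns.vpc_resolver.packets:${pkts}|c"   # hypothetical statsd counter line
done
```

The appeal of the pattern is that the kernel is already counting; the loop only reads an existing counter, so it adds no capture or tracing overhead.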
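
The two forwarding rules from the fix might look like the following unbound.conf fragment — a sketch assuming a VPC base of 10.0.0.0, so the resolver sits at 10.0.0.2:

```
forward-zone:
    name: "10.in-addr.arpa."     # private reverse space: fast path, served via Route 53
    forward-addr: 10.0.0.2       # AWS VPC resolver (VPC base + 2; address assumed)

forward-zone:
    name: "in-addr.arpa."        # public reverse space: slow path, recursed upstream
    forward-addr: 10.0.0.2
```

Both zones forward to the same resolver; the split matters because Unbound keeps per-zone state, so the smoothed-RTT timeout estimate for fast private lookups is never polluted by slow public ones — and the more-specific 10.in-addr.arpa. zone wins for private addresses.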

Systems extracted

  • systems/unbound — open-source recursive DNS resolver used at Stripe both on every host (local caching tier) and on a central cluster of DNS servers.
  • systems/aws-vpc-resolver — AWS-provided DNS resolver reachable at VPC base IP + 2 (e.g. 10.0.0.2 if base is 10.0.0.0). Handles Route 53 domains and public Internet names. Rate-limited to 1,024 pps per ENI.
  • systems/aws-route-53 — AWS DNS service hosting Stripe's configured private domains; queries for these names from inside the VPC are forwarded via the VPC resolver.
  • systems/tcpdump — libpcap-based packet capture tool; Stripe used the -W 30 -G 60 time-bucketing flags to bound pcap file size.
  • systems/iptables — Linux packet-filtering framework; Stripe repurposed its built-in per-rule packet counters on the OUTPUT chain as a DNS-rate-metric source.
  • systems/dig — BIND's DNS debugging tool; Stripe used it to manually query the VPC resolver with representative reverse addresses to confirm slow responses.
  • systems/consul — HashiCorp's service-discovery system; Stripe's central Unbound cluster forwards service-discovery queries to it.
  • systems/hadoop — Apache Hadoop cluster; the specific job that triggered the incident analyses network-activity logs and reverse-resolves every IP encountered.

Concepts extracted

  • concepts/request-queue-depth-metric — request-list depth growing without a matching growth in query rate localises a bottleneck to upstream latency rather than local load.
  • concepts/dns-request-amplification-via-retries — layered client and resolver retries multiplied the base query volume ~7× during the incident.

Patterns extracted

  • patterns/iptables-packet-counter-for-rate-metric — an iptables chain plus a 1-second polling loop turns existing kernel packet counters into a rate-metric source.
  • patterns/time-bucketed-tcpdump-capture — tcpdump with -W/-G rotation into 60-second pcap files bounds capture size and aligns files with the spike cadence.
  • patterns/distribute-dns-load-to-host-resolver — forward in-addr.arpa from each host's local Unbound straight to the per-ENI-limited VPC resolver, multiplying the effective rate ceiling by the host count.

Operational numbers disclosed

  • 1,024 pps — AWS VPC resolver limit per network interface.
  • 257,430 outbound / 61,385 inbound packets in a single 60-second tcpdump window on the central DNS server during a spike.
  • ~1,023 pps average reply rate (almost exactly the documented AWS cap, implying AWS drops the excess at the VPC level).
  • ~7× average DNS request amplification from client + local-resolver + cluster-resolver retries.
  • 5 client-side retries per failed reverse lookup.
  • 90% of queries to the VPC resolver during a spike were reverse-DNS lookups for 104.16.0.0/12 (Cloudflare's block).
  • ~hourly spike cadence, lasting several minutes each.
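
The headline rates follow directly from the 60-second capture window; a quick shell check:

```shell
# 60 s tcpdump window on a central DNS server during a spike:
echo $((257430 / 60))   # outbound packet rate: 4290 pps
echo $((61385 / 60))    # reply rate: 1023 pps — pinned at the 1,024 pps VPC-resolver cap
```

The ~3,267 pps gap between outbound and reply rates is the traffic the VPC resolver never answered.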

Caveats

  • Single-incident retrospective, not a fleet-wide survey; no disclosure of Stripe's total DNS query rate, cluster size, host count, or base-line VPC-resolver utilisation.
  • The AWS VPC resolver's 1,024-pps limit is documented but not widely advertised; the exact behaviour at the limit (drop vs queue vs backpressure) is not spelled out by AWS.
  • Hadoop job's need to reverse-resolve every IP is treated as a given; the post doesn't discuss disabling reverse DNS in the log-analysis pipeline or caching PTR lookups out-of-band, both of which could have solved the problem differently.
  • Fix is forward-looking: the post doesn't disclose per-host reverse-lookup rate after rollout, nor whether the per-host Unbound instances ever hit the 1,024-pps ceiling themselves.
  • Post is written by the DNS / Traffic infrastructure team; no post-mortem PDF, no incident timeline, no customer-impact quantification.
  • The 104.16.0.0/12 range is called out as Cloudflare's, but the post doesn't explain why a Cloudflare block's PTR records specifically were slow (most Cloudflare IPs do have PTR records via cloudflare.com; the slowness is presumably authoritative-nameserver latency, not missing records).

Source

sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks