Stripe

Stripe Engineering (stripe.com/blog) is a Tier-1 source on the sysdesign-wiki. Stripe is a global payments company whose engineering blog documents infrastructure, reliability, and developer-tooling work at payments-grade scale. Coverage on the wiki opens with the 2024-12-12 DNS-infrastructure retrospective on an AWS-VPC-resolver packet-rate saturation incident.

Key systems

  • systems/unbound: open-source recursive DNS resolver. Stripe runs it on every host (as a local caching tier) and on a central cluster of DNS servers. The central cluster forwards by zone: service-discovery queries go to systems/consul, while queries for Route 53-hosted domains and the public Internet go to systems/aws-vpc-resolver.
  • systems/aws-vpc-resolver: rate-limited to 1,024 packets per second (pps) per network interface (ENI); the limit at the heart of the Stripe DNS-packet investigation.
  • systems/consul: Stripe's service-discovery substrate, fronted by Unbound so application code uses plain DNS to resolve services.
  • systems/aws-route-53: host for Stripe's configured private domains.
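
The zone-based forwarding described for the central cluster can be pictured as a longest-suffix match on the query name. The following is a minimal sketch of that idea, not Stripe's configuration and not Unbound's implementation: the zone names, the `FORWARD_ZONES` table, and the `pick_upstream` helper are all hypothetical.

```python
# Hypothetical forwarding table: zone suffix -> upstream label.
# Zone names are illustrative placeholders, not Stripe's real zones.
FORWARD_ZONES = {
    "consul.": "consul",
    "internal.example.": "aws-vpc-resolver",  # stands in for a Route 53 private zone
    ".": "aws-vpc-resolver",                  # default: everything else (public Internet)
}


def pick_upstream(qname: str, zones: dict = FORWARD_ZONES) -> str:
    """Return the upstream for the most specific zone that is a suffix of qname."""
    name = qname.lower().rstrip(".") + "."
    labels = name.split(".")[:-1]  # drop the empty label after the trailing dot
    # Try the full name first, then strip one leading label at a time,
    # so matches always fall on label boundaries.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:]) + "."
        if candidate in zones:
            return zones[candidate]
    return zones["."]


print(pick_upstream("web.service.consul"))   # matches the "consul." zone
print(pick_upstream("db.internal.example"))  # matches the private-zone placeholder
print(pick_upstream("stripe.com"))           # falls through to the default zone
```

Matching on whole labels rather than raw string suffixes keeps a name such as `notconsul.` from being routed to the service-discovery upstream.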

Recent articles

  • 2024-12-12 The secret life of DNS packets: investigating complex networks. Investigation of hourly SERVFAIL spikes on internal DNS requests. Root cause: a Hadoop job reverse-resolving IPs in Cloudflare's 104.16.0.0/12 block saturated the AWS VPC resolver's 1,024-pps-per-ENI cap on the central DNS server cluster. Retries at client + local-resolver + cluster-resolver levels amplified traffic ~7×. Fix: distribute reverse-lookup forwarding from the central cluster to each host's local systems/unbound (the AWS cap is per-ENI, so N hosts = N× the effective ceiling). Canonicalises the VPC-resolver packet-rate-limit concept, DNS request amplification via retries concept, request-queue-depth metric concept, iptables-packet-counter-for-rate-metric pattern, and time-bucketed-tcpdump-capture pattern.
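
The "N hosts = N× the effective ceiling" arithmetic can be made concrete with a small calculation. This is a sketch under stated assumptions: only the 1,024 pps cap and the ~7× amplification come from the summary above; the client query rate and the interface counts are hypothetical, and the model assumes one packet per query and an even spread of load across interfaces.

```python
VPC_RESOLVER_PPS_PER_ENI = 1024  # per-interface cap cited in the summary above


def effective_ceiling_pps(num_interfaces: int, cap_pps: int = VPC_RESOLVER_PPS_PER_ENI) -> int:
    """Aggregate ceiling if queries spread evenly over num_interfaces ENIs."""
    return num_interfaces * cap_pps


def upstream_pps(client_qps: float, amplification: float) -> float:
    """Packets per second reaching the VPC resolver after retry amplification."""
    return client_qps * amplification


def is_saturated(client_qps: float, amplification: float, num_interfaces: int) -> bool:
    """True when amplified demand exceeds the aggregate per-ENI ceiling."""
    return upstream_pps(client_qps, amplification) > effective_ceiling_pps(num_interfaces)


# Hypothetical workload: 2,000 client queries/s, amplified ~7x by retries.
# Hypothetical topologies: a 10-server central cluster vs. 1,000 host-local resolvers.
print(is_saturated(2000, 7, num_interfaces=10))    # 14,000 pps vs. a 10,240 pps ceiling
print(is_saturated(2000, 7, num_interfaces=1000))  # 14,000 pps vs. a 1,024,000 pps ceiling
```

Because the cap applies to each interface separately, an uneven spread can still saturate a single ENI even when the aggregate figure looks comfortable.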

Themes

  • Incident investigation at the DNS layer. Stripe's DNS debugging toolchain combines systems/tcpdump time-bucketed pcap captures, Unbound's dump_requestlist control interface, and systems/iptables packet counters reused as a DNS-rate-metric source. The stack composes built-in Linux kernel primitives into a rate-observability substrate without deploying a new agent.
  • Per-ENI rate limits as architectural constraints. AWS infrastructure rate limits are often per-interface. Fixing saturation by distributing load across more interfaces, rather than raising the limit, is a recurring Stripe-style pattern. It is documented here as patterns/distribute-dns-load-to-host-resolver and is analogous in shape to other per-ENI limits (e.g. EC2's per-ENI packets-per-second network performance and per-ENI egress bandwidth).
  • Retry amplification as a first-class failure mode. Layered resolvers (client, local Unbound, cluster Unbound) each apply their own timeout + retry logic; when the bottom-of-stack upstream fails or slows, the amplification compounds multiplicatively rather than additively. Canonical page: concepts/dns-request-amplification-via-retries.
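
Two of the themes above reduce to short calculations: turning a cumulative packet counter (such as one read from an iptables rule) into a rate, and bounding how retries compound across resolver layers. The sketch below illustrates both. The `CounterSample` type, the function names, and every number are hypothetical; how the counter value is actually read from the host is left out.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CounterSample:
    timestamp: float  # seconds, e.g. from time.time()
    packets: int      # cumulative packet count read from a counter


def packets_per_second(prev: CounterSample, curr: CounterSample) -> Optional[float]:
    """Rate between two samples, or None if the counter was reset or time did not advance."""
    elapsed = curr.timestamp - prev.timestamp
    delta = curr.packets - prev.packets
    if elapsed <= 0 or delta < 0:
        return None
    return delta / elapsed


def worst_case_amplification(attempts_per_layer: Sequence[int]) -> int:
    """Upper bound on upstream queries per client request.

    If every layer retries independently, each attempt at one layer can
    trigger the full attempt budget of the layer below it, so the bound
    is the product of the per-layer attempts, not their sum.
    """
    return math.prod(attempts_per_layer)


# Hypothetical counter readings taken 10 seconds apart.
print(packets_per_second(CounterSample(0.0, 1000), CounterSample(10.0, 11240)))

# Hypothetical attempt budgets for client, local resolver, cluster resolver.
print(worst_case_amplification([2, 2, 2]), "vs additive", sum([2, 2, 2]))
```

Returning `None` on a negative delta avoids reporting a bogus negative rate when counters are zeroed, for example after a rule reload.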