
HIGHSCALABILITY 2022-12-02 Tier 1


High Scalability — Stuff The Internet Says On Scalability For December 2nd, 2022

Summary

Todd Hoff's weekly curated link roundup on highscalability.com for the week of 2022-12-02. The companion piece to the earlier-ingested July 11, 2022 roundup, with the same Number Stuff / Quotable Stuff / Useful Stuff / Soft Stuff / Pub Stuff structure. Value for this wiki: the Useful Stuff section distils several long-form engineering posts from late 2022 — Snap's Snapchat architecture on AWS, the October 2021 Roblox 73-hour outage (HashiStack + Consul streaming regression), Cloudflare Pingora, Walmart inventory-reservations API, Netflix Titus Gateway consistent-caching rebuild, Honeycomb's serverless storage layer, Azure Cosmos DB's NUMA-aware design, Instacart's Postgres → DynamoDB push-notifications migration, Meta PTP at 100× NTP, Tinder's TAG API Gateway, Shopify's $1M-to-$1.4k BigQuery clustering fix, Cloudflare R2 / Workers / D1 "Supercloud", and the Homa-vs-TCP datacenter-transport debate. Plus Twitter's post-layoffs infrastructure thread-of-threads.

Most of the Quotable Stuff captures running debates (serverless-vs-containers, cloud-vs-on-prem, SQL-vs-NoSQL, RDBMS-joins-vs-NoSQL-log(n), microservices-as-resume-padding, Kubernetes-on-prem-as-hell) rather than settled conclusions — cite the quote + the speaker rather than the wiki's voice.

Key takeaways

  1. Snap's Snapchat on AWS handles 300M DAU, 5B+ snaps/day, 10M QPS, 400 TB in DynamoDB (nightly scans of ~2B rows/min), on 900+ EKS clusters with 1000+ instances each. Send-path flow: client → EKS Gateway service → Media service → CloudFront+S3 (store once, near recipient) → core orchestration service (MCS) → friend-graph permission check → metadata persisted in SnapDB. SnapDB is a DynamoDB-backed database that handles transactions, TTL, and ephemeral-data state sync. Receive-path is latency-sensitive: MCS looks up the recipient's persistent connection in ElastiCache and pushes through the connection-owning server. Architecture migration reduced P50 latency 24%. Scaling levers: auto-scaling + Graviton instance optimization. (Source: Snap AWS re:Invent talk "Journey of a Snap on Snapchat Using AWS".)
  2. The 73-hour Roblox October 2021 outage was triggered by enabling Consul's new streaming feature on traffic-routing backends; under high concurrent read+write load, streaming funneled updates through a single Go channel, and contention on that channel blocked Consul KV writes. Roblox's core infra runs on-prem: 18,000 servers, 170,000 containers, on Consul + Nomad + Vault ("HashiStack"). Streaming was enabled 2021-10-27 14:00, one day before the outage; node count was concurrently increased 50% for expected end-of-year traffic. Consul KV writes blocked for long periods — "contention" was the diagnosis after 10 hours of debug-log / OS-metric analysis. The resolution: roll streaming back across all Consul systems (completed 2021-10-28 15:51), at which point Consul KV write P50 dropped to 300ms. HashiCorp's post-mortem explanation: streaming used fewer Go channels than long-polling, so at very high concurrent read+write load contention on a single channel made streaming less efficient, not more. Remediation in flight: an additional geographically-distinct datacenter + multiple availability zones within each datacenter. (Source: systems/roblox-hashistack, concepts/consul-streaming-vs-longpoll.) This is also the source of the famous "1 billion requests per second handled by Roblox's caching system" statistic that shows up in Number Stuff.
  3. Cloudflare's Pingora Rust proxy serves >1 trillion requests/day and cut median TTFB 5ms / P95 TTFB 80ms vs. the NGINX-based predecessor while consuming 70% less CPU and 67% less memory. Key design choices: Rust (for memory safety without the C-equivalent cost); multithreading over multiprocessing so the connection pool can be shared; work-stealing scheduling; NGINX-style "life-of-a-request" event-based programmable interface; share-connections-across-all-threads architecture → the connection-reuse ratio for one major customer went from 87.1% to 99.92%, a 160× reduction in new connections to origin. Pingora makes ~⅓ as many new connections per second overall. "Pingora crashes are so rare we usually find unrelated issues when we do encounter one." (Source: systems/pingora.)
  4. Netflix rebuilt Titus Gateway around a consistent-caching horizontal-scale-out layer after hitting vertical-scaling limits on read-only API serving. The mechanism: any system relying on a singleton leader-elected source of truth for managed data where the data fits in memory and latency is low can be fronted by a consistent-caching API layer, shifting API read-processing from a single box to a horizontally-scalable tier. Result: better tail latency with a minor median-latency tradeoff at low traffic; unlimited scalability of the read-only API; clients require no change. (Source: systems/titus-gateway, concepts/consistent-caching-horizontal-scale.)
  5. Walmart's inventory-reservations API scales write-heavy peak traffic with three stacked patterns: sticky-session scatter-gather, per-partition actor-with-mailbox, and in-memory snapshot-state cache. (a) Scatter-gather with sticky session so the same DB partition is always handled by the same app instance; (b) actor-per-partition with a mailbox restricts processing of each partition to a single thread, also enabling batch processing of same-partition requests; (c) in-memory snapshot cache cuts read volume. (Source: systems/walmart-inventory-reservations, patterns/sticky-session-scatter-gather, patterns/in-memory-partition-actor.)
  6. Tinder's "TAG" API Gateway is a JVM/Spring-Cloud-Gateway platform serving 500+ Tinder microservices — and B2C + B2B traffic for Hinge, OkCupid, PlentyOfFish, Ship. Application teams create a gateway instance by writing configurations, not code. TAG extends Spring Cloud Gateway's gateway and global filter components with pre-built filters for: weighted routing, request/response transformations, HTTP↔gRPC conversion, auth, rate limiting. Centralizes external-facing APIs + enforces consistent authZ across Match Group brands. (Source: systems/tinder-api-gateway, patterns/graphql-unified-api-platform (TAG is the spiritual sibling to Twitter's 2022 GraphQL unification).)
  7. Twitter's 2022 GraphQL platform unified API evolution from "weeks" to "hours" and serves 1.5 billion GraphQL fields per second across 3,000 data types. From @jbell (Twitter engineering): "Over 5 years, we've migrated almost every api used by the Twitter app into a unified graphql interface… For the last 2 years, EVERY SINGLE new Twitter feature was delivered on this api platform. Twitter blue, spaces, tweet edit, birdwatch…" In calendar 2022 alone: 1000+ changes by 300+ developers. (A related but separate Number Stuff figure: Netflix serves 1 billion daily GraphQL requests.) (Source: patterns/graphql-unified-api-platform.)
  8. Meta Precision Time Protocol (PTP) delivers ~100× better datacenter clock accuracy than state-of-the-art NTP. The implication for Spanner-style database design is direct: snapshot reads based on a monotonic clock require a commit-time delay equal to the worst-case clock uncertainty, so more accurate timestamps → less delay → more server capacity per dollar. From Meta's engineering post: "One could argue that we don't really need PTP for that. NTP will do just fine. Well, we thought that too. But experiments we ran comparing our state-of-the-art NTP implementation and an early version of PTP showed a roughly 100x performance difference." (Source: systems/meta-ptp.)
  9. Azure Cosmos DB scales to 100M req/s across petabytes of data distributed over 41 regions for a single customer, by never spanning a single process across NUMA nodes. "When a process is across the NUMA node, memory access latency can increase if the cache misses. Not crossing the NUMA node gives our process a more predictable performance." Cosmos's storage engine: lock-free B+ tree for indexing, log-structured storage on local SSD (not remote-attached), batching for network/disk I/O reduction, custom allocators tuned to request patterns, custom async scheduler with coroutines. (Source: systems/azure-cosmos-db.)
  10. Instacart migrated their push-notifications workload from Postgres to DynamoDB when they projected 8× traffic growth would exceed the largest available EC2 instance size. Their general rule: new standalone workloads that are either a straightforward fit for DynamoDB or one that Postgres wouldn't handle well → DynamoDB from day one; migrating existing Postgres tables has a higher bar. Push-notification redesign cut costs ~½ vs. Postgres + scaled on-demand to support "feature launches that change message volume dramatically"; in 6 months they went from 1 to 20+ tables supporting 5-10 features. They thinly-wrapped Dynamoid to expose an ActiveRecord-like Ruby interface. (Source: systems/dynamodb.)
  11. Shopify's $1M BigQuery query became a $1.4k query by clustering the table on two columns from the WHERE clause. 75 GB → 508 MB billed per query; 150× reduction in data scanned on the same query plan. The generalized pattern on any data-scanned-priced OLAP warehouse: understand the access pattern's filter predicates, make them physically adjacent on storage. (Source: patterns/clustered-table-for-where-clause-cost.)
  12. Cloudflare's 2022 Developer Week positions it as a "Supercloud" alternative to AWS for serverless + edge workloads. The shipping platform: Workers (compute) + Durable Objects (actor-model state) + Queues + R2 (S3-compatible object store, zero egress fees) + Cache Reserve (CDN with 8000+ PUT/s and 3000+ GET/s per major customer, 600 TB cached for large customers) + D1 (relational DB, in beta) + Pages + Functions. The egress-fee angle is the big wedge: "On CloudFront 60 TB of egress will be anywhere from $4,761.60 to $6,348.80 depending entirely on where in the world the users requesting the file are." (Source: R2 marketing + commentary.)
  13. "It's Time to Replace TCP in the Datacenter" — John Ousterhout's Homa transport design argues TCP's streaming/in-order/ACK-driven design is fundamentally mismatched to datacenter RPC workloads and its problems are too interrelated to patch. AWS agrees conceptually: their Scalable Reliable Datagram (SRD) protocol (systems/aws-srd) optimizes for performance over reliability ("a datacenter isn't the internet"), runs on Nitro dedicated hardware, uses multipath, microsecond-level retries, Ethernet-based, primarily reduces EBS tail latency since average latency isn't what hurts storage. Google's parallel work: Snap microkernel host networking, Aquila unified low-latency datacenter fabric, CliqueMap RMA-based distributed cache. (Source: systems/homa-transport, systems/aws-srd, concepts/multi-path-datacenter-transport.)
  14. Honeycomb used AWS Lambda to speed up their servers — not despite serverless but because of its fan-out profile for bulk, urgent, parallelizable workloads. 50 ms median Lambda startup with "very little difference between hot and cold startups"; 90% respond within 2.5s; 3-4× more expensive per CPU-second than EC2 but used much less often, since the fleet only exists for the duration of a burst. (Source: systems/honeycomb.)
  15. PlanetScale Boost is a partial materialized-view engine bolted onto a MySQL/Vitess database that claims up to 1000× query-performance improvements by shifting the compute from read-time to write-time. From the post: "Instead of planning how to fetch the data every time you read from your database, we want to plan how to process the data every time you write to your database… a plan where we push data on writes instead of pulling data on reads." Queries can still miss but the engine is "smarter about resolving these misses." (Source: systems/planetscale, patterns/partial-materialized-views.)
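The shared-pool advantage behind Pingora's multithreading choice (takeaway 3) can be sketched in a few lines of Python. This is illustrative only — Pingora is written in Rust, and the `SharedConnectionPool` class, the `origin.example` name, and the stand-in socket object are all invented for the sketch:

```python
import threading

class SharedConnectionPool:
    """Why one multithreaded process beats N worker processes for
    connection reuse: every thread draws from a single pool of idle
    keep-alive origin connections, so a connection parked by one
    thread can be reused by any other. Per-process pools would each
    have to dial and warm their own connections to the same origin."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = {}            # origin -> list of idle connections
        self.new_connections = 0   # how often we had to dial out

    def acquire(self, origin):
        with self._lock:
            idle = self._idle.get(origin, [])
            if idle:
                return idle.pop()          # reuse across ALL threads
        self.new_connections += 1          # pool miss: dial the origin
        return object()                    # stand-in for a real socket

    def release(self, origin, conn):
        with self._lock:
            self._idle.setdefault(origin, []).append(conn)

pool = SharedConnectionPool()
for _ in range(1000):                      # 1000 sequential requests
    conn = pool.acquire("origin.example")
    pool.release("origin.example", conn)
```

With one shared pool, 1000 requests to the same origin cost a single new connection; split across isolated per-process pools, each process would pay that dial-out cost independently — the same effect as the 87.1% → 99.92% reuse-ratio jump the post reports.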
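The Titus mechanism (takeaway 4) — a singleton leader-elected source of truth fronted by a horizontally scalable consistent cache — can be reduced to a toy model. The `Leader`/`ReadReplica` classes and the pull-on-read sync below are assumptions for illustration, not Netflix's implementation, which keeps replicas consistent by streaming changes:

```python
class Leader:
    """Singleton leader-elected source of truth. Writes land here;
    every mutation is appended to a change log that replicas replay."""
    def __init__(self):
        self.state = {}
        self.log = []                      # ordered (key, value) events

    def write(self, key, value):
        self.state[key] = value
        self.log.append((key, value))

class ReadReplica:
    """One node of the horizontally scalable read tier: serves reads
    from a local copy kept consistent by replaying the leader's log,
    so read traffic never touches the single leader box."""
    def __init__(self, leader):
        self.leader = leader
        self.cache = {}
        self.applied = 0                   # log index replayed so far

    def sync(self):
        for key, value in self.leader.log[self.applied:]:
            self.cache[key] = value
        self.applied = len(self.leader.log)

    def read(self, key):
        self.sync()    # toy shortcut: real systems stream, not pull
        return self.cache[key]

leader = Leader()
replicas = [ReadReplica(leader) for _ in range(3)]   # scale-out reads
leader.write("job-1", "RUNNING")
```

Adding a fourth replica adds read capacity without touching the leader or the clients — the "unlimited scalability of the read-only API" claim in miniature.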
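Walmart's three stacked patterns (takeaway 5) compose roughly as in this toy sketch — all names (`PartitionActor`, `route`, the SKU keys) are invented, and real sticky scatter-gather routes across app instances rather than threads:

```python
import queue
import threading
from collections import defaultdict

NUM_PARTITIONS = 4

class PartitionActor:
    """One mailbox + one worker thread per DB partition, so each
    partition is only ever mutated by a single thread (no lock
    contention on hot rows) and queued same-partition requests can
    be drained and processed as a batch."""

    def __init__(self, partition_id):
        self.partition_id = partition_id
        self.mailbox = queue.Queue()
        self.inventory = defaultdict(int)   # sku -> reserved count
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            batch = [self.mailbox.get()]     # block for first request
            while not self.mailbox.empty():  # drain the rest as a batch
                batch.append(self.mailbox.get_nowait())
            for sku, qty, done in batch:     # single-threaded: no locks
                self.inventory[sku] += qty
                done.set()

    def reserve(self, sku, qty):
        done = threading.Event()
        self.mailbox.put((sku, qty, done))
        return done

actors = [PartitionActor(i) for i in range(NUM_PARTITIONS)]

def route(sku):
    """Sticky routing: a given SKU always hashes to the same actor —
    in the real system, to the same app instance."""
    return actors[hash(sku) % NUM_PARTITIONS]

events = [route(f"sku-{i}").reserve(f"sku-{i}", 1) for i in range(100)]
for e in events:
    e.wait()
```

The in-memory snapshot cache (pattern c) would sit in front of `reserve`'s read path; it is omitted here to keep the actor/mailbox core visible.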
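The PTP-to-capacity argument (takeaway 8) reduces to simple arithmetic: a Spanner-style commit must wait out the worst-case clock uncertainty before its timestamp is safely "in the past" everywhere, so tighter clocks mean shorter waits. The uncertainty figures below (10 ms vs 100 µs) are assumed for illustration, not Meta's measurements:

```python
def commit_rate(clock_uncertainty_s, base_commit_s=0.001):
    """Commits/sec one serialized writer can sustain when every commit
    pays a mandatory wait equal to the worst-case clock uncertainty
    (Spanner-style commit wait). base_commit_s is an assumed 1 ms of
    real commit work; both numbers are illustrative."""
    return 1.0 / (base_commit_s + clock_uncertainty_s)

ntp_uncertainty = 10e-3    # assumed ~10 ms NTP-class error bound
ptp_uncertainty = 100e-6   # assumed 100x tighter with PTP-class sync

ntp_rate = commit_rate(ntp_uncertainty)   # ~91 commits/sec
ptp_rate = commit_rate(ptp_uncertainty)   # ~909 commits/sec
```

Shrinking the uncertainty 100× lifts the same server's serialized commit rate roughly 10× here — the "more server capacity per dollar" claim, with the residual gain capped by the non-clock commit cost.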
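Honeycomb's fan-out profile (takeaway 14) — bulk, urgent, parallelizable — is essentially a parallel map over storage segments. The sketch below stands in local threads for Lambda invocations; `scan_segment`, the segment data, and the worker count are all invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_segment(segment):
    """Stand-in for one Lambda invocation: scan one storage segment
    and return a partial aggregate. In Honeycomb's case each worker
    would read one slice of column data and reduce it."""
    return sum(segment)

def fan_out_query(segments, max_workers=32):
    # Bulk + urgent + parallelizable: pay for a short burst of many
    # workers instead of provisioning that peak on the query servers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(scan_segment, segments)
    return sum(partials)       # final reduce back on the caller

segments = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
result = fan_out_query(segments)
```

The economics in the takeaway follow from this shape: each worker is 3-4× pricier per CPU-second, but exists only for the seconds a query actually needs the burst.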
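The push-on-write idea behind PlanetScale Boost (takeaway 15) can be miniaturized as an incrementally maintained aggregate. This is a toy, not Boost's dataflow engine; the class and all names are invented:

```python
class PartialMaterializedView:
    """Toy partial view: maintain `order count per customer` at write
    time so reads are O(1) lookups. The view is *partial* — a read for
    an uncached key is a miss, resolved by recomputing from the base
    table and then kept fresh by subsequent writes."""

    def __init__(self):
        self.orders = []          # base table: (customer, amount) rows
        self.per_customer = {}    # the materialized view (partial)

    def insert_order(self, customer, amount):
        self.orders.append((customer, amount))
        # Push on write: update the view for keys already cached.
        if customer in self.per_customer:
            self.per_customer[customer] += 1

    def order_count(self, customer):
        if customer not in self.per_customer:      # view miss
            self.per_customer[customer] = sum(
                1 for c, _ in self.orders if c == customer)
        return self.per_customer[customer]

view = PartialMaterializedView()
view.insert_order("alice", 30.0)
view.insert_order("bob", 12.5)
view.insert_order("alice", 8.0)
first = view.order_count("alice")    # miss: computed from base table
view.insert_order("alice", 4.0)      # hit path: pushed into the view
second = view.order_count("alice")   # read is now a dict lookup
```

This is the post's quote in code: the plan for *processing data on write* replaces the plan for *fetching data on read*, and misses fall back to the base table.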

Operational numbers captured

From Number Stuff and body commentary:

  • Twitter GraphQL: 1.5 billion fields/sec across 3000 data types.
  • AWS Lambda: 1M customers, 10 trillion monthly invocations.
  • Netflix: 1 billion daily GraphQL requests.
  • Roblox caching: 1 billion req/s.
  • Stripe Cyber Monday: 20,000+ RPS at >99.9999% API success.
  • Google DDoS mitigation: 46M req/s (largest ever) blocked.
  • JWST: 57 GB/day produced.
  • Snap: 300M DAU, 5B+ snaps/day, 10M QPS, 400 TB in DynamoDB, 900+ EKS clusters × 1000+ instances.
  • Meta PTP: 100× NTP accuracy.
  • Pingora: >1 trillion req/day, −70% CPU, −67% memory, 5 ms P50 TTFB reduction, 80 ms P95 TTFB reduction, 160× fewer new origin connections per second for one customer (87.1% → 99.92% connection-reuse ratio).
  • Azure Cosmos DB (one customer): 100M req/s, petabytes of storage, 41 regions.
  • Shopify BigQuery: 75 GB → 508 MB billed per query after clustering (150× reduction), $949K/mo → $1.37K/mo projected.
  • Roblox post-outage Consul KV write P50: 300 ms.
  • Basecamp HEY: $3M/year cloud spend, $500K+ of it on RDS+OpenSearch alone.
  • Couchbase edge: 80% latency reduction.
  • 1.84 petabits/sec: single-light-source optical transmission (~2× global internet traffic in the same interval).
  • 6 ronnagrams: mass of Earth in new SI prefix.
  • Cumulus cost rewrite: 71% monthly bill reduction.
  • Prerender.io cost migration: $1M/yr → $200K/yr (80% reduction by migrating cache off AWS S3 + RDS).
  • Recursive-Lambda runaway: $40,000 accidental spend.
  • AWS CloudFront egress (60 TB): $4,761.60 – $6,348.80 depending on user-region mix.
  • Fortuitous anchors: 1 TB/month Starlink cap; 433 qubits in IBM's newest quantum computer.

Caveats

  • Roundup format — the wiki's pages treat individual items as first-class citations, but "High Scalability" itself is the aggregator, not the source. Every claim should be traced back to its linked Useful-Stuff entry.
  • Dated 2022-12-02 — the Twitter-post-layoffs debate, the Basecamp-leaving-cloud framing, and the AWS re:Invent 2022 Graviton/Nitro/Trainium namedrops are all frozen in time. Some of those situations have since evolved.
  • Several Useful-Stuff entries are Hoff's paraphrase, not direct quotes from the engineering post — especially Snap, Roblox, Tinder, Meta PTP, Azure Cosmos DB. Wiki pages downstream should cite the original engineering post when they quote numbers directly.
  • Twitter post-layoffs claims (Petrillic, atax1a, MosquitoCapital, jbell, suryaj, theyanis) are well-informed employee speculation, not architectural post-mortem. They captured the moment — but are not appropriate as ground truth for how Twitter worked then or works now.
  • Some quotes represent running debates, not consensus: @ahidalgosre on "autoscaling is an anti-pattern", @kellabyte on Kubernetes re-platforming flameouts, @ahmetb on on-prem Kubernetes "hell on earth", Crockford on retiring JavaScript, @copyconstruct on "Twitter was failing because of leadership, not engineering". Cite the quote + the speaker rather than generalizing.
  • The "serverless spectrum" / Ben Kehoe thread is Hoff's commentary on Kehoe's Medium post — the wiki position should track Kehoe's original article, not Hoff's summary.
  • Homa-replaces-TCP is an active research position, not a production outcome. AWS SRD is a concrete parallel effort; Google's Snap/Aquila/CliqueMap are research disclosures. No major datacenter has actually displaced TCP fleet-wide as of 2022.

Source

High Scalability, "Stuff The Internet Says On Scalability For December 2nd, 2022" (highscalability.com, 2022-12-02, Todd Hoff).