High Scalability
High Scalability (highscalability.com) is Todd Hoff's long-running scalability-and-distributed-systems weblog, best known for its weekly "Stuff The Internet Says On Scalability" roundups and occasional long-form architectural deep-dives (e.g. "Behind AWS S3's Massive Scale", "Brief History of Scaling Uber").
For this wiki, High Scalability functions as a secondary aggregator: each roundup co-locates 15–30 distinct production-scale data points and engineering links that would otherwise require many separate ingests, and captures the running debates of the moment (monolith-vs-microservices, serverless-vs-containers, cloud-vs-on-prem, SQL-vs-NoSQL, etc.) in the voices of practitioners. Individual claims always carry the credibility of their original source, not the aggregator — wiki pages citing a roundup should, where possible, trace through to the original engineering post.
Tier classification
Tier 1 — canonical aggregator blog, cross-referenced across the corpus. Content quality is high because Hoff's selection filter is sharp.
Skip rules specific to High Scalability
- "Sponsored Post:" entries — skipped (pure ad copy).
- Book-excerpt / opinion columns with no architecture content (e.g. "The Cloud is Not a Railroad", "What is Cloud Computing? According to ChatGPT") — skipped.
- Roundups with <20% architecture content — skipped; in practice almost all Stuff The Internet Says editions pass the 20% bar because of the Useful Stuff section.
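The skip rules above can be sketched as a small filter. This is only an illustration under stated assumptions: the `Post` record, its `kind` labels, and the `architecture_fraction` estimate are hypothetical names introduced here, not part of any existing ingest tooling. Only the "Sponsored Post:" prefix and the 20% bar come from the rules themselves.

```python
from dataclasses import dataclass
from typing import Optional

# The 20% bar comes from the skip rules; everything else here is hypothetical.
ARCHITECTURE_BAR = 0.20


@dataclass
class Post:
    title: str
    kind: str                     # hypothetical label: "roundup", "opinion", "explainer", ...
    architecture_fraction: float  # hypothetical estimate in [0.0, 1.0]


def skip_reason(post: Post) -> Optional[str]:
    """Return why a post is skipped, or None if it should be ingested."""
    if post.title.startswith("Sponsored Post:"):
        return "sponsored post (pure ad copy)"
    if post.kind in ("book-excerpt", "opinion") and post.architecture_fraction == 0.0:
        return "opinion or book excerpt with no architecture content"
    if post.kind == "roundup" and post.architecture_fraction < ARCHITECTURE_BAR:
        return "roundup below the 20% architecture bar"
    return None


# Example posts; the fractions are made-up values for illustration only.
posts = [
    Post("Sponsored Post: example vendor", "sponsored", 0.0),
    Post("The Cloud is Not a Railroad", "opinion", 0.0),
    Post("Stuff The Internet Says On Scalability For December 2nd, 2022", "roundup", 0.6),
    Post("Kafka 101", "explainer", 1.0),
]

for post in posts:
    reason = skip_reason(post)
    status = f"skip ({reason})" if reason else "ingest"
    print(f"{post.title}: {status}")
```

How the architecture fraction is estimated is left open here; in practice it would likely be an editorial judgment rather than a computed value.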
Recent articles
- 2024-05-09 — Kafka 101 (Stanislav Kozlovski guest post)
- 2024-03-06 — Behind AWS S3's Massive Scale
- 2023-08-16 — The Swedbank Outage shows that Change Controls don't work
- 2023-07-16 — Lessons Learned Running Presto at Meta Scale
- 2023-07-16 — Gossip Protocol Explained
- 2023-02-22 — Consistent Hashing Algorithm
- 2022-12-02 — Stuff The Internet Says On Scalability For December 2nd, 2022
- 2022-07-11 — Stuff The Internet Says On Scalability For July 11th, 2022
Key systems surfaced via High Scalability
- systems/apache-cassandra, systems/amazon-dynamo, systems/cockroachdb, systems/riak, systems/hyperledger-fabric, systems/bitcoin-gossip, systems/swim-protocol — seven canonical gossip-protocol deployments named in the 2023 Gossip Protocol Explained explainer.
- systems/aws-s3 + systems/shardstore — the 2024 Behind AWS S3's Massive Scale explainer by Stanislav Kozlovski distills Warfield's FAST '23 keynote, the SOSP 2021 ShardStore paper, and AWS public material into one compact tour. Adds ShardStore's LSM-tree-plus-out-of-tree-shards structure and the 2024 scale numbers: 300+ microservices, 100M req/s, 400 Tbps, 280T objects, 31 regions, 99 AZs.
- systems/kafka + systems/kafka-connect + systems/kafka-streams + systems/apache-zookeeper + systems/kraft + systems/cruise-control + systems/confluent-kora + systems/redpanda + systems/warpstream — the 2024-05-09 Kafka 101 explainer (second Stanislav Kozlovski guest post on this blog) distills 13 years of Apache Kafka into a single architectural tour. Adds the full distributed-log + partition + ISR + acks-dial + consumer-group substrate vocabulary; names the KRaft __cluster_metadata-as-log mechanism replacing ZooKeeper (3.3+, full removal 4.0); frames Tiered Storage against the four co-located-broker structural walls (log recovery, historical-read IOPS exhaustion, full re-replication, rebalance motion) with the 43% producer-performance improvement dev-test datapoint; introduces Cruise Control as the canonical open-source bin-packing rebalancer; closes with the "everybody standardizes on the Kafka API and competes on the underlying implementation" industry-trajectory forecast naming Confluent Kora + Redpanda + WarpStream. Canonical upstream-framing source for every Kafka system on the wiki.
- systems/snap-architecture — Snapchat on AWS, via Snap's re:Invent 2022 talk summarized in the Dec-2022 roundup.
- systems/roblox-hashistack — the 73-hour Oct-2021 Roblox outage and the HashiStack architecture behind it.
- systems/pingora — Cloudflare's NGINX-replacement Rust proxy serving >1T req/day.
- systems/titus-gateway — Netflix's consistent-caching horizontal-scale rebuild.
- systems/walmart-inventory-reservations — scatter-gather + actor-per-partition write-heavy API.
- systems/tinder-api-gateway — Tinder's TAG JVM/Spring gateway platform.
- systems/meta-ptp — Precision Time Protocol deployment at Meta.
- systems/azure-cosmos-db — NUMA-aware engine design.
- systems/homa-transport — proposed TCP replacement for datacenter RPC.
- systems/aws-srd — Amazon's non-TCP datacenter transport.
- systems/owl-content-distribution — Meta's 800 PB/day centralized-control peer-to-peer system.
- systems/stack-overflow-architecture — 1.3B views/month on a 9-server .NET monolith.
- systems/pinterest-memcached-fleet — 5000 EC2 instances, 180M req/s, SCHED_FIFO + TCP Fast Open + extstore NVMe.
- systems/new-world-amazon-games — 30 Hz MMO simulation on AWS.
- systems/swedbank-core-banking — April 2022 unapproved-change outage; SEK 850M fine; canonical recent case for CAB-style change management failing to catch undocumented production changes.
- systems/knight-capital-smars — 2012 partial-deployment drift that caused a $460M trading loss in 45 minutes; cross-referenced as the prior-art shape of the Swedbank failure mode.
- systems/graviton3, systems/dragonflydb