System Design — Overview

Synthesis across the wiki's ingested corpus. Sample: 91 sources / 11 companies / 268 system pages / 253 concept pages / 218 pattern pages (as of 2026-04-21). Skewed toward 2025-2026 (78 of 91 sources dated 2025 or later). Companies and sources are cited via [[wiki-links]] — drill in for evidence.

Corpus shape

Company                       | Sources | Tier | Bias
companies/figma               | 19      | 3-eq | Client-perf, C++/WASM build, data caching, sharding, DSLs
companies/aws                 | 18      | 1    | Service-design, multi-tenancy, EKS platform, GenAI infra
companies/dropbox             | 11      | 2    | Sync-engine testing, hardware, Magic Pocket, Dash agentic AI
companies/allthingsdistributed | 9      | 1    | S3, DSQL, Lambda PR/FAQs, Byron Cook on automated reasoning
companies/github              | 7       | 2    | Git server internals, search rewrites, security (PQC, SAML)
companies/datadog             | 7       | 3-eq | Husky event store, eBPF security, Go runtime, Bits AI SRE
companies/airbnb              | 6       | 2    | Observability migration, dynamic config, destination ML
companies/databricks          | 5       | 3    | Proxyless mesh, Dicer auto-sharder, cross-cloud data mesh
companies/expedia             | 4       | 3    | Iceberg, Kafka Streams, embedding platform, Trino
companies/canva               | 3       |      | Event counting, CI builds, print-order routing
companies/cloudflare          | 1       | 1    | Internal AI stack
companies/atlassian           | 1       | 3    | Streaming SSR

Gaps: no Netflix, Meta, LinkedIn, Uber, Stripe sources yet — the canonical Tier-1 distributed-systems blogs are under-represented relative to their promise in raw/_feeds.yaml.

Top recurring concepts (by source citation count)

Minimum 6 citations. Full cite-counts via grep -ohE 'concepts/[a-z0-9-]+' wiki/sources/*.md | sort | uniq -c.

  1. concepts/control-plane-data-plane-separation (15) — universal architectural primitive. Appears in config platforms (systems/sitar, systems/figcache), load balancers (systems/dropbox-robinhood), service mesh (systems/databricks-endpoint-discovery-service), sharding (systems/dicer), platform engineering (systems/santander-catalyst).
  2. concepts/tail-latency-at-scale (13) — forcing function behind "why not JVM" decisions. Canonical quantification is the systems/aurora-dsql Crossbar simulation: 40 hosts × 1s stalls → 10s tail, 6K TPS vs target 1M.
  3. concepts/llm-as-judge (11) — ubiquitous in the 2025-2026 agent wave. systems/dspy is the common tooling.
  4. concepts/observability (10) — Airbnb's 5-year OTLP migration, Datadog Husky internals, AWS telemetry-based-discovery.
  5. concepts/specification-driven-development (9) — Kiro, autoformalization, executable specs, lightweight formal verification (ShardStore).
  6. concepts/stateless-compute (8) + concepts/shared-responsibility-model (8) + concepts/lightweight-formal-verification (8) + concepts/agent-context-window (8).
  7. concepts/vector-similarity-search (7) + concepts/simplicity-vs-velocity (7) + concepts/neurosymbolic-ai (7) + concepts/managed-data-plane (7) + concepts/immutable-object-storage (7) + concepts/hybrid-retrieval-bm25-vectors (7) + concepts/disaster-recovery-tiers (7) + concepts/digital-sovereignty (7).
  8. concepts/rag-as-a-judge, concepts/noisy-neighbor, concepts/memory-safety, concepts/knowledge-graph, concepts/compute-storage-separation, concepts/automated-reasoning — all 6 citations.
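
The cite-counts behind the list above can also be reproduced outside the shell. The following is a minimal Python sketch of the same grep pipeline, assuming the wiki/sources/*.md layout named in that command. The function names are hypothetical. Like grep -o, it counts every occurrence, so a concept mentioned twice in one source counts twice.

```python
import re
from collections import Counter
from pathlib import Path

# Same pattern as the grep command: concepts/<slug> with lowercase slugs.
CONCEPT_RE = re.compile(r"concepts/[a-z0-9-]+")


def count_concept_mentions(texts):
    """Count every concepts/<slug> occurrence across an iterable of page texts."""
    counts = Counter()
    for text in texts:
        counts.update(CONCEPT_RE.findall(text))
    return counts


def count_from_dir(sources_dir="wiki/sources"):
    """Read each *.md file under sources_dir (path assumed from the grep command)."""
    paths = sorted(Path(sources_dir).glob("*.md"))
    return count_concept_mentions(p.read_text(encoding="utf-8") for p in paths)


def at_least(counts, minimum=6):
    """Keep only concepts that meet the minimum, most-cited first."""
    return [(slug, n) for slug, n in counts.most_common() if n >= minimum]
```

Calling at_least(count_from_dir()) from the repository root would list the concepts that clear the 6-citation bar. To count distinct sources rather than occurrences, deduplicate the matches per file before updating the counter.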

Top recurring patterns

  1. patterns/prompt-optimizer-flywheel (8) — DSPy-based judge-vs-human disagreement loop. Dropbox Dash canonical.
  2. patterns/platform-engineering-investment (8) — "small central team, big blast radius" staffing ratio. AWS 3-people / 6000-accounts and Santander Catalyst.
  3. patterns/specialized-agent-decomposition (7), patterns/post-inference-verification (7).
  4. patterns/tool-surface-minimization, patterns/proxyless-service-mesh, patterns/human-calibrated-llm-labeling, patterns/executable-specification, patterns/edit-quiescence-indexing, patterns/bisect-driven-regression-hunt (5 each).

Top recurring systems

  1. systems/dynamodb (17), systems/aws-s3 (16) — the two default scale-out stores.
  2. systems/dropbox-dash (15) — the wiki's most documented product (7 dedicated sources, full agentic-AI stack).
  3. systems/aws-lambda (12), systems/aws-eks (12) — serverless + managed K8s as default AWS compute surfaces.
  4. systems/github (11) — both as source and as subject.
  5. systems/aws-iam (10), systems/aurora-dsql (10) — identity is everywhere; DSQL is the Rust-in-Aurora canonical.
  6. systems/dash-search-index (9), systems/bedrock-guardrails-automated-reasoning-checks (9), systems/amazon-route53 (9).
  7. systems/kiro (8), systems/envoy (8), systems/elasticsearch (8), systems/bits-ai-sre (8).

All trend claims are counts within this 91-source sample; extrapolate carefully.

1. The agent wave is the dominant 2026 story

20 of 91 sources touch agents / DSPy / LLM-as-judge / MCP. Concentrated in 2026 (17 of 20). Anchors: systems/dropbox-dash (agentic redesign around concepts/context-engineering); systems/bits-ai-sre (Datadog); systems/aws-devops-agent; systems/kiro (spec-driven agents); systems/model-context-protocol (MCP) as the emerging tool-interop standard. Pattern signature: patterns/prompt-optimizer-flywheel + patterns/specialized-agent-decomposition + patterns/tool-surface-minimization + patterns/post-inference-verification.

2. Vector + hybrid retrieval is now table-stakes

27 of 91 sources touch AI / search / embeddings / RAG. concepts/hybrid-retrieval-bm25-vectors (7 cites) appears in Dash, Figma AI Search, Expedia Embedding Store, and Dropbox. Storage converging on systems/s3-vectors (July 2025 preview) as the cloud-native vector primitive; systems/opensearch k-NN + DynamoDB + S3 as the 2026 default triad. concepts/vector-quantization and patterns/cold-to-hot-vector-tiering are the cost levers.
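
To make "hybrid retrieval" concrete: the lexical (BM25) ranking and the vector ranking have to be merged into one result list. The sketch below uses reciprocal rank fusion, a common merging technique that is swapped in here for illustration. The cited sources may fuse results differently, and the document ids, the function name, and k=60 are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into a single ranking.

    Each doc scores sum(1 / (k + rank)) over the lists that contain it,
    so a doc ranked well by both retrievers beats one ranked by only one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a BM25 index and a vector index.
bm25_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-c", "doc-a", "doc-d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Rank-based fusion avoids having to normalize BM25 scores against cosine similarities, which live on different scales.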

3. Storage: object storage as filesystem, format, index

systems/aws-s3 absorbed three new roles in 2025-2026: systems/s3-vectors (vector index as S3 primitive), systems/s3-tables (managed Iceberg), systems/s3-files (mount any bucket via NFS on EC2/ECS). Pattern: boundary as feature — S3 exposes new surface without changing the core durability story. concepts/compute-storage-separation (6 cites) is the enabling premise.

4. Rust is the default for new data-plane code

systems/aurora-dsql (2025 retrospective), Figma memory-optimization work, systems/shardstore, Dropbox systems/dropbox-nucleus. Driver: concepts/tail-latency-at-scale × concepts/memory-safety. JVM tuning doesn't escape the (1−p)^N → 0 trap at fleet scale.
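
The (1−p)^N trap can be worked through numerically. This is a minimal sketch, assuming each host stalls independently with probability p during a request and that a request must touch all N hosts. The value p = 0.01 is illustrative, not taken from any source; N = 40 borrows the host count from the Crossbar simulation cited earlier.

```python
def prob_no_stall(p, n_hosts):
    """Probability that none of n_hosts stalls, given independent stalls."""
    return (1.0 - p) ** n_hosts


def prob_hits_stall(p, n_hosts):
    """Probability that at least one host on the request path stalls."""
    return 1.0 - prob_no_stall(p, n_hosts)


if __name__ == "__main__":
    p = 0.01  # assumed per-host stall probability (illustrative only)
    for n in (1, 40, 1000):
        print(f"N={n:5d}  P(at least one stall) = {prob_hits_stall(p, n):.4f}")
```

Under these assumptions, a stall that is rare on a single host affects roughly a third of requests at N = 40 and nearly all of them at N = 1000. Tuning can lower p, but the exponent N is set by fleet size.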

5. Observability: vendor migration + OpenTelemetry convergence

Airbnb's 5-year migration off a vendor stack onto PromQL + OTLP + vmagent streaming aggregation (systems/airbnb-observability-platform) is the canonical 2026 story. Concerns visible: concepts/alert-fatigue, patterns/alert-backtesting, concepts/metric-granularity-mismatch.

6. Security / sovereignty / DR as first-class concerns

concepts/digital-sovereignty (7 cites) emerged in 2026 with AWS's sovereign-cloud / cross-partition / European-sovereign-cloud posts. concepts/disaster-recovery-tiers (7 cites): pilot-light / warm-standby / backup-and-restore formalized into named patterns. concepts/post-quantum-cryptography (GitHub SSH, 2025-09). Authorization shifting to systems/amazon-verified-permissions + systems/cedar + patterns/per-tenant-policy-store.

7. Platform engineering with tiny teams

Emerging ratio signal: sources/2026-02-25-aws-6000-accounts-three-people-one-platform (3 engineers / 6000 AWS accounts), systems/santander-catalyst (Santander's internal developer platform), Figma's device-trust rollout. patterns/platform-engineering-investment (8) + patterns/golden-path-with-escapes.

8. Specification-driven + automated reasoning

concepts/specification-driven-development (9 cites) tied to systems/kiro, concepts/automated-reasoning, concepts/autoformalization, systems/bedrock-guardrails-automated-reasoning-checks. Byron Cook ATD interview (2026-02-17) frames neurosymbolic AI as AWS's direction of travel.

Recurring trade-offs and contradictions

Control plane vs data plane — split or unified?

systems/aurora-dsql ⚠️ contradiction across versions: initial split (Kotlin CP, Rust DP) → retracted in favor of unified Rust after operational cost of two stacks. Sitar, Figcache, Dicer all kept the split. Takeaway (see concepts/control-plane-data-plane-separation): split for independent evolution + different scaling profiles; unify when operator cost of two languages/stacks exceeds the isolation win.

In-house shard vs CockroachDB/TiDB/Spanner/Vitess

Figma databases retrospective (sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale) rejected every alternative under growth pressure; chose in-house sharding on RDS Postgres via systems/dbproxy-figma + colos. This directly contradicts the Spanner / NewSQL thesis; the deciding factors were migration risk + operational expertise, not technical inferiority.

MERGE INTO vs INSERT OVERWRITE on Iceberg

Expedia 2025-09-30 (patterns/merge-into-over-insert-overwrite) prescribes row-level updates over partition-overwrite. Canonical guidance for any Iceberg shop.

Postgres extension vs fork

Databricks Lakebase (systems/lakebase, patterns/postgres-extension-over-fork) built on systems/pageserver-safekeeper via extensions rather than forking Postgres. Complementary to AWS Aurora's internal-fork approach.

Batch vs streaming ingestion

Multiple sources, no winner. patterns/hybrid-batch-streaming-ingestion (Dropbox Dash Feature Store, Expedia Embedding Store) is the frequent answer: offline backfill for large corpora, streaming for freshness.
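
A hybrid pipeline has to reconcile the two write paths when a key arrives through both. The sketch below shows one way to do that, assuming every record carries an update timestamp and the newest write wins. The class name, the record shape, and the last-writer-wins rule are all assumptions for illustration, not details from the Dropbox or Expedia sources.

```python
from dataclasses import dataclass, field


@dataclass
class HybridStore:
    """Toy key-value store fed by both a streaming path and a batch backfill."""

    rows: dict = field(default_factory=dict)  # key -> (updated_at, value)

    def upsert(self, key, value, updated_at):
        """Streaming path: apply the write only if it is newer than what we hold."""
        current = self.rows.get(key)
        if current is not None and current[0] >= updated_at:
            return False
        self.rows[key] = (updated_at, value)
        return True

    def backfill(self, snapshot):
        """Batch path: bulk-load (key, value, updated_at) records from an offline
        snapshot, reusing the same rule so stale rows never clobber fresh ones.
        Returns the number of rows actually applied."""
        return sum(self.upsert(key, value, ts) for key, value, ts in snapshot)

    def get(self, key):
        row = self.rows.get(key)
        return None if row is None else row[1]


store = HybridStore()
store.upsert("doc-1", "fresh", updated_at=200)  # streaming event lands first
applied = store.backfill([("doc-1", "stale", 100), ("doc-2", "old", 100)])
```

Here the backfill applies only doc-2; doc-1 keeps the fresher streamed value. Routing both paths through one conflict rule makes the backfill safe to re-run.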

Pilot light vs warm standby for DR

AWS 2026-03-31 formalizes concepts/disaster-recovery-tiers with explicit cost/RTO/RPO trade-offs. patterns/pilot-light-deployment vs patterns/warm-standby-deployment vs patterns/backup-and-restore-tier is now a named choice, not a handwave.

Open questions (candidates for deep-dive)

  • How do sharding-key choices compose across colos? Figma's colo model avoids cross-shard transactions for 90% of queries. What does the 10% look like in practice, and does it break with multi-tenant products?
  • Is proxyless-service-mesh a general pattern or Databricks-specific? systems/databricks-endpoint-discovery-service depends on a Scala monorepo + Armeria client library everywhere. Not clear this generalizes to polyglot fleets.
  • What does prompt-optimizer-flywheel cost relative to manual prompt engineering? Dropbox reports a win but publishes no cost-per-optimization-cycle number. DSPy's footprint on bills is an open question.
  • Does control-plane-data-plane-separation always survive operational reality? DSQL retracted it. When does unification win?
  • Are there successful multi-region active-active products that don't rely on a proprietary substrate (Spanner / DSQL / Cosmos)? The wiki shows failover patterns but few true active-active stories.

Data sparseness warnings

  • Only 1 source each for Cloudflare, Atlassian — don't generalize "Cloudflare thinks X" or "Atlassian does Y" from these.
  • No Netflix, Meta, LinkedIn, Uber, Stripe, Google sources yet. Claims about microservices / CRDTs / consensus / payments are under-supported and should be caveated.
  • Figma's 19-source count is an artifact of a 2026-04-21 batch ingest (~15 posts the same day). Normalize before inferring "Figma publishes more than AWS."
  • The agent trend covers 20 of 91 sources (17 of them concentrated in 2026); a strong signal but not yet a majority. Don't mistake it for "everyone is doing agents."
  • Many pattern pages have ≤2 source citations, so they are provisional generalizations from narrow evidence. Always check a page's "Seen in" list before citing it as industry practice.
Last updated · 178 distilled / 1,178 read