System Design — Overview¶
Synthesis across the wiki's ingested corpus. Sample: 91 sources /
11 companies / 268 system pages / 253 concept pages / 218 pattern
pages (as of 2026-04-21). Skewed toward 2025-2026 (78 of 91 sources
dated 2025 or later). Companies and sources are cited via
[[wiki-links]] — drill in for evidence.
Corpus shape¶
| Company | Sources | Tier | Bias |
|---|---|---|---|
| companies/figma | 19 | 3-eq | Client-perf, C++/WASM build, data caching, sharding, DSLs |
| companies/aws | 18 | 1 | Service-design, multi-tenancy, EKS platform, GenAI infra |
| companies/dropbox | 11 | 2 | Sync-engine testing, hardware, Magic Pocket, Dash agentic AI |
| companies/allthingsdistributed | 9 | 1 | S3, DSQL, Lambda PR/FAQs, Byron Cook on automated reasoning |
| companies/github | 7 | 2 | Git server internals, search rewrites, security (PQC, SAML) |
| companies/datadog | 7 | 3-eq | Husky event store, eBPF security, Go runtime, Bits AI SRE |
| companies/airbnb | 6 | 2 | Observability migration, dynamic config, destination ML |
| companies/databricks | 5 | 3 | Proxyless mesh, Dicer auto-sharder, cross-cloud data mesh |
| companies/expedia | 4 | 3 | Iceberg, Kafka Streams, embedding platform, Trino |
| companies/canva | 3 | — | Event counting, CI builds, print-order routing |
| companies/cloudflare, companies/atlassian | 1 each | 1 / 3 | Internal AI stack / streaming SSR |
Gaps: no Netflix, Meta, LinkedIn, Uber, Stripe sources yet — the
canonical Tier-1 distributed-systems blogs are under-represented
relative to their promise in raw/_feeds.yaml.
Top recurring concepts (by source citation count)¶
Minimum 6 citations. Full cite-counts via
`grep -ohE 'concepts/[a-z0-9-]+' wiki/sources/*.md | sort | uniq -c`.
- concepts/control-plane-data-plane-separation (15) — universal architectural primitive. Appears in config platforms (systems/sitar, systems/figcache), load balancers (systems/dropbox-robinhood), service mesh (systems/databricks-endpoint-discovery-service), sharding (systems/dicer), platform engineering (systems/santander-catalyst).
- concepts/tail-latency-at-scale (13) — forcing function behind "why not JVM" decisions. Canonical quantification is the systems/aurora-dsql Crossbar simulation: 40 hosts × 1s stalls → 10s tail, 6K TPS vs target 1M.
- concepts/llm-as-judge (11) — ubiquitous in the 2025-2026 agent wave. systems/dspy is the common tooling.
- concepts/observability (10) — Airbnb's 5-year OTLP migration, Datadog Husky internals, AWS telemetry-based-discovery.
- concepts/specification-driven-development (9) — Kiro, autoformalization, executable specs, lightweight formal verification (ShardStore).
- concepts/stateless-compute (8) + concepts/shared-responsibility-model (8) + concepts/lightweight-formal-verification (8) + concepts/agent-context-window (8).
- concepts/vector-similarity-search (7) + concepts/simplicity-vs-velocity (7) + concepts/neurosymbolic-ai (7) + concepts/managed-data-plane (7) + concepts/immutable-object-storage (7) + concepts/hybrid-retrieval-bm25-vectors (7) + concepts/disaster-recovery-tiers (7) + concepts/digital-sovereignty (7).
- concepts/rag-as-a-judge, concepts/noisy-neighbor, concepts/memory-safety, concepts/knowledge-graph, concepts/compute-storage-separation, concepts/automated-reasoning — all 6 citations.
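The cite counts behind this list can also be reproduced without a shell. The following is a minimal Python sketch that mirrors the grep one-liner: it counts every `concepts/<slug>` occurrence (not distinct sources) and applies the 6-citation floor. The function names and the `read_sources` helper are hypothetical; only the `wiki/sources/*.md` path and the regex come from the command above.

```python
import re
from collections import Counter
from pathlib import Path

# Same pattern as the grep command: concepts/<lowercase-slug>
CONCEPT_RE = re.compile(r"concepts/[a-z0-9-]+")


def count_citations(texts):
    """Count every occurrence of a concepts/<slug> reference across texts."""
    counts = Counter()
    for text in texts:
        counts.update(CONCEPT_RE.findall(text))
    return counts


def top_concepts(texts, minimum=6):
    """Return (slug, count) pairs at or above the floor, highest count first."""
    counts = count_citations(texts)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(slug, n) for slug, n in ranked if n >= minimum]


def read_sources(directory="wiki/sources"):
    """Read every Markdown source page (hypothetical helper, path as in the grep)."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(directory).glob("*.md"))]
```

Calling `top_concepts(read_sources())` from the repository root would list the concepts that clear the floor. Because matches are counted per occurrence, a source that mentions a concept twice contributes two, exactly as `grep -o` does.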
Top recurring patterns¶
- patterns/prompt-optimizer-flywheel (8) — DSPy-based judge-vs-human disagreement loop. Dropbox Dash canonical.
- patterns/platform-engineering-investment (8) — "small central team, big blast radius" staffing ratio. AWS 3-people / 6000-accounts and Santander Catalyst.
- patterns/specialized-agent-decomposition (7), patterns/post-inference-verification (7).
- patterns/tool-surface-minimization, patterns/proxyless-service-mesh, patterns/human-calibrated-llm-labeling, patterns/executable-specification, patterns/edit-quiescence-indexing, patterns/bisect-driven-regression-hunt (5 each).
Top recurring systems¶
- systems/dynamodb (17), systems/aws-s3 (16) — the two default scale-out stores.
- systems/dropbox-dash (15) — the wiki's most documented product (7 dedicated sources, full agentic-AI stack).
- systems/aws-lambda (12), systems/aws-eks (12) — serverless + managed K8s as default AWS compute surfaces.
- systems/github (11) — both as source and as subject.
- systems/aws-iam (10), systems/aurora-dsql (10) — identity is everywhere; DSQL is the Rust-in-Aurora canonical.
- systems/dash-search-index (9), systems/bedrock-guardrails-automated-reasoning-checks (9), systems/amazon-route53 (9).
- systems/kiro (8), systems/envoy (8), systems/elasticsearch (8), systems/bits-ai-sre (8).
Trends observed (2025 → 2026)¶
All trend claims are counts within this 91-source sample; extrapolate carefully.
1. The agent wave is the dominant 2026 story¶
20 of 91 sources touch agents / DSPy / LLM-as-judge / MCP. Concentrated in 2026 (17 of 20). Anchors: systems/dropbox-dash (agentic redesign around concepts/context-engineering); systems/bits-ai-sre (Datadog); systems/aws-devops-agent; systems/kiro (spec-driven agents); systems/model-context-protocol (MCP) as the emerging tool-interop standard. Pattern signature: patterns/prompt-optimizer-flywheel + patterns/specialized-agent-decomposition + patterns/tool-surface-minimization + patterns/post-inference-verification.
2. Vector + hybrid retrieval is now table-stakes¶
27 of 91 sources touch AI / search / embeddings / RAG. concepts/hybrid-retrieval-bm25-vectors (7 cites) appears in Dash, Figma AI Search, Expedia Embedding Store, and Dropbox. Storage converging on systems/s3-vectors (July 2025 preview) as the cloud-native vector primitive; systems/opensearch k-NN + DynamoDB + S3 as the 2026 default triad. concepts/vector-quantization and patterns/cold-to-hot-vector-tiering are the cost levers.
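Hybrid retrieval needs some way to merge a lexical (BM25) ranking with a vector-similarity ranking. The sketch below uses reciprocal rank fusion, one common fusion technique; the sources summarized here are not confirmed to use it, and the function name, the constant `k=60`, and the document IDs are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); scores are summed across lists. Ties break on document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a BM25 index and a vector index.
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear in both lists ("a" and "c" here) rise above documents found by only one retriever, which is the behaviour hybrid retrieval is after. Rank-based fusion also avoids having to normalize BM25 scores against cosine similarities.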
3. Storage: object storage as filesystem, format, index¶
systems/aws-s3 absorbed three new roles in 2025-2026: systems/s3-vectors (vector index as S3 primitive), systems/s3-tables (managed Iceberg), systems/s3-files (mount any bucket via NFS on EC2/ECS). Pattern: boundary as feature — S3 exposes new surface without changing the core durability story. concepts/compute-storage-separation (6 cites) is the enabling premise.
4. Rust is the default for new data-plane code¶
systems/aurora-dsql (2025 retrospective), Figma memory-optimization
work, systems/shardstore, Dropbox systems/dropbox-nucleus. Driver:
concepts/tail-latency-at-scale × concepts/memory-safety. JVM tuning
doesn't escape the (1−p)^N → 0 trap at fleet scale.
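The (1−p)^N expression can be made concrete. The sketch below assumes p is the probability that a single host stalls while serving a request, N is the number of hosts the request touches, and stalls are independent; none of these definitions is spelled out in the sources, and the function names are hypothetical.

```python
import math


def prob_no_stall(p, n):
    """Probability that none of n hosts stalls, assuming independent stalls."""
    return (1.0 - p) ** n


def hosts_until_stall_likely(p, target=0.5):
    """Smallest n at which the chance of at least one stall reaches target."""
    # Solve 1 - (1 - p)^n >= target for n.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# With a 1% per-host stall chance, a 100-host fan-out is stall-free
# only about 37% of the time.
clean_fraction = prob_no_stall(0.01, 100)
```

Under these assumptions a per-host stall rate that looks negligible in isolation dominates the tail once fan-out grows, which is why reducing p alone (for example through garbage-collector tuning) stops paying off at fleet scale.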
5. Observability: vendor migration + OpenTelemetry convergence¶
Airbnb's 5-year migration off a vendor stack onto PromQL + OTLP + vmagent streaming aggregation (systems/airbnb-observability-platform) is the canonical 2026 story. Concerns visible: concepts/alert-fatigue, patterns/alert-backtesting, concepts/metric-granularity-mismatch.
6. Security / sovereignty / DR as first-class concerns¶
concepts/digital-sovereignty (7 cites) emerged in 2026 with AWS's sovereign-cloud / cross-partition / European-sovereign-cloud posts. concepts/disaster-recovery-tiers (7 cites): pilot-light / warm-standby / backup-and-restore formalized into named patterns. concepts/post-quantum-cryptography (GitHub SSH, 2025-09). Authorization shifting to systems/amazon-verified-permissions + systems/cedar + patterns/per-tenant-policy-store.
7. Platform engineering with tiny teams¶
Emerging ratio signal: sources/2026-02-25-aws-6000-accounts-three-people-one-platform (3 engineers / 6000 AWS accounts), systems/santander-catalyst (Santander's internal developer platform), Figma's device-trust rollout. patterns/platform-engineering-investment (8) + patterns/golden-path-with-escapes.
8. Specification-driven + automated reasoning¶
concepts/specification-driven-development (9 cites) tied to systems/kiro, concepts/automated-reasoning, concepts/autoformalization, systems/bedrock-guardrails-automated-reasoning-checks. Byron Cook ATD interview (2026-02-17) frames neurosymbolic AI as AWS's direction of travel.
Recurring trade-offs and contradictions¶
Control plane vs data plane — split or unified?¶
systems/aurora-dsql ⚠️ contradiction across versions: initial split (Kotlin CP, Rust DP) → retracted in favor of unified Rust after operational cost of two stacks. Sitar, Figcache, Dicer all kept the split. Takeaway (see concepts/control-plane-data-plane-separation): split for independent evolution + different scaling profiles; unify when operator cost of two languages/stacks exceeds the isolation win.
In-house shard vs CockroachDB/TiDB/Spanner/Vitess¶
Figma databases retrospective (sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale) rejected every alternative under growth pressure; chose in-house sharding on RDS Postgres via systems/dbproxy-figma + colos. This directly contradicts the Spanner / NewSQL thesis; the deciding factors were migration risk + operational expertise, not technical inferiority.
MERGE INTO vs INSERT OVERWRITE on Iceberg¶
Expedia 2025-09-30 (patterns/merge-into-over-insert-overwrite) prescribes row-level updates over partition-overwrite. Canonical guidance for any Iceberg shop.
Postgres extension vs fork¶
Databricks Lakebase (systems/lakebase, patterns/postgres-extension-over-fork) built on systems/pageserver-safekeeper via extensions rather than forking Postgres. Complementary to AWS Aurora's internal-fork approach.
Batch vs streaming ingestion¶
Multiple sources, no winner. patterns/hybrid-batch-streaming-ingestion (Dropbox Dash Feature Store, Expedia Embedding Store) is the frequent answer: offline backfill for large corpora, streaming for freshness.
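A minimal sketch of how a backfill and a stream can share one store, assuming records carry an event timestamp and the store resolves conflicts by keeping the newest record. The `Record` and `FeatureStore` names and the last-write-wins rule are illustrative assumptions, not the design of the Dropbox or Expedia systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    ts: int  # event timestamp; larger means newer


class FeatureStore:
    """Toy key-value store fed by both a batch backfill and a stream."""

    def __init__(self):
        self._rows = {}

    def upsert(self, record):
        """Store the record only if it is newer than the current one."""
        current = self._rows.get(record.key)
        if current is None or record.ts > current.ts:
            self._rows[record.key] = record
            return True
        return False

    def backfill(self, batch):
        """Apply an offline batch; return how many records were accepted."""
        return sum(self.upsert(record) for record in batch)

    def get(self, key):
        row = self._rows.get(key)
        return None if row is None else row.value


store = FeatureStore()
store.upsert(Record("doc-1", "fresh", ts=10))  # streaming update arrives first
accepted = store.backfill([Record("doc-1", "stale", ts=5), Record("doc-2", "old", ts=1)])
```

Comparing timestamps on every write lets the backfill run at any time without clobbering fresher streamed values, so the two paths do not need to be ordered against each other.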
Pilot light vs warm standby for DR¶
AWS 2026-03-31 formalizes concepts/disaster-recovery-tiers with explicit cost/RTO/RPO trade-offs. patterns/pilot-light-deployment vs patterns/warm-standby-deployment vs patterns/backup-and-restore-tier is now a named choice, not a handwave.
Open questions (candidates for deep-dive)¶
- How do sharding-key choices compose across colos? Figma's colo model avoids cross-shard transactions for 90% of queries. What does the 10% look like in practice, and does it break with multi-tenant products?
- Is `proxyless-service-mesh` a general pattern or Databricks-specific? systems/databricks-endpoint-discovery-service depends on a Scala monorepo + Armeria client library everywhere. Not clear this generalizes to polyglot fleets.
- Cost of `prompt-optimizer-flywheel` vs manual prompt engineering? Dropbox reports win but no published cost-per-optimization-cycle number. DSPy footprint on bills is an open question.
- Does `control-plane-data-plane-separation` always survive operational reality? DSQL retracted it. When does unification win?
- Are there successful multi-region active-active products that don't rely on a proprietary substrate (Spanner / DSQL / Cosmos)? The wiki shows failover patterns but few true active-active stories.
Data sparseness warnings¶
- Only 1 source each for Cloudflare, Atlassian — don't generalize "Cloudflare thinks X" or "Atlassian does Y" from these.
- No Netflix, Meta, LinkedIn, Uber, Stripe, Google sources yet. Claims about microservices / CRDTs / consensus / payments are under-supported and should be caveated.
- Figma's 19-source count is an artifact of a 2026-04-21 batch ingest (~15 posts the same day). Normalize before inferring "Figma publishes more than AWS."
- AI/agent trend is 20/91 sources (17 of them dated 2026); strong signal but not yet majority. Don't mistake for "everyone is doing agents."
- Many pattern pages have ≤2 source citations — they are
provisional generalizations from narrow evidence. Always check
the `Seen in` list before citing as industry practice.
Navigation hints¶
- Starting systems: systems/aws-s3, systems/dynamodb, systems/aurora-dsql, systems/dropbox-dash — best-documented with most cross-references.
- If you care about agents / AI infra: start at systems/dropbox-dash, then systems/bits-ai-sre, systems/kiro, concepts/context-engineering, patterns/prompt-optimizer-flywheel.
- If you care about sharding / databases: start at concepts/horizontal-sharding, then sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale, systems/aurora-dsql, systems/dicer.
- If you care about storage internals: systems/aws-s3, systems/magic-pocket, systems/husky, systems/shardstore.
- If you care about platform / multi-tenancy: concepts/tenant-isolation, systems/santander-catalyst, patterns/platform-engineering-investment.
- If you care about service mesh / networking: systems/envoy, systems/dropbox-robinhood, patterns/proxyless-service-mesh, patterns/blue-green-service-mesh-migration.
- Full navigation: index is the catalog; each of concepts/index, patterns/index, systems/index, companies/index is curated.