Meta (Facebook)¶

Meta (formerly Facebook) is a hyperscale social / advertising / VR / AI company whose engineering blog (engineering.fb.com) is a Tier-1 source on the sysdesign-wiki. Distinctive Meta systems frequently cited elsewhere in the corpus: the Presto-fronted data warehouse, Tupperware (containers/cluster management), TAO (graph store), Haystack (photo storage), PTP (precision time), Owl (content distribution), and — as of 2024 — the Grand Teton H100 platform underpinning Meta's paired 24K-GPU RoCE + InfiniBand GenAI training clusters on which Llama 3 was trained.

For this wiki, Meta pages primarily come through two channels:

First-party Meta Engineering posts — architectural deep-dives, data warehouse / infra / ML / networking internals.
Meta-authored pieces republished on secondary aggregators like High Scalability — operational retrospectives co-authored by Meta production engineers.

Recent articles¶

2026-06-03 — Lights Out, Systems On: Validating Instant Power Loss Readiness (first-party Meta Engineering; Data Center Engineering) — Meta introduces Instantaneous PowerLoss Storm, a testing paradigm that validates the infrastructure's ability to handle zero-notice complete power loss of an entire data center region. Describes how Meta built power-loss tolerance bottom-up into the DC stack (mechanical/electrical → racks → storage → compute → Twine orchestrator) and validated it by actually de-energizing large production regions with critical storage, AI, and data warehouse workloads. Two critical bootstrapping problems detailed: circular dependencies in Twine control-plane startup (solved via Belljar CI/CD detection + purpose-built Twrko jumpstart capability — canonical belt-and-braces approach) and the "boomerang effect" where UE shutdown signals killed the orchestrator itself (solved by letting control-plane services ignore power-related shutdown signals). Validation followed an incremental blast-radius strategy: pre-production → shadow → smallest production → large production regions. A typical DC region is 50–60× the size of a typical sub-regional fault domain. Thirty-first Meta first-party ingest and first canonical region-scale disaster-recovery validation source on the wiki — complements the fleet-maintenance axis (OpsPlanner 2024-06-16) and the RoCE network resilience axis (SIGCOMM 2024-08-05).
2026-05-26 — SilverTorch: Index as Model, a new retrieval paradigm for recommendation systems (first-party Meta Engineering; ML Applications; SIGIR 2026 paper) — Meta's Recommendation Systems team rebuilds the recsys retrieval stage as SilverTorch, a single PyTorch model in which every component (item index, eligibility filter, scoring layer, user tower) is "a tensor or operator inside" the Index as Model neural network — replacing the prior orchestrator + user-tower + ANN + filter + scoring microservice mesh. Headline outcome on an 80M-item production retrieval workload: 23.7× more requests per second + 20.9× TCO efficiency (13.35× including neural reranking) over a same-architecture multi-service baseline; top-K capability lifts from Faiss-GPU's 2,048 to "hundreds of thousands"; neural reranking + multi-task scoring (like / share / comment composite) become affordable inside the retrieval forward pass under the sub-100 ms latency budget. Three structural failures of the prior microservice mesh drove the swing — "latency lost to data movement" between services, version inconsistency across user-tower / item-index / filter-rules deploying on independent cadences, and siloed ML / infra development. Pure-PyTorch as substrate dissolves the ML / infra boundary ("an engineer working on a new retrieval idea writes PyTorch and only PyTorch" — engineering velocity weeks/months → days) and inherits the PyTorch ecosystem's optimisations for free including torch.compile. Canonical wiki instance of the new patterns/unified-pytorch-model-as-retrieval-system pattern — and the wiki's first ML-pipeline-altitude instance of microservice→monolith pendulum (distinct in scope from the API-surface monolith instances at Airbnb / Uber / Stack Overflow). Two GPU-native retrieval primitives redesigned under the new patterns/gpu-native-retrieval-primitive-redesign pattern: Bloom index filter replaces inverted-index eligibility filtering (CPU posting-lists fight GPU warps via length variance — Bloom signature bit operations are "dense parallel work GPUs are good at"; 291–523× faster than CPU inverted index); fused Int8 ANN search replaces Faiss-GPU (Int8 storage halves memory + leverages dp4a + runs as one kernel; 2.2–14.7× faster than Faiss-GPU, "no measurable recall loss" at 64 probes / top-2048). Probe-then-filter co-design within one model graph cuts filter compute another 30× beyond the per-primitive wins. Index freshness becomes streaming in-place tensor mutation (canonical wiki instance of the new patterns/streaming-in-place-tensor-update pattern + concepts/streaming-model-weight-update concept) — "with index as a model module, maintaining index freshness equates to updating the model weights of a neural network in production, at scale, without taking the model offline." Snapshot publishes for full-model + continuous streaming weight updates between snapshots; "same-day posts now represent a significant portion of recommendations." Placement strategy: canonical wiki instance of the new patterns/scale-up-first-then-scale-out-gpu pattern + concepts/gpu-memory-hierarchy concept — saturate single-GPU SRAM/HBM/host-DRAM/remote-DRAM hierarchy first, then within-host multi-GPU, then document-shard across hosts, with TorchRec handling sparse-table sharding across HBM + GPU host DRAM + remote CPU-host DRAM as the fourth tier. Three-stage development arc: reproduce every baseline retrieval module in PyTorch → rethink each as GPU-native → enable backprop for joint training. Forward-looking: LLMs plug in "as just another module" — Meta's bet that the right substrate for LLM-in-recsys is the same Index-as-Model graph rather than a parallel LLM service alongside the retrieval mesh. "Widely adopted within Meta across different apps" (Facebook + Instagram + Threads feed + video). Companion paper at SIGIR 2026 (arXiv:2511.14881). Thirtieth Meta first-party ingest on the wiki and first canonical recsys-retrieval-substrate ingest from Meta — sibling to the existing model-architecture posts (MARM for ads ranking; Friend Bubbles for Reels recommendation; Groups Search for hybrid retrieval) — by anchoring the GPU-native retrieval-stage substrate axis on which all those models now ride. Architecture-and-results voice — deeper internals (exact Int8 quantisation scheme, Bloom-filter parameters, streaming-update batch shape, per-app QPS, fleet size, vendor mix) deferred to the SIGIR paper. Introduces 3 new systems (systems/silvertorch, systems/torchrec, systems/torch-compile), 8 new concepts (concepts/index-as-model, concepts/fused-int8-ann-search, concepts/bloom-index-filter-gpu, concepts/gpu-memory-hierarchy, concepts/streaming-model-weight-update, concepts/multi-task-retrieval-scoring, concepts/version-skew-microservice-retrieval, concepts/document-sharding), and 4 new patterns (patterns/unified-pytorch-model-as-retrieval-system, patterns/gpu-native-retrieval-primitive-redesign, patterns/streaming-in-place-tensor-update, patterns/scale-up-first-then-scale-out-gpu); extends concepts/ann-index, concepts/two-tower-architecture, concepts/retrieval-ranking-funnel, concepts/multi-task-learning, concepts/monolith-vs-microservices-pendulum, systems/pytorch, systems/faiss with the SilverTorch face.
2026-05-12 — Migrating Data Ingestion Systems at Meta Scale (first-party Meta Engineering; Data Infrastructure) — Meta's migration-discipline post documenting the successful end-to-end transition of its data ingestion system — the daily-petabyte-scale CDC pipeline that scrapes Meta's social graph from one of the world's largest MySQL deployments into the data warehouse — off a legacy customer-owned-pipelines architecture onto a "simpler self-managed data warehouse service that still operates efficiently at hyperscale." Outcome: 100% of workload transitioned, legacy system fully deprecated. Two parallel challenges addressed independently: (A) per-job seamless transition via a canonical three-phase Shadow → Reverse Shadow → Cleanup lifecycle (canonical wiki instance of the new patterns/shadow-then-reverse-shadow-migration pattern — shadow-phase job runs in pre-production writing to a separate shadow table; reverse-shadow phase swaps writers so the shadow job becomes the production-table writer and the original production job is demoted to writing the shadow table — "effectively making the shadow job the new production job" — providing both ongoing post-rollout data-quality signal and a hot rollback path that requires no recreation of the legacy job; canonical extension of parallel-run family with the production-table-writer-swaps variant); three machine-checkable promotion criteria gate every phase transition (data-quality match via row-count + checksum; no landing-latency regression; no resource-utilization regression); plus the new patterns/data-quality-analysis-tool-with-edge-case-logging pattern that aggregates partition-level mismatches to Scuba hourly, runs targeted queries to find example offending rows, and re-logs the augmented analysis to the same store — "this same data quality analysis tool is still in use after the migration as part of the release validation process" (the migration produced a permanent operational tool as a side effect). (B) Tens-of-thousands-of-jobs scale addressed via canonical wiki instance of the new patterns/automated-job-lifecycle-promotion pattern — every job continuously emits its phase + criteria-progress to Scuba; an external tool auto-promotes / auto-demotes jobs between lifecycle phases on the criteria signals (system-level + job-level dashboards as the operator surface) — combined with the new patterns/known-issue-exclusion-batch-selection pattern (defer affected jobs during root-issue remediation to avoid N copies of the same alert flooding the data-quality stream + duplicate full-dump cost) and the new patterns/snapshot-reuse-from-legacy-during-migration pattern (seed the new system from the legacy's most recent snapshot output to bypass the slow + expensive first-full-dump tax — a CDC-specific optimisation that requires both systems share the same source). The architectural substrate underneath both solution sets is the new tri-layer CDC schema: per-job full-dump table + delta table + target table, governed by a central management service (unnamed). Defining hazard: the new concepts/cdc-bad-data-propagation concept — "being a CDC process means the data generated by the system is used again to generate the new data… if previous landed data has any issues the problematic data will be passed to the new landed data" — bounded by the new patterns/partition-marking-stops-cdc-bleeding pattern: bad partitions are marked in metadata (not deleted in-place) so a delta partition halts CDC consumption + alerts; a target partition is substituted with an older known-good partition merged forward, transparent to consumers. Marks also serve as a rollback-substrate index for bulk backfill. Architecture-and-discipline voice — no QPS, no fleet size, no exact job count beyond "tens of thousands", no migration duration, no compute/storage delta achieved between old and new system. Twenty-ninth Meta first-party ingest and first canonical wiki home for Meta's data ingestion system (filling a gap that had been referenced via TAO / MySQL / data-warehouse posts but never anchored to a dedicated page); also introduces the new systems/meta-data-ingestion-system system page, the concepts/migration-job-lifecycle / concepts/shadow-job-pre-production / concepts/reverse-shadow-phase / concepts/landing-latency / concepts/cdc-bad-data-propagation / concepts/partition-quality-marking / concepts/full-dump-vs-delta-vs-target / concepts/data-quality-checksum-comparison concepts, and the patterns/shadow-then-reverse-shadow-migration / patterns/automated-job-lifecycle-promotion / patterns/partition-marking-stops-cdc-bleeding / patterns/known-issue-exclusion-batch-selection / patterns/snapshot-reuse-from-legacy-during-migration / patterns/data-quality-analysis-tool-with-edge-case-logging patterns. Sibling to Sapling (migration-at-monorepo-scale on the source-control axis) + PAI / Policy Zones (logging-mode-to-enforcement-mode rollout shape, structurally analogous gradual-rollout-via-continuous-signal applied to privacy enforcement instead of CDC ingestion).
2026-05-01 — How Meta Is Strengthening End-to-End Encrypted Backups (first-party Meta Engineering; Security) — Short architecture-and-commitment post hardening the infrastructure around the HSM-based Backup Key Vault (the substrate first introduced in Meta's 2021-09-10 post, not ingested) that holds per-user E2EE-backup keys for WhatsApp and Messenger inside tamper-resistant HSMs inaccessible to Meta, cloud providers, or any third party. Substrate recap: geographically distributed fleet across multiple datacenters with majority-consensus replication. Two load-bearing additions: (1) Over-the-air fleet key distribution for Messenger — because Messenger supports new HSM fleets deploying without an app update, fleet public keys arrive inline with HSM responses inside a validation bundle signed by Cloudflare and counter-signed by Meta, with Cloudflare maintaining an audit log of every bundle. Canonical wiki instance of the new patterns/ota-fleet-public-key-distribution pattern, the new patterns/third-party-countersignature-for-trust-anchor pattern (Cloudflare + Meta as two independent trust authorities), and the new concepts/trust-anchor-distribution concept (three shapes now canonical across Meta: hardcoded-in-binary = WhatsApp vault; OTA validation bundle = Messenger vault; attestation-against-ledger = Private Processing). WhatsApp's hardcoded-in-app posture is contrasted explicitly — WhatsApp's app update cadence makes rotation via app update acceptable; Messenger's product constraint does not. (2) Transparent fleet deployment — Meta commits to publishing evidence of the secure deployment of each new HSM fleet on the Meta Engineering blog; rollovers are infrequent ("typically no more than every few years"); users verify via the Audit section of Meta's Security of E2EE Backups whitepaper. Canonical wiki instance of the new patterns/publish-deployment-evidence-for-transparency pattern. Companion to the 2025-04-30 Private Processing post: both shift key Meta security substrates from "trust us" to "trust a published third-party-held artifact" — that post uses binary-digest ledgers for TEE images; this post uses Cloudflare-signed validation bundles for HSM fleet keys. Architecture-and-commitment voice — no HSM vendor, fleet size, datacenter count, validation-bundle schema, signature algorithm, or rollover cadence beyond "every few years" disclosed; deployment-evidence artifacts are promised but not yet shipped as of 2026-05-01. Twenty-eighth Meta first-party ingest and first canonical wiki home for the HSM Backup Key Vault (filling a gap that had been referenced in passing via prior WhatsApp security posts but never anchored). Also introduces HSM, concepts/majority-consensus-replication, and concepts/audit-log-as-transparency-artifact as first-class wiki concepts; the full-companion to the 2025-04-30 Private Processing post completes Meta's 2025-26 transparency-substrate posture on the wiki (HSM-fleet layer + TEE-binary layer).
2026-04-21 — Modernizing the Facebook Groups Search to Unlock the Power of Community Knowledge (first-party Meta Engineering; ML Applications) — Meta re-architects the Facebook Groups scoped-search discussions module from pure keyword retrieval onto a hybrid retrieval architecture (BM25 + dense vectors) with parallel decoupled pipelines (patterns/decoupled-parallel-retrieval-pipelines) sharing a common query-preprocessing stage and merging candidates only at an MTML L2 ranker jointly optimising clicks + shares + comments. Lexical path: Unicorn, Facebook's in-house inverted index since the 2013 Graph Search era — first canonical wiki page. High precision for proper nouns / specific quotes; features TF-IDF + BM25. Semantic path: Search Semantic Retriever (SSR) — a 12-layer, 200M-parameter encoder (first canonical wiki disclosure of SSR size) producing dense query vectors against a precomputed Faiss vector index of group posts (first canonical wiki Faiss page). Feature into L2: cosine similarity. The ranker's job is explicit: "Merging results from two fundamentally different paradigms — sparse lexical features and dense semantic features — required a sophisticated ranking strategy." Quality gated in CI via LLM-judge-in-BVT: Meta integrates a Llama 3 multimodal judge directly into the build-verification-test pipeline, grading search results on a three-tier rubric — exact-match / somewhat-relevant / irrelevant — before any build advances toward production. First Meta-authored LLM-judge-in-CI instance on the wiki; stronger operational stance than offline-eval-leaderboard siblings (Zalando / Instacart). Motivation: "validate quality at scale without the bottleneck of human labeling." The "somewhat relevant" category is explicitly designed for diversity + conceptual-match measurement — "cases where the query and result share a common domain or theme (e.g., different sports are still relevant in a general sports context)." Outcomes framed directionally: "tangible improvements in search engagement and relevance, with no increase in error rates"; L2 Model + EBR (Hybrid) beats lexical baseline on daily-users-performing-search — no percentage lifts, no QPS, no latency, no vector-index cardinality disclosed. Roadmap: LLMs directly in ranking (process post content beyond vector similarity); LLM-driven adaptive retrieval adjusting parameters based on query complexity. Friction-points framing: discovery (natural-language-intent gap in keyword search — "Italian coffee drink" misses "cappuccino"), consumption ("effort tax" of scrolling comments for consensus), validation (using community expertise to validate decisions — Marketplace vintage-Corvette example). Companion paper at arXiv:2509.13603 — not ingested on this wiki. Third Meta-domain MTML instance (recsys Friend Bubbles + ads systems/meta-adaptive-ranking-model + now search) — canonicalises MTML as Meta's default multi-objective ranker architecture across recsys + ads + search.
2026-04-16 — Post-Quantum Cryptography Migration at Meta: Framework, Lessons, and Takeaways (first-party Meta Engineering; Security) — Meta's Security team publishes a programme-level strategy paper on its multi-year PQC migration. Headline governance primitive: the PQC Migration Levels ladder — PQ-Unaware → PQ-Aware → PQ-Ready → PQ-Hardened → PQ-Enabled — organised around time to react to a relevant quantum event (shorter is better). Even PQ-Ready — "not a desirable end goal given the fact it is not yet protecting the use case against quantum attacks" — is valuable because it "reduces the time to react." PQ-Hardened exists for use cases where literature gaps (efficient PQ-OPRFs) prevent full enablement today. Three-tier prioritisation framework classifies applications by attack class not asset value: High (offline-attackable via Shor — public-key encryption + key exchange — SNDL-vulnerable; split by external-dependency status), Medium (online-attackable via Shor — digital signatures — split by patching capability: hardware-baked keys vs software-upgradable), Low (Grover-only with inadequate parameters — symmetric). Cryptographic inventory via the canonical new patterns/automated-discovery-plus-developer-reporting pattern: Meta's 2024 Crypto Visibility service is the automated-discovery leg ("high-fidelity data on active usage within our primary libraries"), complemented by developer reporting for "edge cases or shadow dependencies" and "cryptographic intent for new architectures". Three external dependencies the migration blocks on (canonical extension of patterns/third-party-dependency-quantum-assessment with the consumer-side angle): (1) community-vetted PQC standards — NIST FIPS 203 / 204 / 205 + HQC drafting; IETF RFC 8554 / 8391 + TLS drafts; Meta co-authored HQC (NIST-selected 2025), BIKE, and Classical McEliece; (2) PQC support in hardware — Meta working with HSM + CPU vendors; (3) production-level implementations — Meta is a Linux Foundation PQCA member and contributes to LibOQS including bug fixes. Algorithm selection: stick to NIST-recommended — ML-KEM-768 default / ML-KEM-512 exception; ML-DSA-65 default / ML-DSA-44 exception; SPHINCS+ and Falcon "considerably harder" to deploy than ML-DSA; HQC is the non-lattice diversity hedge "if weaknesses are discovered in ML-KEM or its modular lattices approach." PQC guardrails (canonical new patterns/crypto-api-guardrails pattern) prevent new vulnerable-code creation via three layers: (1) update internal crypto guidelines; (2) friction on key-generation tooling for vulnerable primitives; (3) build-system rules in Buck that "warn teams during code review" on RSA / ECDH use — the code-review-gating posture applied to crypto APIs. Hybrid over replacement: Meta prioritises layering PQ on top of classical "designed so that the combined system should remain at least as secure as the current standard" — cites SIKE's 2022 cryptanalytic invalidation as precedent forcing caution during the transition period. Four migration principles: Effectiveness, Timeliness, Performance, Cost Efficiency — PQC as constrained-optimisation, not security-at-all-costs. Strategy-paper voice — no production deployment numbers (no fleet percentages, no latency data, no timelines beyond "multi-year"); no specific products named beyond "our internal infrastructure" (though acknowledgements enumerate Transport Security, WhatsApp, Facebook/Messenger, Infrastructure, Reality Labs, Hardware, Payments teams). First canonical PQC-migration-strategy paper on the wiki — complements the inventory-side (sources/2024-12-02-meta-built-large-scale-cryptographic-monitoring|2024-12-02 monitoring), rollout-shape (GitHub / Cloudflare), and disclosure (Google) instances with the governance + program-management angle.
2026-04-16 — Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale (first-party Meta Engineering; Developer Tools) — Meta's Capacity Efficiency team describes a unified AI-agent platform (Capacity Efficiency Platform) for hyperscale performance engineering, built on two layers: MCP Tools (profiling queries · experiment results · configuration history · code search · documentation) + Skills (domain-expertise modules encoding senior-engineer reasoning patterns). Canonical wiki instance of patterns/mcp-tools-plus-skills-unified-platform — "both problems share the same structure... we didn't need two separate AI systems. We needed one platform that could serve both." The platform unifies offense (proactively finding + shipping optimizations) and defense (catching + mitigating regressions) with the same tool layer and different skills. On defense: the AI Regression Solver — a new component of FBDetect (Meta's in-house regression-detection tool, SOSP 2024, catching 0.005% regressions in noisy production, thousands weekly) — fully automates the path from detected regression to fix-forward PR sent to the root-cause author for review. Canonical patterns/ai-generated-fix-forward-pr instance replacing the rollback-vs-ignore binary with automated mitigation. Three-phase pipeline (shared with offense): gather context with tools → apply skill (e.g. "regressions from logging can be mitigated by increasing sampling") → create resolution. On offense: Opportunity Resolver — engineer views candidate optimization → requests AI-generated PR → agent gathers opportunity metadata + pattern docs + examples + files + validation criteria → applies skill (e.g. "memoizing a given function to reduce CPU usage") → produces fix with guardrails (syntax/style/right-issue) → surfaces in engineer's editor for one-click apply. Platform compounding: "within a year, the same foundation powered additional applications: conversational assistants for efficiency questions, capacity-planning agents, personalized opportunity recommendations, guided investigation workflows, and AI-assisted validation. Each new capability requires few to no new data integrations since they can just compose existing tools with new skills." Program-level impact: hundreds of megawatts of power recovered — "enough to power hundreds of thousands of American homes for a year"; investigation compression ~10 hours → ~30 minutes (~20×); AI handling the long tail of per-optimization work "engineers would never get to manually". The end goal is "a self-sustaining efficiency engine where AI handles the long tail." Third framing of capacity efficiency on the wiki alongside 2024-06-16 OpsPlanner (fleet-maintenance axis) and 2024-08-05 DCPerf (hardware-benchmarking axis) — this is the AI-assisted performance engineering axis. Fifth framing of patterns/specialized-agent-decomposition on the wiki — the skill-over-shared-tools composition model. First wiki ingest of FBDetect — Meta's regression detector previously unreferenced (SOSP 2024 paper at tangchq74.github.io/FBDetect-SOSP24.pdf). Sibling to 2026-04-06 Pre-Compute Engine post: both are 2026 Meta bets on markdown-level encoded knowledge as the model-agnostic substrate — compass-shape context files (offline-preloaded / descriptive) there, skills (runtime-invoked / prescriptive) here. Meta now has three complementary operational-AI systems on the wiki: RCA 2024-08-23 (ranker), Pre-Compute Engine 2026-04-06 (offline swarm), Capacity Efficiency Platform 2026-04-16 (tools+skills). Architecture-overview voice — no LLM/model identity, skill catalogue size, merge rate, revert rate, AI-vs-human offense/defense attribution, or guardrail decomposition disclosed.
2026-04-09 — Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases (first-party Meta Engineering; Developer Tools) — Multi-year retrospective on retiring Meta's years-divergent internal WebRTC fork across 50+ RTC use cases — Messenger video calling, Instagram calling, Cloud Gaming, Meta Quest VR casting. Two orthogonal solutions: Solution 1 — dual-stack shim layer: build two WebRTC copies (legacy + upstream) inside a single statically-linked binary gated by a runtime flavor enum, exposing one unified API to callers, so 50+ apps can A/B-test upstream releases against the legacy baseline per-call. Required resolving thousands of C++ One-Definition-Rule violations via automated namespace rewriting (webrtc:: → webrtc_latest:: / webrtc_legacy::), plus C++ using declarations for backward compatibility, AST-based code generation lifting shim velocity from 1/day → 3–4/day, and a Buck-build target-duplication trick for injected components with deep WebRTC-internal dependencies (shimming would mean "proxying WebRTC against itself"). Binary-size choice matters: duplicating the higher call-orchestration library would have cost ~38 MB uncompressed; shimming at the lowest WebRTC layer cost ~5 MB — 87% reduction — canonical wiki datum for layer-placement-as-size-decision (extends concepts/binary-size-bloat). Shim scale: > 10,000 lines of new code; hundreds of thousands modified across thousands of files, "no major issues." Solution 2 — feature branches in an external Git repo: Meta's monorepo lacks the branching surface to track one-branch-per-internal-patch-per-upstream-release; resolution is an external Git repo based directly on libwebrtc's own tree, with tag-anchored branch naming (base/7499 anchors Chromium M143 = libwebrtc tag 7499; debug-tools/7499, hw-av1-fixes/7499; merge forward to <feature>/7559; combine into release candidate r7559). Four named benefits: parallelizable, preserves Git history, LLM-friendly for future auto-conflict-resolution, submit-ready upstream. Canonical wiki instance of the new patterns/external-feature-branch-repo-for-monorepo-patches pattern and the underlying concepts/feature-branch-patch-management concept. Outcomes: launched webrtc/latest at M120, now at M145 ("living at head"); up to 10% CPU drop; up to 3% crash-rate improvement; 100–200 KB compressed binary reduction per-app from upstream's own efficiency wins; deprecated insecure libraries (usrsctp). The shim remains in production as the ongoing-upgrade A/B substrate. Future work: AI agents for build-health fixes + auto-merge-conflict resolution across feature branches — the architecture is explicitly chosen for LLM-automation friendliness. Ninth canonical instance of patterns/upstream-the-fix on the wiki — its dual-stack-A/B-harness variant, orthogonal to the FFmpeg case's "upstream the features, then retire the fork" shape (seventh instance). Canonical pairing with the new pattern patterns/fork-retirement-via-ab-test + patterns/shim-for-dual-stack-ab-testing. Wiki's first canonical RTC / real-time-communication infrastructure source distinct from MLow (audio codec, not library substrate). Architecture-retrospective voice — team-size not disclosed beyond "a small team of engineers"; cost / duration figures absent; per-branch conflict counts + specific script internals abstracted; Buck-duplication path named but not walked-through in detail. Third Meta OSS-posture position disclosed in 2026 (alongside FFmpeg seventh-instance consumer-side upstream and jemalloc eighth-instance stewardship-reset upstream-side) — together the three posts canonicalise Meta's full-spectrum OSS discipline on the wiki.
2026-04-06 — How Meta used AI to map tribal knowledge in large-scale data pipelines (first-party Meta Engineering; Developer Tools) — Meta's Data Platform team describes the AI Pre-Compute Engine that lifts AI-agent code-navigation coverage on a large config-as-code data pipeline (4 repos / 3 languages — Python + C++ + Hack / 4,100+ files / 6 synchronised subsystems per data-field change) from ~5% (5 files) to 100% (59 files). The mechanism: a single-session orchestration of 50+ specialised AI agents — 2 explorers → 11 module analysts → 2 writers → 10+ critics across 3 rounds → 4 fixers → 8 upgraders → 3 prompt testers (55+ queries × 5 personas) → 4 gap-fillers → 3 final critics — that reads every module, answers the five questions (what does this configure / common modification patterns / non-obvious build-failure patterns / cross-module deps / tribal knowledge in comments), and emits 59 compass-not-encyclopedia context files (25-35 lines · ~1,000 tokens each · 4 mandated sections: Quick Commands / Key Files / Non-Obvious Patterns / See Also). Entire knowledge layer < 0.1% of a modern model's context window. Canonical wiki instance of the new patterns/precomputed-agent-context-files pattern — extract tribal knowledge once, offline, via a multi-agent pass, consume it many times at request time. Quantitative outcomes (preliminary on 6 tasks): ~40% fewer tool calls and tokens per task; ~2 days → ~30 min for complex workflow guidance; 3.65 → 4.20 / 5.0 critic quality across 3 rounds; zero hallucinated file paths (enforced invariant); 55+ prompts at 100% pass; 50+ non-obvious patterns documented ("none of this had been written down before") — hidden intermediate naming conventions, append-only deprecated-enum rules, configuration-mode field-name mismatches. Cross-repo dependency index + data-flow maps turn "what depends on X?" from ~6,000 tokens of exploration into a ~200-token graph lookup (30× compression). Self-maintenance loop runs "every few weeks" (canonical wiki instance of patterns/self-maintaining-context-layer and the operational answer to context-file freshness — "context that goes stale causes more harm than no context"): validate file paths, detect coverage gaps, re-run critics, auto-fix stale references. Meta addresses the 2025 academic research that found AI-generated context files hurt agent success on Django / matplotlib: the pretraining-overlap asymmetry inverts on proprietary codebases — compass-shape + opt-in + quality-gated is the triple that avoids the academic pitfall. Multi-round critic quality gate canonical wiki instance (patterns/multi-round-critic-quality-gate) — distinct from runtime concepts/llm-as-judge by being applied pre-release on durable artifacts with a fixer stage between rounds. Orchestration layer routes engineers by natural language ("is the pipeline healthy?" → dashboard scanner + 85+ historical incident patterns from the Meta RCA 2024-08-23 lineage; "add a new data field" → multi-phase validation). Model-agnostic ("works with most leading models because the knowledge layer is model-agnostic") — markdown files, not a proprietary embedding; investment compounds across model upgrades. Fourth framing of patterns/specialized-agent-decomposition on the wiki alongside Storex (domain-based), Dash (sub-tool), DS-STAR (role-in-refinement-loop) — this is the offline-context-generation framing with nine pipeline-stage roles. Apply-it-yourself guidance (5 steps): identify tribal-knowledge gaps → use five-questions framework → compass-not-encyclopedia (25-35 lines) → quality gates via critic agents → automate freshness. Architecture-overview voice — no fleet-wide adoption numbers, total LLM-call count, wall-clock duration of the pre-compute pass, vendor / model version, or specific critic-score acceptance threshold disclosed. Results are one pipeline; Meta names expansion to additional pipelines in Future Work. First Meta ingest whose focus is offline context engineering for AI coding agents on the wiki — sibling to Glean (structured code-index facts) + diff-sketches as Meta's full-spectrum approach to making large proprietary code machine-navigable.
2026-04-02 — KernelEvolve: How Meta's Ranking Engineer Agent Optimizes AI Infrastructure (first-party Meta Engineering; Developer Tools) — Meta describes KernelEvolve, an agentic kernel-authoring system used by Meta's Ranking Engineer Agent (REA) to autonomously generate and optimize production-grade kernels across Meta's heterogeneous AI hardware fleet — NVIDIA GPUs + AMD GPUs + Meta's custom MTIA silicon + CPUs — in high-level DSLs (Triton / TLX, CuTe DSL, FlyDSL) and low-level backends (CUDA, HIP, MTIA C++). The core architectural move is to reframe kernel optimization as a search problem (canonical wiki instance of kernel-optimization-as-search): rather than one-shot LLM code generation, a tree-search engine drives hundreds of candidate kernels through a purpose-built long-running evaluation harness (compile → correctness-check vs PyTorch → profile hardware utilization → produce structured diagnostic reports) that feeds why-signal (memory-bound vs compute-bound vs occupancy-limited) back into an LLM synthesizer — canonical wiki instance of patterns/evaluation-harness-in-agent-loop. Six components: (1) LLM Synthesizer with dynamic context-aware prompts replacing per-task-per-platform templates; (2) Tree Search Engine with MCTS + evolutionary strategies and configurable memory operator per node (inherit-parent / compare-siblings / combine-both / clean-slate — the structural primitive beyond naive independent sampling) — canonical wiki instance of patterns/tree-search-over-llm-candidates; (3) Retrieval-Augmented Knowledge Base with three categories (correctness constraints / platform-agnostic optimization guidance / hardware-specific documentation) plus self-evolving skill library distilling successful strategies from past runs — canonical wiki instance of in-context RL and patterns/rag-over-hardware-documentation; (4) Automated Evaluation Framework composing TritonBench + PyTorch Profiler + NCU + Proton + MTIA Insight via a compiler-centric abstraction using job graphs with MLIR-level instrumentation; (5) Shared Data Foundation — every session contributes to a compounding store, "early adopters do the hard exploration, later sessions inherit near-optimal starting points"; (6) Agentic Reinforcement Learning where trajectories post-train smaller specialized models with kernel-performance reward — canonical wiki instance of patterns/agentic-rl-from-production-signal. The MTIA result is the structural headline: "Because these chips are proprietary, no public LLM has been trained on MTIA code. ... KernelEvolve solves this through systematic knowledge injection." Canonical wiki instance of concepts/hardware-proprietary-knowledge-injection — engineering-cost inversion: "When a new chip arrives, the engineering cost shifts from writing thousands of kernels by hand to curating a set of hardware documents and injecting them into the knowledge base." Motivating forcing function: the {hardware types × generations × model architectures × operators} combinatorial explosion — MTIA alone spans four chip generations in two years (MTIA 300–500) — canonical wiki statement of concepts/heterogeneous-ai-accelerator-fleet as engineering-scaling crisis. Production numbers over a torch.compile + vendor libraries baseline: >60% inference throughput improvement on the Andromeda Ads retrieval model on NVIDIA GPUs (baseline matters — KernelEvolve beats aggressively optimized code, not naive code); >25% training throughput improvement on an unnamed ads model on MTIA. Benchmark: 100% pass rate on KernelBench (Stanford's 250-problem suite, all three difficulty levels) + 480 PyTorch-ATen-operator configurations validated correct across three hardware platforms (160 operators × 3 platforms). Scale framing: "In Meta's production environment, KernelEvolve is optimizing code that serves trillions of daily inference requests." Development velocity: weeks of expert effort → hours of automated search + evaluation. Canonical wiki instance of concepts/agentic-kernel-synthesis. Positioning within REA: ML Exploration (2026-03-17 post, not ingested) discovers better models; KernelEvolve makes them production-ready. Future applications: "hybrid model search, compiler optimization, memory management, and system configuration". Paper at ISCA 2026 (arXiv:2512.23236) not ingested on this wiki. Twenty-seventh Meta ingest on the wiki and second Meta AI-infrastructure ingest from 2026 (MARM 2026-03-31 covered the model-serving axis; this post covers the kernel-codegen axis). Architecture-overview voice — deep internals deferred to the ISCA paper; fleet-size for KernelEvolve's own compute, per-session search budget + wall-clock, specific kernels driving the Andromeda 60% / MTIA 25% headlines, weight-shared-LLM-synthesizer vs per-hardware specialized models in the RL pipeline, and the human-review gate before production deployment are not disclosed. First wiki instance of agentic-kernel-synthesis at hyperscale; opens the AI-infrastructure-agents axis on Meta.
2026-03-31 — Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads (first-party Meta Engineering; ML Applications) — Meta Ads's architecture post on Meta Adaptive Ranking Model, the serving stack that scales Ads ranking to LLM-scale model complexity (O(10 GFLOPs) per token — "equivalent to the [...] compute used by top-tier LLMs") under sub-second latency and O(100 ms) bounded per-request. The frame is the inference trilemma — model complexity vs. latency vs. cost, where scaling any one axis naively breaks the other two — and Meta's resolution is three pillars: (1) request-centric inference — shift the unit of inference from (user, ad-candidate) pairs to (request); heavy user-context computed once per request via request-oriented computation sharing + in-kernel broadcast (transforms scaling "from linear to sub-linear"); long user sequences handled via request-oriented sequence scaling + centralised KV store of user logs joined with training data on the fly; plus the Wukong Turbo runtime evolution of Meta Ads's 2024 Wukong architecture adding no-bias numerical stability, small-parameter FSDP→DDP delegation, and sparsity-based linear-layer simplification; (2) model-system co-design — selective FP8 (post-training quantisation applied only to micro-benchmark-verified precision-tolerant layers, canonical selective mixed-precision instance) + operator fusion for shared inputs + Grouped GEMM + horizontal fusion consolidating "thousands of small operations" into compute-dense kernels, driven by hardware-aware model architecture — outcome: 35% MFU across multiple hardware types (canonical recsys-serving instance extending MFU beyond the existing Voyage AI embedding-inference datum); (3) multi-card sharded embedding serving — splits embedding tables exceeding single-GPU memory across an optimised hardware cluster, achieving performance parity with single-card setups and unblocking O(1T) parameter scale; combined with unified embeddings (multiple features share one table) and sparsity-aware + pruning allocation to manage the underlying hash-collision embedding tradeoff. Also disclosed: feature preprocessing offloaded from client CPU to remote GPU hosts with GPU-native kernels reducing Top-K from O(N log N) to O(N); accelerated model loading via multi-stream downloading + remote caching loads trillion-parameter models in under 10 minutes; auto-scaling on streaming multiprocessor utilisation. Launch: Instagram Q4 2025 — +3% ad conversions and +5% CTR for targeted users (see systems/meta-instagram). Future roadmap: ultra-low precision quantisation beyond FP8, agentic optimisation frameworks auto-adapting kernel performance, near-instantaneous model freshness via incremental in-place weight updates. Architecture-overview voice — no absolute QPS, fleet size, GPU count, inference p-tail, vendor mix (H100/B200/MI300X/MTIA undisclosed), prior-system baseline, FP8-selection benchmark metric, or cross-shard lookup latency disclosed; "+3%/+5%" is framed for targeted-user populations, not overall fleet. First LLM-scale recsys-serving ingest on the wiki — complements the existing recsys ingests (Meta Friend Bubbles recommendation architecture 2026-03-18; Meta RCA 2024-08-23 LLM retrieve-then-rank; Instacart generative recommendations) by focusing on the serving-stack / inference / GPU-systems axis rather than the model-architecture or feedback-loop axis. Second Wukong-family ingest after the 2024 Wukong paper (arXiv:2403.02545, not ingested); extends the Meta GPU-serving corpus (systems/grand-teton-era 2024 training posts) from training into LLM-scale inference serving.
2026-03-18 — Friend Bubbles: Enhancing Social Discovery on Facebook Reels (first-party Meta Engineering; ML Applications) — Meta's architecture overview of the Friend Bubbles recommendation-system component in Facebook Reels — the avatar bubbles that show which of a viewer's friends interacted with a Reel, with a tap opening a one-on-one conversation. Three architectural layers: (1) viewer-friend closeness — two complementary ML models, a survey-trained closeness model running weekly inference over "trillions of person-to-person connections across Facebook friends" (canonical wiki instance of patterns/survey-trained-closeness-model; binary close-vs-not-close prediction asked to randomly-sampled users directly, proxy questions like "how often two people communicate" as additional signal; features: mutual friends, connection strength, interaction patterns, user-provided location, friend count, posts shared) plus a context-specific closeness model trained on on-platform bubble-click signals; (2) retrieval → ranking funnel modifications — friend-interacted candidates are explicitly retrieved ("expand the top of the funnel to ensure sufficient candidate volume for downstream ranking stages" — canonical wiki statement of retriever-recall-is-the-ceiling in a recommendation system) and new features + new tasks are added to early-stage and late-stage MTML ranking models, with the objective augmented by a conditional-probability term P(video engagement | bubble impression) balanced by tunable weights against existing watch/like/comment objectives. A continuous feedback loop re-trains on bubble-interaction data; (3) client-side integration with Reels performance constraints — bubble metadata retrieval pinned to the existing video prefetch window (canonical wiki instance of prefetch-window metadata co-attending: bubble data arrives alongside video content, reusing the already optimised fetch path, cache, and CPU budget — "eliminating mid-playback UI updates and redraws"), animation conditional on interaction state + device class (disabled during active scroll, disabled entirely on low-end devices — canonical wiki instance of patterns/conditional-animation-for-scroll-performance, extending Meta's low-end-device-inclusion posture from the MLow audio codec to the client-UI-rendering axis), and conservative prevalence gating ("bubbles show up only when the relationship signal ... is strong" — prevalence is not the optimisation target; engagement quality is). Qualitative outcomes: "higher interest scores and more positive sentiment ratings", "more time actively watching", "growth concentrated in longer sessions"; expressive reactions (love/laughter) drive stronger downstream engagement than likes; engagement scales with bubble count per video. Future work named: expansion to additional surfaces + inventory, cold-start improvements for sparse friend graphs, refined ranking + feedback signals. First recommendation-system ingest from Meta on the wiki — the 2024-08-23 Meta RCA post introduced the patterns/retrieve-then-rank-llm pattern in an RCA context; Friend Bubbles is the recommendation-system canonical instance of the same funnel primitive in an MTML (not LLM) stage-2 ranker. Architecture-preview voice — no fleet size, QPS, latency, A/B lift, MTML topology, feature-vector dimension, prefetch-window duration, or low-end-device threshold disclosed. Closeness-model composition (survey + context-specific) is named but not specified. Cold-start acknowledged as open problem.
2026-03-02 — Investing in Infrastructure: Meta's Renewed Commitment to jemalloc (first-party Meta Engineering; Data Infrastructure; 515 HN points) — Meta's short stewardship-reset statement for jemalloc, the high-performance memory allocator it maintains upstream. Two substantive disclosures: (1) Meta acknowledges that "in recent years, there has been a gradual shift away from the core engineering principles that have long guided jemalloc's development" and the resulting technical debt slowed progress; (2) Meta has met with the community including founder Jason Evans, unarchived the original GitHub repository (archived in 2024 — the visible signal of the drift), and committed to a four-axis roadmap: technical debt reduction, HPA improvements for transparent huge pages (THP) CPU efficiency, memory efficiency (packing / caching / purging), and AArch64 (ARM64) out-of-box performance. Framing: foundational software components "alongside the Linux kernel and the compilers" need "the highest rigor" and "strong self-discipline as an organization to resist [short-term-benefit] temptation and adhere to the core engineering principles." Canonical wiki instance of the new patterns/stewardship-reset-for-foundational-oss pattern — the upstream-steward-itself sibling of patterns/upstream-the-fix (which is the downstream-consumer case). The consumer pattern works only when the steward is functioning; this pattern is what the steward does when it recognises it has drifted. Announcement voice — no shipping dates, no scope-of-refactor detail, no perf baselines or targets disclosed; reset is evaluated over years of shipped-roadmap work, not at announcement time. "We know that trust is earned through action." Cross-source framing: promotes the existing systems/jemalloc stub (previously only seen via the 2025-03-07 Strobelight memory-profiler-backend usage) to a first-class Meta foundational-software page. Extends patterns/upstream-the-fix with its eighth canonical instance as the upstream-steward-itself variant.
2026-03-09 — FFmpeg at Meta: Media Processing at Scale (first-party Meta Engineering; Video Engineering; 281 HN points) — Meta's Video Engineering team describes how Meta fully deprecated its internal FFmpeg fork for all DASH VOD + livestreaming pipelines by co-developing two load-bearing features upstream with FFlabs and VideoLAN over multiple releases: (1) threaded multi-lane transcoding — "the most complex refactoring of FFmpeg in decades" spanning FFmpeg 6.0 → 8.0 — that produces multiple DASH encodings from a single decode with per-frame encoder parallelism (see patterns/deduplicate-decode-across-encoder-lanes); (2) in-loop decoding (FFmpeg 7.0+) inserting a decoder after each encoder so reference quality metrics (PSNR/SSIM/VMAF — concepts/visual-quality-metric) can be computed in real time during livestreams (see concepts/in-loop-quality-metrics + patterns/in-loop-decoder-for-realtime-quality-metrics). Scale frame: Meta runs ffmpeg + ffprobe tens of billions of times per day; > 1 billion video uploads per day, each requiring multiple FFmpeg executions — so per-process efficiency wins compound to fleet-level savings, and carrying an internal fork is a long-term liability worth spending years upstream to remove. The opposite decision in the same post: Meta keeps its MSVP (Meta Scalable Video Processor) ASIC FFmpeg patches internal because MSVP hardware is Meta-only and external FFmpeg developers cannot validate changes without it; Meta accepts the reverse-rebase cost against each new upstream release. This introduces the complement pattern patterns/keep-infrastructure-specific-patches-internal to the wiki alongside patterns/upstream-the-fix — together they form the decision framework the post makes explicit. MSVP integrates into FFmpeg via the same hardware-accelerated video codec API that exposes NVIDIA NVDEC/NVENC, AMD UVD, and Intel Quick Sync Video. Seventh canonical instance of patterns/upstream-the-fix on the wiki — and the highest-stakes outcome to date: not a single targeted PR but a multi-year / multi-release collaboration culminating in fork retirement. First video-transcoding-infrastructure source on the wiki — opens a new technical domain distinct from the prior MLow audio codec (RTC audio) and the storage/GenAI/privacy/source-control corpus. No fleet CPU-savings numbers, codec mix (H.264/H.265/AV1), or ladder depth disclosed; MSVP's own architecture is linked to the separate 2023 MSVP post (not ingested).
2026-01-28 — Rust at Scale: An Added Layer of Security for WhatsApp (first-party Meta Engineering; Security; 266 HN points) — WhatsApp security-team doctrine post disclosing the Rust rewrite of wamedia — the cross-platform media-consistency library that processes untrusted media automatically on download — and describing the four-family format-check ensemble Kaleidoscope that runs on top of it. Headline datum: "We replaced 160,000 lines of C++ (excluding tests) with 90,000 lines of Rust (including tests). The Rust version showed performance and runtime memory usage advantages over the C++." Scale claim: "the largest ever deployment of Rust code to a diverse set of end-user platforms and products that we are aware of" — billions of devices, WhatsApp + Messenger + Instagram, shipping monthly across Android / iOS / Mac / Web / Wearables. Forcing function (2015 Stagefright): "The bug lay in the processing of media files by operating system-provided libraries, so WhatsApp and other applications could not patch the underlying vulnerability" — canonical concepts/os-library-vulnerability-ungovernable + concepts/patch-lag instance. Meta's response was the format-aware malware check before OS handoff pattern — modify wamedia to detect non-conformant MP4s + refuse to forward. The 2026 Rust rewrite is the follow-on investment to ensure the checker itself is memory-safe — "because media checks run automatically on download and process untrusted inputs, we identified early on that wamedia was a prime candidate for using a memory safe language." Canonical patterns/memory-safe-language-for-untrusted-input instance. Rewrite methodology: parallel-rewrite-in-parallel, not incremental — "we developed the Rust version of wamedia in parallel with the original C++ version. We used differential fuzzing and extensive integration and unit tests to ensure compatibility between the two implementations." Canonical patterns/parallel-rewrite-with-differential-testing + concepts/differential-fuzzing instance. Kaleidoscope's four check families: (1) non-conformant-structure checks to defeat parser-differential exploits on downstream OS libraries; (2) risk-indicator checks inside higher-risk types (PDFs: embedded files + scripting); (3) file-type spoofing detection; (4) dangerous-type uniform flagging (executables/apps). Meta's three-pillar strategy (verbatim): (1) design the product to minimize unnecessary attack surface; (2) invest in security assurance for remaining C/C++ code; (3) "default the choice of memory safe languages, and not C and C++, for new code." Canonical concepts/attack-surface-minimization + concepts/defense-in-depth instance on the client-side / media-processing axis. Two disclosed hurdles: initial binary-size increase from the Rust stdlib (concepts/binary-size-bloat); build-system support for diverse platforms (concepts/cross-platform-client-library tax) — Meta calls this "a long-term bet to build that support." Meta's adjacent security posture: default E2EE for 3B+ daily users, E2E-encrypted backups, key transparency, calling protections, published CVEs even without evidence of exploitation, external audits (NCC Group's public assessment), fuzzing, static analysis, supply-chain management, automated attack-surface analysis, and the new WhatsApp Research Proxy — Meta's bug-bounty research-proxy primitive introduced via the 15th-anniversary Bug Bounty expansion. C/C++ remaining-code hardening named: CFI, hardened memory allocators, safer buffer APIs, specialised security training, automated security analysis, strict SLAs. Positioned within Meta's Rust adoption: "Security teams at WhatsApp and Meta are highlighting opportunities for high impact adoption of Rust to interested teams, and we anticipate accelerating adoption of Rust over the coming years." Thirteenth Meta first-party ingest and first canonical client-side Rust-at-scale source on the wiki, complementing the server-side Rust corpus (Aurora DSQL, Dropbox Nucleus, Cloudflare FL2).
2025-10-06 — Introducing OpenZL: An Open Source Format-Aware Compression Framework (first-party Meta Engineering, 434 HN points) — Meta open-sources OpenZL, a new lossless format-aware compression framework that targets structured data (tabular, columnar, numeric arrays, timeseries, ML tensors, database pages) and claims ratios "comparable to specialized compressors" while preserving a single universal decoder binary across every file it produces. Architectural response to the decade of Zstandard experience (2016-2025): generic compressors leave ratio on the table for structured data, but hand-rolled per-format compressors multiply the surface to ship + audit + patch + trust. OpenZL's resolution: push format-awareness into an input parameter + learned Plan that resolves to a decode recipe embedded in the frame. Headline numbers on Silesia sao (M1/clang-17): zstd -3 = 5.5 MB / x1.31 / 220 MB/s comp / 850 MB/s decomp; xz -9 = 4.4 MB / x1.64 / 3.5 MB/s comp / 45 MB/s decomp; OpenZL = 3.5 MB / x2.06 / 340 MB/s comp / 1200 MB/s decomp — both higher-ratio than xz and faster in both directions. Six load-bearing architectural primitives: (1) structure as explicit input via SDDL (declarative) or a registered parser function, rather than guessed by the compressor; (2) trained Plan emitted by the OpenZL trainer from a budgeted search over transform choices + parameters, with an internal cluster finder (groups like-behaving fields) + graph explorer (scores candidate subgraphs); can emit a speed/ratio Pareto-set or target-under-constraint; (3) reversible transform sequence before entropy coding — the sao walkthrough: split header → AoS → SoA → per-field transform pick (delta for mostly-sorted X-axis SRA0, transpose for bounded Y-axis SDEC0, tokenize for low-cardinality IS/MAG/XRPM/XDPM fields, each routed to its own subgraph); "The main work is to group data into homogeneous streams. After that, one can count on openzl to take care of the rest."; (4) per-frame resolution into a Resolved Graph recorded in the frame, enabling the universal decoder property — "even when the compression configuration changes, the decoder does not" — whose four enumerated benefits are single audit surface, fleet-wide improvements (SIMD / bounds / scheduling benefit every historic frame), operational clarity (same CLI + metrics + dashboards), and continuous training; (5) runtime control points — per-frame branch points reading lightweight statistics (string repetition, run-length, histogram skew, delta variance) and picking a subgraph "with zero complexity added to the decoder" because the chosen branch is recorded; (6) Managed Compression as the operational runtime — Meta's existing service that automated zstd-dictionary compression (2018) extends to OpenZL Plans: register use case → monitor → sample → re-train → roll out "like any other config change". Fallback safety net: when no structure is available (pure text — enwik, dickens) OpenZL falls back to zstd and delivers essentially zstd performance; CSV is a structural ceiling (~64 MB/s parse cap vs zstd's ~1 GB/s). Four Pareto-curve dataset categories evaluated against generic compressors: Silesia sao (AoS), ERA5 Flux (columnar 64-bit numeric), Binance + TLC Green Trip (uncompressed Parquet — OpenZL parses Parquet + learns schema), PPMF Unit (CSV — parse-bound). Open-source at github.com/facebook/openzl · whitepaper arXiv:2510.03203. First compression-framework source on the wiki — prior compression coverage was Meta MLow audio codec, MongoDB WiredTiger page compression, Cloudflare shared-dictionary HTTP compression (all specialized); OpenZL is the first general framework for format-aware compression to land on the wiki.
2025-03-04 — A case for QLC SSDs in the data center (first-party Meta Engineering; Data Center Engineering) — Meta makes the architectural case for QLC NAND flash as a new middle storage tier between HDD and TLC flash — the first first-party disclosure of Meta's flash-media strategy on the wiki. Substance across four axes: (1) HDD BW/TB is dropping as density climbs without IOPS improvements — "bandwidth per TB for HDDs has been dropping" — stranding hot workloads on the cold tier. Canonical BW/TB framing on the wiki, extending concepts/hard-drive-physics's flat-IOPS observation to the bandwidth axis. (2) QLC's historical blockers closed — 2 Tb NAND dies + 32-die stacks mainstream; endurance matched via workload-matching (read-BW-intensive, low-write-BW targets); 6× density target over densest TLC server. Canonical concepts/storage-media-tiering + patterns/middle-tier-storage-media instance. (3) Form-factor argument: U.2-15mm wins (scales to 512 TB; accepts both standard NVMe QLC SSDs and Pure Storage DFMs at 600 TB); E1.S rejected (too small for QLC NAND-package count); E3 rejected (4-variant fragmentation). (4) Storage-software adaptation: canonical ublk + io_uring + userspace FTL stack for DFMs; standard NVMe path uses io_uring against NVMe block device directly. The 4×+ read-vs-write throughput asymmetry of QLC forces rate controllers + I/O schedulers so latency-sensitive reads don't queue behind writes. Co-design extension: Meta × Pure Storage is a new partner in the OCP co-design lineage (joining Microsoft, NVIDIA, AMD). Honest caveat: Meta states QLC "is not yet price competitive enough for a broader deployment" — the deployment is justified today by power efficiency + density, not TCO parity.
2025-04-30 — Building Private Processing for AI tools on WhatsApp (first-party Meta Engineering; Security) — Meta previews Private Processing, the confidential-computing infrastructure WhatsApp will use to run AI features (message summarisation + writing suggestions) over end-to-end-encrypted conversations without Meta, WhatsApp, or any third party seeing the plaintext. Architecture preview — launch projected "in the coming weeks" from 2025-04-30; detailed security engineering design paper promised at launch. The core architectural move: build LLM inference inside a TEE (CVM + Confidential-Compute-mode GPUs) whose binary digest is attested against a third-party-operated transparency ledger before the client releases its ephemeral session key — turning "we promise we run safe code" into a mechanism the client can verify. Canonical wiki instance of TEE-for-private-AI-inference — a structural alternative to both on-device inference (too constrained) and normal server-side inference (breaks E2EE). Five foundational requirements stacked: (1) confidential processing (no system outside Private Processing sees data in transit or in use); (2) enforceable guarantees (tamper attempts fail closed or become publicly discoverable); (3) verifiable transparency (published CVM binary + ledger + open-source + expanded Bug Bounty); (4) non-targetability (attacker cannot target a specific user without attacking the whole system); (5) stateless + forward-secure (no retained access after session). Wire-session chain (phases 1-6): anonymous credentials via Meta's Anonymous Credential Service (ACS) (open-sourced December 2022, now load-bearing) → HPKE keys from a third-party CDN → OHTTP through a third-party relay (strips client IP before Meta's gateway sees the request — canonical wiki instance of patterns/third-party-ohttp-relay-for-unlinkability) → RA-TLS session with attestation-verified-against-ledger before session-key release (canonical wiki instance of patterns/attestation-before-session-key-release) → ephemeral E2EE device↔CVM request → response returned under the same ephemeral key. CVM-to-CVM traffic reuses the same RA-TLS primitive — inter-CVM trust boundary is also attested. Threat modeling is the structural spine of the post: three named assets (messages + CVM TCB + keys), three actor classes (insiders, third-party / supply-chain vendors, malicious end users), three named scenarios (product-surface exploitation including prompt injection; CVM observability-side-channel extraction; insider physical/remote tampering at boot or runtime). Named operational controls: containerised hardened binaries inside the CVM + log-filtering egress (observability vs confidentiality tension explicitly named) + restricted-environment multi-party-reviewed CVM build + encrypted DRAM + physical datacentre controls + enhanced host monitoring outside the CVM + remote shell access prohibited. Canonical wiki instance of publish-binary-digest-ledger: the transparency-side companion to attestation — every acceptable CVM binary digest on an append-only third-party ledger plus the CVM image binary published, so external researchers can detect "attested X was never on the published list". Data minimisation operationalised at the request-API layer: the summarisation call carries only the messages the user directed AI to summarise — less content in means less blast-radius for any bug anywhere in the stack. User-control composition: Private Processing is opt-in at request granularity + users can disable AI features per-chat via the separate Advanced Chat Privacy feature. Canonical wiki instance of defence-in-depth for private-AI-inference on top of a TEE substrate — each requirement closes a distinct adversary move (runtime vs boot-time vs targeted-host vs post-session), each with different trust roots (CPU vendor / third-party ledger operator / third-party relay operator / external researchers) so compromise of any single party does not break the guarantee. Caveats: architecture-preview voice — no production numbers (latency, throughput, fleet size); TEE vendor not named (AMD SEV-SNP / Intel TDX / Arm CCA all plausible); confidential-GPU vendor/mode not named (NVIDIA Hopper CC is the obvious candidate); third-party relay + CDN + ledger operators not named; attestation protocol details not specified beyond "RA-TLS"; prompt-injection-specific defences flagged but not detailed; open-source scope gestured but not manifest.
2025-03-07 — Strobelight: A profiling service built on open source technology (first-party Meta Engineering; Production Engineering; 177 HN points) — Meta Engineering introduces Strobelight, Meta's fleet-wide profiling orchestrator — not a single profiler but a scheduler + coordinator of 42+ different profilers (many built on eBPF) running on every Meta production host. Four load-bearing architectural claims: (1) profiling is orchestrated, not monolithic — canonical wiki instance of the new patterns/profiler-orchestrator pattern; three execution modes (on-demand CLI/UI; continuous curated-profilers-on-every-host "flight recorder" posture per patterns/default-continuous-profiling; triggered via Configerator config); safety owned centrally — PMU counter coordination, profiler queue, DB-write rate limits; (2) engineers ship new profilers in hours, not weeks, via bpftrace scripts — canonical wiki instance of patterns/ad-hoc-bpftrace-profiler + concepts/ad-hoc-profiler; (3) dynamic sampling rate tuning + per-sample weights — config expresses desired-samples-per-hour-per-service (e.g. 40,000), Strobelight tunes run-probability daily to hit the target, and sample weights are emitted alongside every record so cross-host + cross-service aggregation is mathematically valid — the mechanism that makes fleet-scale "horizontal wins" tractable; (4) delayed symbolization via a central service — raw addresses + frame-pointer-unwound stacks (on frame pointers enabled on all Meta user-space binaries, paying the 1-2% register-pressure tax fleet-wide) are shipped off-host to a symbolization service pre-indexing DWARF + ELF + gsym + blazesym for every Meta production binary; canonical wiki instance of patterns/delayed-symbolization-service. Capacity impact: the continuous LBR profiler feeds the FDO pipeline — CSSPGO compile-time + BOLT post-compile — delivering "up to 20% reduction in CPU cycles" = 10-20% fewer servers on Meta's top 200 largest services; the "Biggest Ampersand" one-character & fix (added after auto on a hot-path std::vector copy in a large Meta ads service, spotted by Scuba-filtering on symbolized std::vector call-sites) saved "an estimated 15,000 servers per year". Output substrate: Scuba is the primary query/UI — flame-graphs, time-series, distributions, free-form query — producing visualisable profile data within seconds. Output surface extensions: Stack Schemas DSL (concepts/stack-tag-enrichment, inspired by Microsoft's stack tags) tags/filters frames at query time, and Strobemeta attaches runtime metadata (request IDs, endpoint names, latency buckets) to samples at sample time via thread-local storage (concepts/runtime-metadata-attach) — together they make request-context-aware profiling possible without post-hoc trace joins. Tracery — client-side-columnar-DB JavaScript trace UI — consumed by the Crochet profiler combining request spans + CPU-cycles stacks + off-CPU data on a single timeline. Open-source status: "currently working on open-sourcing Strobelight's profilers and libraries" at github.com/facebookincubator/strobelight. Several supporting libraries already open: systems/bpftrace, systems/blazesym, jemalloc (memory profiler backend), BOLT. Opens the Production Engineering axis on the Meta page — first canonical fleet-profiling ingest. Third canonical production eBPF use-case family on the wiki: profiling (this post) joins security (systems/datadog-workload-protection / FIM) and networking (Cloudflare DDoS, Lambda Geneve/NAT). Second canonical Scuba upstream producer alongside FBCrypto (two different upstream shapes — aggregated counts + symbolized profile samples — feeding the same warm-query surface). Canonical wiki fourth observability-pillar statement — profiling alongside metrics / logs / traces — sister-framed to Grafana Pyroscope 2.0's open-source continuous-profiling architecture. Canonical wiki "profiling pays for itself at hyperscale" datum — the LBR → FDO → BOLT + CSSPGO pipeline on top-200 services is the economic engine that justifies Strobelight as a platform. Announcement-voice feature-overview; caveats: qualitative dynamic-sampling math ("some very simple math"), no symbolization-service QPS/latency, no Scuba row-count/retention numbers, no 42-profiler category breakdown, no fleet-overhead numbers.
2024-12-19 — Indexing code at scale with Glean (first-party Meta Engineering, 132 HN points) — the canonical architecture overview of Glean, Meta's open-source code-indexing system (open-sourced August 2021). Canonical wiki instance of centralized ahead-of-time indexing — shared fleet indexes the monorepo, databases replicated across a widely-distributed query service, clients ask questions over the network instead of downloading the index. Four load-bearing architectural bets: (1) generality over use-case fit — Glean "doesn't decide for you what data you can store", each language owns its schema, non-code data is supported; RocksDB-backed; paid off with post-launch extensions to dead-code detection, build-graph analysis, API-migration tracking, test coverage + selection, automated data removal, and RAG in AI coding assistants; (2) Angle as declarative logic-based query language — predicates ≈ SQL tables, facts ≈ rows, derived predicates = SQL-view analogue — the mechanism that implements language-neutral schema abstraction (keep language-specific facts underneath, define cross-language views in the schema itself); published latencies ~1 ms for name+namespace lookup, few-ms first-results on inheritance-chain queries; (3) incremental indexing targeting O(changes), realistic floor O(fanout) — the C++ header-modify case names fanout as the transitive #include-ers closure, computed as a fixpoint Angle query; implemented via stacked immutable databases — non-destructive layered adds/hides, single-view semantics per revision, multi-revision concurrent queries, delta-sized storage; (4) diff sketches — indexer runs on each diff to produce a machine-readable change summary (classes/methods/fields/calls added/removed/modified); fans out to static analysis, semantic lint, rich notifications, commit-level semantic search (stack-trace → recent-touching-commits), and review-time go-to-definition in Phabricator across C++/Python/PHP/JavaScript/Rust/Erlang/Thrift/Haskell — canonical wiki instance of diff-based static analysis. Companion subsystem Glass (open-source) = the symbol server on top of Glean; one API call documentSymbols(repo, path, revision) renders outline+nav for the code browser (embedded Monaco); owns per-language symbol IDs that stay stable under code motion, so doc URLs survive refactors. IDE augmentation hybrid — Meta's VS Code C++ extension serves go-to-def / find-references / hovercards from Glean at startup, blends with clangd as files load (C++ was the launch target "due to the long compile times"). Contrasted against LSIF — Glean is deliberately more general than the LSP-ecosystem incumbent. No fleet-size, QPS, or index-size numbers disclosed — scale claims qualitative; latency datapoints illustrative not load-tested; stacked-database deep-dive deferred to glean.software/blog/incremental/ (not ingested).
2024-12-18 — Translating 10M lines of Java to Kotlin (first-party Meta Engineering; Android team; 227 HN points) — Meta's retrospective on the multi-year effort to translate the entire Android codebase — "roughly ten million lines of perfectly good Java code" across ~100,000 files — from Java to Kotlin, driven end-to-end by an internal pipeline called Kotlinator. "Well past the halfway point" with >40,000 conversions shipped at publication. Kotlin-first Android at Meta since 2020. No LLM / no agent — deterministic AST transformations + compiler-error heuristics, the rules-based-tooling counterexample at ~50× the code scale of the AI-driven migrations (Cloudflare vinext, Instacart Jetpack Compose). Pipeline architecture — six phases: (1) deep build for symbol resolution; (2) preprocessing (~50 Java→Java steps on top of Meta's custom metaprogramming tool Editus: nullability, J2K workarounds, custom-DI accommodations); (3) headless J2K — the JetBrains IntelliJ converter driven from a server process by an IntelliJ plugin extending ApplicationStarter + calling JavaToKotlinConverter directly; canonical wiki instance of concepts/headless-ide-inspection; (4) postprocessing (~150 Kotlin→Kotlin steps for Android-specific changes, nullability, idiomaticity); (5) linters with autofixes; (6) build-error-based fixes — canonical wiki instance of concepts/build-error-driven-fix-loop parsing the Kotlin compiler's error messages and applying targeted edits (missing imports, inserted !!). >200 custom steps across preprocessing + postprocessing + build-error-fix phases. Per-file remote conversion takes ~30 minutes versus the couple-minutes local-IDE-button baseline — deliberate trade: "overall conversion time grew longer... but time spent by the developers decreased substantially" (developer-seconds-per-file → 0). Metaprogramming-on-broken-code substrate: Meta's Editus tool is built on JetBrains' PSI libraries for Java + Kotlin; canonical wiki statement concepts/metaprogramming-on-broken-code — "Unlike most metaprogramming tools, it is very much not a compiler plugin, so it can analyze broken code across both languages, and does so very quickly" across "several thousand unbuildable Java and Kotlin files". Delivery channel: internal cron system producing a daily batch of diffs against developer-defined selection criteria, auto-routing reviewers, running tests + validations, shipping on approval — canonical wiki instance of patterns/daily-diff-cron-for-automated-migration; on-demand web UI runs the same pipeline. Null safety as the load-bearing invariant, not syntax: canonical wiki concept concepts/interlanguage-null-safety — Kotlin inserts implicit checkNotNull at the Java/Kotlin boundary in bytecode, making nullability operational (runtime-failed-fast) rather than static-only as with Nullsafe / NullAway ("static analysis is only 100% effective for 100% code coverage, which is simply not viable"). New canonical concept concepts/kotlin-platform-type for Kotlin's String! type assigned to unannotated Java values. Meta runs >12 complementary nullability codemods plus a new Java compiler plugin that collects actual null flows at Java/Kotlin interop boundaries — canonical wiki instance of concepts/runtime-nullability-telemetry — feeding codemods that backfill @Nullable "even in Java files that we may never translate." Bot-safer-than-human principle: canonical wiki concept concepts/bot-safer-than-human — Meta "automates [some] fixes... even though they aren't strictly necessary, because we want to minimize the temptation for human (i.e., error-prone) intervention." Named example: condensing long chains of null checks — correctness-equivalent but "less susceptible to a well-meaning developer accidentally dropping a negation." Upstream collaboration unblocked the remaining long tail: in early 2024 JetBrains began adapting J2K to the K2 compiler, and Meta co-invested in J2K client hooks + fixes for years-old bugs ("disappearing override keywords") — canonical wiki pattern patterns/upstream-collaboration-as-migration-unblock, the mutual-extension-points variant of patterns/upstream-the-fix, letting Meta's Android-specific steps live inside J2K (better symbol resolution for third-party libraries) and eventually donate to the open-source community via kotlin_ast_tools (github.com/fbsamples/kotlin_ast_tools). Architectural shape canonicalised: patterns/pipeline-with-open-ended-passes — ordered phases where any phase accumulates arbitrary custom steps over time; no single transform hits the long tail; the pipeline is the long-tail solution ("the vast majority of conversion diffs produced by the vanilla J2K would not build"). Wrapping pattern: patterns/automated-migration-at-monorepo-scale composing headless-IDE-inspection + pipeline-with-open-ended-passes + metaprogramming-on-broken-code + build-error-driven-fix-loop + daily-diff-cron + bot-safer-than-human + upstream-collaboration. Mixed-language monorepo tax introduced as new framing: "Compiling Kotlin is slower than compiling Java, but compiling both together is the slowest of all" — the load-bearing cost argument for finishing the migration, not just writing-new-code-in-Kotlin. Extends concepts/monorepo with the mixed-language-compile-speed-tax axis. Canonical wiki material from the post: 9 new pages (5 concepts + 4 patterns) in this completion pass on top of the 7 system + 2 concept pages from an earlier partial ingest; 1 new pattern-page extension (closed-feedback-loop-ai-features with Kotlinator as its third Meta-domain instance per the 2026-04-16 Capacity Efficiency and 2026-03-18 Friend Bubbles log entries' cross-references). Caveats: no per-phase cost breakdown; no false-positive / revert / NPE / regression rate after ship; no build-speed numbers; "well past the halfway point" is directional with no explicit ratio or projected completion date; only a subset of the 200+ custom steps is open-sourced. Twenty-sixth Meta ingest on the wiki, sibling to the broader Meta developer-tools + source-control axis: Sapling 2024-09-10 (VCS), Glean 2024-12-19 (code indexing + diff sketches), FFmpeg fork retirement 2026-03-09, WebRTC fork retirement 2026-04-09, tribal knowledge pre-compute 2026-04-06, Capacity Efficiency Platform 2026-04-16. Kotlinator is the deterministic-migration-at-hyperscale anchor in that lineage.
2024-12-02 — How Meta built large-scale cryptographic monitoring (first-party Meta Engineering; CryptoEng team) — Meta's telemetry architecture underneath FBCrypto, Meta's managed cryptographic library. Logs every cryptographic operation fleet-wide — no sampling — so Meta can (a) detect key overuse + rotate proactively (concepts/key-overuse-detection), (b) inventory call-sites for deprecated-primitive + PQC migration scoping, (c) use call-volume / success-rate as a client-health proxy during rollouts. Scale datum: "roughly 0.05 % of CPU cycles at Meta are spent on X25519 key exchange" — forces the architecture. Core primitive: an aggregating buffered logger inside FBCrypto on folly::ConcurrentHashMap — increment a (key-name, method, algorithm, …) tuple's counter on every operation, periodic background flush through Scribe to Scuba (warm) + Hive (cold). Three disclosed optimisations: (1) per-host first-flush jitter smooths cohort-synchronised write spikes to Scribe from hosts that restart together; (2) derived-key aggregation counts KDF-derived child-key operations against the parent keyset to bound cardinality for features that mint millions of keys (pessimistic for overuse-detection alarms — safe direction); (3) folly::Singleton-backed synchronous shutdown flush drains the in-memory buffer on job exit despite shutdown-environment constraints. Canonical wiki instance of unified-library-for-fleet-telemetry + the unified-library leverage strategic posture: FBCrypto + Scribe being monoculture means instrumenting once inside FBCrypto yields fleet-wide observability for free. Canonical wiki instance of concepts/cryptographic-monitoring + concepts/telemetry-buffer-and-flush. Extends concepts/post-quantum-cryptography with the fleet-inventory-producer framing — knowing where classical asymmetric primitives are in use is the migration-scoping prerequisite the rollout-shape posts (GitHub / Cloudflare) depend on.
2024-10-15 — Meta's open AI hardware vision (first-party Meta Engineering, timed to OCP Global Summit 2024) — Meta contributes its next-generation AI-hardware stack to the Open Compute Project and projects forward. Four headline disclosures: (1) Catalina — a new 140 kW liquid-cooled AI rack on the ORv3 HPR chassis, hosting NVIDIA GB200 Grace Blackwell; modular, flexible, OCP-contributed; (2) Grand Teton extended to AMD Instinct MI300X — Meta's monolithic AI platform gains a second accelerator vendor, positioned for "large-scale AI inference workloads"; (3) Disaggregated Scheduled Fabric (DSF) — Meta's open vendor-agnostic AI backend, built on OCP-SAI + FBOSS + Ethernet-RoCE, with multi-vendor endpoint support (NVIDIA + Broadcom + AMD); companion silicon: new 51T fabric switches on Broadcom + Cisco ASICs and FBNIC, Meta's first in-house network ASIC; (4) Mount Diablo — Meta × Microsoft disaggregated 400 VDC power rack, continuing the long-standing Meta × Microsoft OCP co-design lineage (SAI 2018 → OAM → Mount Diablo 2024). Forward projection: ~1 TB/s per-accelerator injection bandwidth and matching bisection bandwidth — "more than an order of magnitude" growth over today's networks. Positions H100-based Grand Teton (2024-06 two-24K-GPU-cluster training substrate) as the predecessor generation; Catalina is the successor platform. Canonical wiki instance of open-hardware-for-AI-scaling thesis + modular-rack-for-multi-accelerator + co-design-with-OCP-partners patterns. Llama 3.1 405B training disclosed at > 16,000 H100 GPUs.
2024-09-10 — Sapling: Source control that's user-friendly and scalable (first-party Meta Engineering; post dated 2022-11-15, fetched into the wiki's raw corpus 2024-09-10) — Meta open-sources the Sapling client after "10 years" of internal development. Positioned as two orthogonal design axes (usability + scale) combined in one product. Scale claim: Sapling serves Meta's monorepo at "tens of millions of files, tens of millions of commits, and tens of millions of branches", a regime "public source control systems were not, and still are not, capable of handling." Mercurial lineage, not Git fork — started as Mercurial extension, diverged into its own storage / wire / algorithms; Git interop at the client-surface layer only. Key scale primitives: Segmented Changelog (megabyte-scale commit-graph index + O(log n) bisection for log/blame, "even in Git repositories"); lazy history download (clone ≈ free, fetch on demand); virtual file system (not yet open-sourced); sparse checkout + organization-owned sparse profiles checked into the repo (thousands of engineers on shifting subsets with zero per-engineer config); Watchman for sl status scan-avoidance. Key UX primitives: smartlog (default view, sl invokes it); undo subsystem (sl undo -i interactive scroller, sl hide/unhide, "never again should you have to delete your repository and clone again"); first-class commit-stack workflow built on mutation history tracking (inspired by Mercurial Evolve) — sl restack, sl absorb, sl amend --to COMMIT, sl fold, sl split. Companion system: ReviewStack — demo stack-oriented code-review UI for GitHub pull requests at reviewstack.dev. Open-source scope: client only; the Sapling-compatible server (Rust), VFS, Commit Cloud, and per-file history graphs are "we hope to open-source" — no commitment. Canonical wiki counterpoint to Dropbox's 87 GB GHEC monorepo (where tuned server-side Git repack is still viable) — at Meta's scale, a dedicated VCS is the only path.
2024-08-31 — How Meta enforces purpose limitation via Privacy Aware Infrastructure at scale (first-party Meta Engineering; PEPR 2024 companion) — Meta's technical explainer of Privacy Aware Infrastructure (PAI) — the initiative embedding first-class privacy constructs into Meta's software stack — and its anchor technology Policy Zones, an information flow control (IFC)-based runtime enforcement of purpose limitation integrated across HHVM (web/middle/backend) + Presto + Spark. Canonical wiki canonicalisation of IFC-at-hyperscale: rejects point-checking + ACLs + data lineage combo as insufficient for scaling across "dozens of our systems" and "millions of data assets"; references Denning's 1976 lattice model as the theoretical foundation; annotations on assets at table/column/row/cell granularity or parameter/variable/return-value granularity; logging → enforcement rollout via Policy Zone Manager (PZM) four-step workflow (identify assets → discover flows → remediate violations → continuously enforce/monitor); 10× improvements in computational efficiency disclosed through lattice-representation simplification + language-level context-propagation features in Hack / C++ / Python. Five named adoption lessons including focus on one end-to-end use case first and separate annotation from requirement. Meta's ML-based data classifier named as Step-1 auto-discovery input to PZM. Canonical runtime information flow enforcement + logging-mode-to-enforcement-mode rollout pattern instances.
2024-08-23 — Leveraging AI for efficient incident response (first-party Meta Engineering) — Meta's AI-assisted root-cause analysis (RCA) system for web monorepo incident investigations. Two-stage retrieve-then-rank-LLM: a heuristic retriever (code + directory ownership + runtime code graph) narrows "thousands of changes to a few hundred"; a fine-tuned Llama 2 (7B) ranks via ranking-via-election (B=20, K=5) to produce a top-5 list. 42% top-5 accuracy at investigation-creation time on backtested historical investigations. Pipeline: CPT on internal wikis/Q&As/code → mixed SFT with a dedicated RCA-SFT dataset of ~5,000 instruction-tuning examples (2-20 candidates each) → a second SFT round producing logprob-rankable ordered lists. Safety discipline named explicitly: closed feedback loops + explainability + confidence thresholding ("sacrificing reach in favor of precision"). Positioned as successor to Hawkeye (December 2023, ML-workflow debugging); future work: autonomous full-workflow execution + pre-push incident prediction.
2024-08-05 — DCPerf: An open source benchmark suite for hyperscale compute applications (first-party Meta Engineering) — Meta open-sources DCPerf, a benchmark suite where each benchmark is anchored to a real Meta production application and validated against production at microarchitectural level (IPC + core-frequency comparison graphs) vs SPEC CPU. Canonical statement of hyperscale compute workload as a distinct market segment requiring its own benchmark suite; canonical wiki instance of patterns/workload-representative-benchmark-from-production and patterns/pre-silicon-validation-partnership (two-year vendor-collaboration on pre-silicon / early-silicon bring-up, multiple core-microarchitecture + SoC-power-management optimizations identified). x86 + ARM, chiplet-architecture evaluation, multi-tenancy for rising core counts.
2024-06-16 — Maintaining large-scale AI capacity at Meta (first-party Meta Engineering) — how Meta patches, upgrades, and verifies the GPU training fleet (target: 600,000 GPUs) while guaranteeing "all capacity minus one maintenance domain" up 24/7. Names >30 maintenance operations, >50 component classes, 3 host-verification tasks, "thousands of disruptive AI host tasks per day", and OpsPlanner — Meta's unified disruptive-work orchestrator at ~1M ops/day. Introduces the maintenance train + maintenance domain + two-layer sliding upgrade architecture.
2024-06-13 — MLow: Meta's low bitrate audio codec (first-party Meta Engineering) — Meta's new RTC audio codec: CELP + split-band + range-encoded, 2× POLQA MOS over Opus at 6 kbps with 10% lower compute, enabling inband FEC at 14 kbps where Opus needs ≥19 kbps. Shipped on Instagram + Messenger + (rolling) WhatsApp. Canonical Meta example of the classical-DSP-over-ML-on-compute-constrained-targets decision: Encodec was available but >20% of Meta calls are on ARMv7 and 10s of millions of daily WhatsApp calls are on 10+-year-old devices.
2024-08-05 — A RoCE network for distributed AI training at scale (first-party Meta Engineering; SIGCOMM 2024 paper summary) — engineering deep-dive on the RoCE fabric underneath the 24K-GPU RoCE GenAI cluster, supporting Llama 3.1 405B training. Introduces the AI Zone two-stage Clos template (RTSW + CTSW + ATSW aggregator), the full routing evolution (baseline ECMP → concepts/path-pinning → E-ECMP + QP scaling with +40% AllReduce), and the counterintuitive congestion-control end state: DCQCN off at 400G for a year+, PFC + NCCL receiver-driven admission instead. Canonical wiki instance of patterns/collective-library-transport-codesign.
2024-06-12 — How Meta trains large language models at scale (first-party Meta Engineering) — the canonical Meta statement on the Llama 3 training substrate: two 24,000-GPU H100 clusters built in parallel on RoCE + InfiniBand, modified Grand Teton at 700 W air-cooled, with three stack-level network optimisations and an enumerated GPU-failure taxonomy.
2023-07-16 — Lessons Learned Running Presto at Meta Scale (republished on High Scalability; Meta-authored)

Key systems surfaced via Meta¶

SilverTorch — Index as Model recsys retrieval substrate (2026-05-26 SilverTorch post)¶

systems/silvertorch — new canonical wiki page for Meta's GPU-native unified retrieval substrate. Replaces the orchestrator + user-tower + ANN + filter + scoring microservice mesh with a single PyTorch model under the Index as Model paradigm. 23.7× throughput / 20.9× TCO efficiency vs same-architecture multi-service baseline on an 80M-item production retrieval workload; top-K capability raised from Faiss-GPU's 2,048 to "hundreds of thousands". Widely adopted across Facebook + Instagram + Threads feed + video. SIGIR 2026 paper at arXiv:2511.14881.
systems/torchrec — new canonical wiki page for PyTorch's sparse-table sharding library. SilverTorch's fourth-tier scale-out for embedding tables that span HBM + GPU host DRAM + remote CPU-host DRAM, "decoupling sparse data movement from computation."
systems/torch-compile — new canonical wiki page for PyTorch's GPU-kernel-rewriting compiler. Named in the SilverTorch post as the substrate-level optimisation SilverTorch inherits for free by being one PyTorch model.

systems/meta-tao — new canonical wiki page for Meta's production social-graph store (The Associations and Objects, USENIX ATC 2013). Objects + associations as first-class API primitives backed by sharded MySQL with geo-distributed caching. Referenced on the wiki via PlanetScale's benchmarking post; the two VLDB papers from Audrey Cheng's Berkeley/Meta team (VLDB 2021 workload characterisation + VLDB 2022 TAOBench) are the authoritative workload disclosure.
systems/taobench — new canonical wiki page for the open-source VLDB 2022 benchmark by Cheng + Meta engineers that synthesises the TAO workload into a runnable objects + edges schema (concepts/social-graph-objects-and-edges) with two workload profiles (A = application-transactional, O = overall) and a three-phase protocol (load → bulk-reads warmup → experiments). Deliberately stresses hot-row and thundering-herd behaviour — the viral-content workload pressure TPC-C/sysbench-tpcc don't cover. Introduced on the wiki via PlanetScale's 2022-09-08 Tech Solutions post ().

Facebook Groups scoped search — hybrid retrieval (2026-04-21 Groups Search post)¶

systems/meta-groups-scoped-search — the discussions-module product surface on Facebook Search for group-authored content. Re-architected 2026 from pure keyword retrieval onto a hybrid pipeline (parallel lexical + semantic arms merged at MTML L2).
systems/meta-unicorn-inverted-index — Unicorn, Facebook's in-house inverted-index system since the 2013 Graph Search era. First canonical wiki page. Sparse-lexical-retrieval arm of the hybrid pipeline; contributes TF-IDF + BM25 features to the L2 ranker.
systems/meta-ssr-search-semantic-retriever — Search Semantic Retriever (SSR). 12-layer, 200M-parameter encoder model producing dense query vectors for semantic retrieval. First canonical wiki SSR size disclosure.
systems/faiss — Meta's open-source vector-similarity-search library (github.com/facebookresearch/faiss). Production ANN substrate for the semantic-retrieval arm; precomputed index of group-post embeddings. First canonical wiki Faiss page.
systems/llama-3 (extended) — now also Meta's own LLM-as-judge in the Groups scoped-search BVT pipeline, grading results on a graded rubric (exact-match / somewhat-relevant / irrelevant).

Facebook Groups scoped search concepts + patterns (2026-04-21 Groups Search post)¶

concepts/sparse-lexical-retrieval — new canonical concept. Inverted-index-based retrieval scoring documents on exact/prefix term matches (TF-IDF, BM25). Complementary half of the hybrid-retrieval pattern; not replaced by dense retrieval in hybrid architectures but paralleled with it.
concepts/dense-semantic-retrieval — new canonical concept. Encoder-produced vector retrieval with ANN lookup and cosine-similarity scoring. Also called EBR (embedding-based retrieval) in Meta's production shorthand (L2 Model + EBR (Hybrid) ranker config).
concepts/query-preprocessing-tokenization-normalization — new canonical concept. The pre-retrieval stage (tokenize / normalize / rewrite) that produces clean inputs for both the inverted index and the embedding model. Shared-upstream-stage insight: modern hybrid retrieval makes query preprocessing a dual-consumer primitive, not a lexical-only concern.
concepts/somewhat-relevant-evaluation-category — new canonical concept. The distinctive middle-tier rubric label for LLM-judge grading of semantic-search results where query and result share a common domain or theme without exact-term match. Enables diversity + conceptual-match measurement; the graded-not-binary rubric shape Meta uses.
concepts/hybrid-retrieval-bm25-vectors (extended) — first social-network scoped-search instance on the wiki, complementing the enterprise/dev-search instances (Dash, Figma AI Search, Atlas, Cloudflare AI Search).
concepts/hybrid-search (extended) — add Meta Groups Scoped Search as a production instance.
concepts/multi-task-multi-label-ranking (extended) — third Meta-domain MTML instance after Reels recommendation (Friend Bubbles) + Ads ranking (Meta Adaptive Ranking Model). L2 ranker heads = clicks + shares + comments.
concepts/llm-as-judge (extended) — first Meta-authored LLM-as-judge instance on the wiki + first CI-integrated variant.
patterns/decoupled-parallel-retrieval-pipelines — new canonical pattern. Run lexical and semantic retrieval as parallel independent pipelines fed by a shared query-preprocessing stage, merging candidates only at the ranker. Meta's explicit statement: "we modernized the retrieval stage by decoupling the query processing into two parallel pathways, ensuring we capture both exact terms and broad concepts." Neither arm depends on the other's output; heterogeneous features (sparse lexical + dense semantic) arrive at the ranker as distinct inputs; ranker learns fusion.
patterns/llm-judge-in-build-verification-test — new canonical pattern. Integrate an LLM judge directly into the CI/build-verification-test path so every candidate build is graded against a benchmark query set before production rollout. Meta's Llama 3 multimodal judge with graded rubric is the canonical instance. Stronger operational stance than offline-leaderboard-eval siblings (Zalando / Instacart) — judge verdict is a build-gate, not only a training signal. Motivation: "validate quality at scale without the bottleneck of human labeling."

KernelEvolve agentic kernel-authoring system (2026-04-02 KernelEvolve post)¶

systems/kernelevolve — Meta's agentic kernel-authoring system for heterogeneous AI hardware (NVIDIA / AMD / MTIA / CPU). Six components: LLM Synthesizer + Tree Search Engine (MCTS + evolutionary; configurable memory operator per node) + Retrieval-Augmented Knowledge Base (self-evolving skill library) + Automated Evaluation Framework + Shared Data Foundation + Agentic RL. Production: >60% inference throughput on Andromeda over torch.compile + vendor libs baseline; >25% training throughput on ads model on MTIA; 100% KernelBench pass rate; "trillions of daily inference requests" of production kernels. Canonical wiki instance of agentic kernel synthesis at hyperscale.
systems/meta-mtia — Meta's custom AI accelerator silicon; four chip generations in two years (MTIA 300 → 500). Programmable via MTIA C++; profiled via MTIA Insight (PE utilization + DPE/SFU/MLU metrics + per-PE memory bandwidth). The proprietary-silicon axis that motivates RAG-over-hardware-docs + agentic kernel synthesis — no public LLM has been trained on MTIA code.
systems/triton-dsl — Open-source Python-embedded GPU DSL (OpenAI origin); TLX is Meta's experimental fork at github.com/facebookexperimental/triton. High-level DSL target for KernelEvolve's LLM synthesizer (portable NVIDIA + AMD).
systems/cute-dsl — NVIDIA's tensor-layout DSL (underpins CUTLASS). Another high-level DSL target for KernelEvolve alongside Triton / TLX / FlyDSL.
systems/tritonbench — Meta's open-source Triton-kernel benchmark harness at github.com/meta-pytorch/tritonbench; numerical correctness vs PyTorch + end-to-end speedup on production shapes. One of five layers of KernelEvolve's evaluation framework.
systems/kernelbench — Stanford's 250-problem external kernel-optimization benchmark; KernelEvolve achieves 100% pass rate. Headline external-benchmark disclosure in the 2026-04-02 post.
systems/nvidia-ncu — NVIDIA Nsight Compute; per-kernel GPU hardware metrics (occupancy, memory throughput, instruction mix). One component of KernelEvolve's multi-tool evaluation framework.
systems/proton-profiler — Intra-kernel instruction-level GPU profiler (IEEE document 11395207); complements NCU with finer-grained signal.
systems/meta-andromeda-ads — Meta's "next-gen personalized ads retrieval engine" (introduced 2024-12-02 Meta Production Engineering, not ingested on the wiki). KernelEvolve's headline production beneficiary: >60% inference throughput improvement on NVIDIA GPUs over a torch.compile + vendor-libraries baseline.

KernelEvolve concepts + patterns (2026-04-02 KernelEvolve post)¶

concepts/heterogeneous-ai-accelerator-fleet — new canonical concept. The forcing function: {hardware types × generations × model architectures × operators} combinatorial explosion exceeds human-expert kernel-tuning capacity. Drives the need for agentic kernel synthesis.
concepts/kernel-optimization-as-search — new canonical concept. Reframes kernel optimization as search over implementations; candidates are tree nodes; MCTS + evolutionary with structured feedback. Distinguishes agentic kernel authoring from one-shot LLM codegen + from compiler autotuners.
concepts/in-context-reinforcement-learning — new canonical concept. Learning that compounds through writes to a persistent retrieval store consulted at inference time, not through weight updates. Meta calls the self-evolving skill library "a form of in-context reinforcement learning." No model retraining required.
concepts/hardware-proprietary-knowledge-injection — new canonical concept. Encode proprietary-hardware architecture manuals + ISA refs + memory-hierarchy specs + optimization patterns into a retrieval store; inject into LLM context at generation time. Makes LLM-based codegen productive against silicon whose docs aren't in pretraining data (MTIA). Engineering-cost inversion: "When a new chip arrives, the engineering cost shifts from writing thousands of kernels by hand to curating a set of hardware documents and injecting them into the knowledge base."
concepts/agentic-kernel-synthesis — new canonical concept. The system-level framing — LLM synthesizer + tree search + RAG + evaluation harness + agentic RL composed into a production pipeline. Distinct from one-shot LLM codegen and from compiler autotuning.
patterns/rag-over-hardware-documentation — new canonical pattern. Hierarchical retrieval store with correctness constraints + platform-agnostic optimization guidance + hardware-specific docs + self-evolving skill library; retrieved dynamically by runtime signal (memory bottleneck → memory-hierarchy docs, compilation error → debugging guidance). Pattern implementation of hardware-proprietary-knowledge-injection.
patterns/tree-search-over-llm-candidates — new canonical pattern. Tree-search engine wrapping an LLM candidate generator; MCTS + evolutionary operators; configurable memory operator per node (inherit-parent / compare-siblings / combine-both / clean-slate) is the structural primitive beyond naive independent sampling.
patterns/evaluation-harness-in-agent-loop — new canonical pattern. Multi-layer structured-profiling harness (correctness + performance + system-profile + kernel-metrics + intra-kernel + accelerator-specific) composed via compiler-centric abstraction (MLIR-level instrumentation + profiling passes + trace synthesis); feeds why-signal back to the LLM synthesizer — structured > scalar feedback.
patterns/agentic-rl-from-production-signal — new canonical pattern. Close the loop between production deployment and model training: every agent session generates structured training data as natural byproduct; reward is from measured production outcomes; specialized smaller models post-trained on trajectories retain most of the frontier model's capability at compact-model serving cost. Virtuous-cycle data flywheel.

Kotlinator Java-to-Kotlin migration pipeline (2024-12-18 translating-10M-lines post)¶

systems/kotlinator — Meta's internal 6-phase pipeline for automated Java→Kotlin translation of the Android codebase at monorepo scale (~10M lines, ~100K files, >40K conversions shipped). Wraps J2K with ~50 preprocessing steps + ~150 postprocessing steps + linter pass + build-error-driven fix loop. Runs headless on a remote fleet; ~30 min/file; delivered via daily-diff cron. Canonical wiki instance of automated-migration-at-monorepo-scale.
systems/j2k-converter — JetBrains' open-source IntelliJ Java-to-Kotlin Converter (github.com/JetBrains/intellij-community/tree/master/plugins/kotlin/j2k). Kotlinator's core translator; Meta contributed headless mode + K2-compiler client hooks (2024). The "single-button IDE tool that does not scale without surrounding infrastructure" archetype on the wiki.
systems/intellij-platform — JetBrains' IDE platform. Host for J2K; host for Meta's Kotlinator plugin (extends ApplicationStarter to drive J2K from a server process).
systems/psi-libraries — JetBrains' Program Structure Interface. Java + Kotlin AST/semantic layer; the substrate on which Meta's internal Editus metaprogramming tool operates. Canonical wiki instance of "metaprogramming-on-broken-code" enabling tech — PSI "very much not a compiler plugin, so it can analyze broken code across both languages, and does so very quickly."
systems/kotlin-ast-tools — Meta's open-source subset of Kotlinator postprocessing transformations at github.com/fbsamples/kotlin_ast_tools. Future expansion target as Meta ports custom steps onto J2K's client hooks.
systems/nullsafe — Meta's Java null-safety static analyser (Infer-adjacent; github.com/facebook/infer/blob/main/infer/annotations/). Load-bearing prerequisite for Kotlin translation; limitation is coverage-bounded static analysis ("100% effective only for 100% coverage") — motivates runtime-nullability-telemetry as the closing complement.
systems/nullaway — Uber's sibling null-safety static analyser (github.com/uber/NullAway); named alongside Nullsafe as the class of tool any Java-to-Kotlin migration assumes.
systems/javac-plugin — the Java compiler plugin mechanism. Meta built a javac plugin that emits runtime-nullability-recording shims at Java/Kotlin interop boundaries — canonical wiki instance of concepts/runtime-nullability-telemetry and of the broader "when static analysis runs out of road, instrument production" discipline.

Kotlinator concepts + patterns (2024-12-18 translating-10M-lines post)¶

concepts/headless-ide-inspection — new canonical concept. Take an IDE inspection / refactor that was designed to run interactively and drive it from a server process with no UI. Enables parallelism, CI integration, cron scheduling. Meta's unlock for parallel J2K execution.
concepts/metaprogramming-on-broken-code — new canonical concept. AST-level tooling built on PSI-style libraries (not compiler plugins) so it can parse, analyse, and rewrite code that doesn't currently compile. Load-bearing when the intermediate state of a migration is unbuildable by construction.
concepts/build-error-driven-fix-loop — new canonical concept. Treat the compiler's error messages as the specification for the next round of edits: run the compiler, parse the errors, apply targeted edits, rebuild. The compiler is the oracle; the tool never has to re-implement compiler logic. Kotlinator phase 6.
concepts/interlanguage-null-safety — new canonical concept. Annotations + static analysers + runtime checks at Java/Kotlin boundaries. Kotlin's distinctive contribution: implicit checkNotNull at the interlanguage boundary in bytecode, making nullability operational at runtime rather than static-only as in Java's @Nullsafe/NullAway. Load-bearing concept under any Java-to-Kotlin migration of scale.
concepts/kotlin-platform-type — new canonical concept. Kotlin's String! type for values coming from unannotated Java. Three-state type system (String / String? / String!). Assigning a platform-type value to a Kotlin non-nullable inserts a runtime checkNotNull; translation pins the API to String or String? (irreversibly), crystallising any hopeful annotation into production behaviour.
concepts/runtime-nullability-telemetry — new canonical concept. Instrument the Java runtime (via a javac plugin) to record actual null flows through unannotated parameters + return types in production. Use the collected data to drive codemods that backfill correct @Nullable annotations, improving accuracy even for "Java files that we may never translate." The dynamic complement to static analysis.
concepts/bot-safer-than-human — new canonical concept. Deliberately automate transformations that aren't strictly necessary, precisely because they're delicate enough that human reviewers will get them wrong. Counterintuitive: scope up the bot to de-scope the human. Named example: condensing long chains of null checks — correctness-equivalent but "less susceptible to a well-meaning developer accidentally dropping a negation."
patterns/pipeline-with-open-ended-passes — new canonical pattern. Structure the transformation as an ordered sequence of named phases where any phase accumulates arbitrary custom steps over time. No single transform hits the long tail; the pipeline is the long-tail solution. Kotlinator: 6 phases, >200 custom steps across 3 of them.
patterns/automated-migration-at-monorepo-scale — new canonical pattern. The wrapping architectural pattern for large-scale migrations inside a monorepo: scale-out IDE-derived tooling via headless execution, per-file fan-out on a server fleet, daily-diff cron as the delivery channel, human review only at the final diff stage. Composes headless-IDE-inspection + pipeline-with-open-ended-passes + metaprogramming-on-broken-code + build-error-driven-fix-loop + daily-diff-cron + bot-safer-than-human + upstream-collaboration.
patterns/upstream-collaboration-as-migration-unblock — new canonical pattern. When a migration hits the ceiling of what can be done outside the core tool, invest in upstream contribution + collaboration to add extension points (client hooks, pre/post hooks, symbol-resolution APIs). The mutual-extension variant of patterns/upstream-the-fix. Meta × JetBrains J2K client hooks (2024 K2 port) is the canonical instance.
patterns/daily-diff-cron-for-automated-migration — new canonical pattern. Internal cron system produces a daily batch of migration diffs against user-defined selection criteria, auto-routes reviewers, runs tests, ships on human approval. Functionally equivalent to a PR bot, embedded in the organisation's diff-review infrastructure. Smooths review attention, automates routing, prevents commit storms.
patterns/closed-feedback-loop-ai-features (extended) — third Meta-domain instance after RCA + Friend Bubbles / FBDetect. Kotlinator's daily-diff cron enforces human approval as the shipping gate, and the validation-layer accumulation over 40,000+ conversions ("we've learned about many of these the hard way and now have several layers of validation") is the same closed-feedback-loop discipline applied to a bot-driven migration pipeline: the bot's output remains reviewable, and reviewer feedback flows back into new custom pipeline steps.

PQC migration framework (2026-04-16 PQC migration-strategy post)¶

systems/ml-kem — new canonical system page. NIST FIPS 203 module-lattice PQ KEM (Kyber). Meta's recommended PQ KEM: ML-KEM-768 default, ML-KEM-512 exception (per NIST PQC FAQ endorsement) for performance-constrained cases.
systems/ml-dsa-signature (extended) — NIST FIPS 204 module-lattice PQ signature (Dilithium). Meta's recommended PQ signature: ML-DSA-65 default, ML-DSA-44 exception. Meta positions ML-DSA as preferable to SPHINCS+ (larger signatures) and Falcon (floating-point arithmetic requirement).
systems/hqc — new canonical system page. NIST-selected 2025 (fifth PQC algorithm). Code-based KEM; non-lattice mathematical foundation. Meta cryptographers co-authored HQC (plus BIKE and Classical McEliece). Role: algorithmic-diversity hedge against potential weaknesses in ML-KEM's lattice approach.
systems/liboqs — new canonical system page. Open Quantum Safe library under Linux Foundation PQCA. Meta is a PQCA member and contributes (supports + fixes bugs including issue #1548).
systems/fbcrypto (extended) — re-positioned as the library PQC migration flows through — the substrate crypto-inventory's automated-discovery side instruments, and the surface that receives the PQC guardrails (guideline updates + key-generation friction + build-system API rules via Buck).
systems/buck2 — extended with the crypto-API-guardrails use case: build-system policy-enforcement point warning teams during code review on RSA / ECDH API usage.

PQC migration framework patterns + concepts (2026-04-16 PQC migration-strategy post)¶

concepts/pqc-migration-levels — new canonical concept. Five-rung maturity ladder (PQ-Unaware → PQ-Aware → PQ-Ready → PQ-Hardened → PQ-Enabled) organised around time-to-react-to-quantum-event. Even PQ-Ready is valuable — it reduces reaction time without yet enabling protection. PQ-Hardened exists for use cases where literature gaps (e.g. efficient PQ-OPRFs) prevent full enablement today.
concepts/pqc-prioritization-framework — new canonical concept. Three-tier classification by attack class (offline / online / Grover-only) rather than asset value. High (offline-attackable via Shor on public-key encryption + key exchange) split by external-dependency status; Medium (online-attackable via Shor on signatures) split by patching capability; Low (Grover-only on symmetric with inadequate parameters).
concepts/time-to-react-to-quantum-event — new canonical concept. The urgency metric organising the PQC Migration Levels ladder. Each rung permanently reduces the time needed to respond to a "relevant quantum event" (CRQC breakthrough / standards publication / new attack).
concepts/crypto-inventory — new canonical concept. The organisation-wide mapping of where cryptographic primitives are used. Prerequisite for any migration. Meta's two-strategy approach: automated discovery + developer reporting.
concepts/hybrid-vs-replacement-pqc-deployment — new canonical concept. The deployment-path decision axis. Meta's position: hybrid by default during transition, because SIKE's 2022 cryptanalytic invalidation demonstrates newer-standards risk.
patterns/pqc-migration-ladder — new canonical pattern. Structure the PQC migration programme as a laddered set of reachable milestones with independently-budgetable rungs.
patterns/crypto-api-guardrails — new canonical pattern. Three-layer prevent-new-vulnerable-usages discipline: (1) updated crypto guidelines; (2) friction on key-generation tooling; (3) build-system rules warning on RSA / ECDH API use during code review.
patterns/automated-discovery-plus-developer-reporting — new canonical pattern. Inventory-building by combining two complementary mechanisms with disjoint failure modes: runtime monitoring (captures active usage in primary libraries) + developer reporting (captures shadow dependencies, new architectures, intent).
patterns/third-party-dependency-quantum-assessment (extended) — Meta's three-class external-dependency enumeration (community-vetted standards / hardware / production implementations) + the contribute-upstream-to-unblock-your-own-migration consumer posture.
concepts/post-quantum-cryptography (extended) — the migration-framework companion to the 2024-12-02 inventory framing. Canonical Meta statement of migration-strategy principles (Effectiveness / Timeliness / Performance / Cost Efficiency).
concepts/cryptographic-monitoring (extended) — explicit "Crypto Visibility" positioning as the automated-discovery leg of the complementary-inventory pattern.

Fork modernization + dual-stack WebRTC (2026-04-09 WebRTC-escape-the-fork post)¶

systems/meta-webrtc-shim — Meta's dual-stack proxy library sitting between application code and two coexisting WebRTC implementations (legacy + upstream), exposing a single version-agnostic API and dispatching per-call via a runtime flavor enum. Statically linked into every app binary so 50+ RTC use cases (Messenger, Instagram, Cloud Gaming, Meta Quest VR casting) can A/B test each new upstream release against the legacy baseline. Load-bearing techniques: scripted namespace rewriting (webrtc:: → webrtc_latest:: / webrtc_legacy::) resolving thousands of ODR collisions; bulk using declarations for backward compatibility; AST-based codegen (1/day → 3–4/day); Buck build-graph duplication for injected components. 10,000+ new lines of shim code; hundreds of thousands of lines modified across thousands of files; no major issues. Binary-size cost: 5 MB uncompressed at the WebRTC layer vs 38 MB at the call-orchestration layer (87% reduction from layer choice). Outcome: launched webrtc/latest at M120, now at M145 ("living at head").
systems/libwebrtc — the upstream Google/Chromium-maintained WebRTC library Meta used to fork. Each Chromium release has an anchor Git tag (M143 = tag 7499, M145 ≈ 7559). Meta is now "living at head," ingesting new upstream releases continuously.
systems/meta-quest — one of the 50+ RTC use cases migrated onto the shim; the VR-casting surface. Named here with the other 49+ surfaces as an illustration of the shim's reach across Meta's client surfaces.
systems/buck2 — Meta's open-source build system; load-bearing for the WebRTC shim's injected-component duplication path ("dynamically changing namespaces at build time, duplicating the high-level build target, and exposing symbols for both flavors through a single header"). The alternative to source-level shimming when components plug deeply into WebRTC internals.
systems/chromium-git — the Chromium Git + tooling context libwebrtc lives in. Meta's external patch-tracking Git repo is based directly on libwebrtc's tree so Meta can reuse the Chromium tools (gn, gclient, git cl) without building an internal parallel.

Fork modernization patterns + concepts (2026-04-09 WebRTC post)¶

patterns/shim-for-dual-stack-ab-testing — new canonical pattern. Interpose a shim at the lowest practical layer to statically link two versions of a library in one binary, dispatch per-call at runtime, A/B test the upgrade across consumers. The load-bearing architectural move of Solution 1.
patterns/ast-codegen-for-boilerplate-shim — new canonical pattern. AST-parse the library's headers to emit adapter/converter/unit-test scaffolding for each class/struct/enum/constant. Velocity lift: 1 shim/day → 3–4 shims/day. Extends concepts/abstract-syntax-tree into the C++ codegen axis distinct from the query-language axis.
patterns/bulk-namespace-import-for-backcompat — new canonical pattern. Instead of hand-forward-declaring every symbol from webrtc_latest:: back into webrtc::, use bulk C++ using declarations: zero binary-size cost, new symbols automatic, single-flavor builds naturally supported, external engineers write unchanged code.
patterns/external-feature-branch-repo-for-monorepo-patches — new canonical pattern. Resolution to Solution 2's constraint: monorepo lacks branching, so maintain OSS patches as tag-anchored Git feature branches in a separate repo based on the upstream's own tree. Four named benefits: parallelizable, preserves Git history, LLM-friendly for auto-conflict-resolution, submit-ready upstream.
patterns/fork-retirement-via-ab-test — new canonical pattern. The migration strategy enabled by the shim: A/B test each use case app-by-app, ship, delete the legacy code, keep the shim for ongoing upstream-release A/Bs. Sibling to patterns/upstream-the-fix for cases where upstreaming the fork's features isn't possible immediately.
concepts/internal-fork-divergence — new canonical concept. The "forking trap": internal fork + upstream evolution + local patches → merge cost becomes prohibitive. The failure mode the escape-the-fork architecture is the escape hatch from.
concepts/shim-layer — new canonical concept. A thin proxy between consumers and one or more underlying implementations. Where you shim has order-of-magnitude cost consequences (5 MB vs 38 MB in Meta's case).
concepts/odr-violation — new canonical concept. C++ One-Definition-Rule violation. Forced by statically linking two copies of a library. Fixed by namespace renaming.
concepts/symbol-renamespacing — new canonical concept. Scripted rewrite of every namespace in a library copy to a flavor-specific prefix so both copies can coexist. The mechanical enabler of dual-stack.
concepts/feature-branch-patch-management — new canonical concept. Tracking patches against upstream OSS as Git feature branches per upstream tag, rather than stored patch files.
concepts/runtime-flavor-dispatch — new canonical concept. Template-specialization-based dispatch between implementations chosen by a runtime flag. The code-organization pattern that makes a dual-stack shim maintainable.
patterns/upstream-the-fix (extended) — ninth canonical instance and the dual-stack-A/B-harness variant, orthogonal to the seventh-instance FFmpeg "upstream the features, then delete the fork" shape. Both end at fork retirement via different mechanisms.
concepts/binary-size-bloat (extended) — the shim-layer-placement variant: Meta's 5 MB vs 38 MB datum canonicalises shim-layer-selection as a binary-size decision.
concepts/abstract-syntax-tree (extended) — the C++-codegen variant: AST of library headers → baseline adapter/converter/test scaffolding.
concepts/monorepo (extended) — the missing-branches-force-external-repo variant: Meta's monorepo constraint is what drives Solution 2's external Git repo.

Capacity efficiency AI platform (2026-04-16 Capacity Efficiency post)¶

systems/meta-capacity-efficiency-platform — Meta's unified AI-agent platform for hyperscale performance engineering. Two layers: MCP Tools (profiling queries · experiment results · configuration history · code search · documentation) + Skills (senior-engineer reasoning patterns encoded as domain-expertise modules telling the LLM which tools to use + how to interpret results). Same tools across offense + defense; skills differ per use case. Compounding platform-leverage: "each new capability requires few to no new data integrations since they can just compose existing tools with new skills." Program outcomes: hundreds of megawatts of power recovered program-wide; ~10-hour manual investigations compressed to ~30 minutes (~20×). Canonical wiki instance of patterns/mcp-tools-plus-skills-unified-platform.
systems/fbdetect — Meta's in-house performance-regression-detection tool (SOSP 2024). Catches regressions as small as 0.005% in noisy production environments; surfaces "thousands of regressions weekly"; correlates each regression to a root-cause PR using traditional techniques. The defense-side detector the AI Regression Solver acts on. First wiki ingest; SOSP 2024 paper at tangchq74.github.io/FBDetect-SOSP24.pdf.
systems/meta-ai-regression-solver — Meta's defensive AI agent producing fix-forward PRs for FBDetect-detected regressions, sent to the original root-cause PR author for review. Replaces the rollback-vs-ignore binary with a third option: automated mitigation. Three-phase pipeline: context (regressed functions + root-cause PR + changed files) → skill (e.g. logging → sampling) → resolution (PR to root-cause author).
systems/model-context-protocol (extended) — Meta's tool layer speaks MCP. Adds the hyperscaler-internal-infrastructure-tools canonical instance to the MCP corpus alongside the existing SaaS-facing instances (Datadog, Dropbox Dash, Cloudflare, Fly.io).

Capacity efficiency patterns + concepts (2026-04-16 Capacity Efficiency post)¶

patterns/mcp-tools-plus-skills-unified-platform — new canonical pattern. Two-layer platform: shared MCP tool surface + pluggable skill bundles + per-use-case agents. Same tools across specialties; specialties differ only in skills. Tool-reuse amortizes data-integration cost; skill authoring becomes the senior-engineer leverage mechanism. Adding a new AI use case = writing a new skill bundle, not a new pipeline.
patterns/ai-generated-fix-forward-pr — new canonical pattern. Detector attributes regression to root-cause PR → AI agent generates mitigation PR → routes to root-cause PR author for review. Dominates both baselines (rollback pays velocity; ignore pays capacity) when the skill catalogue covers the regression class. Canonical instance: Meta AI Regression Solver on top of FBDetect.
patterns/opportunity-to-pr-ai-pipeline, systems/ml-kem, systems/hqc, systems/liboqs, concepts/pqc-migration-levels, concepts/pqc-prioritization-framework, concepts/time-to-react-to-quantum-event, concepts/hybrid-vs-replacement-pqc-deployment, concepts/crypto-inventory, patterns/pqc-migration-ladder, patterns/crypto-api-guardrails, patterns/automated-discovery-plus-developer-reporting, patterns/third-party-dependency-quantum-assessment, systems/ml-dsa-signature — new canonical pattern. Offense sibling to ai-generated-fix-forward-pr. Opportunity library + pattern docs + examples + validation criteria → AI agent generates candidate fix with guardrails (syntax/style/right-issue) → lands in engineer's editor for one-click apply. Compresses per-candidate investigation from hours to review-minutes; handles the long tail that engineers would never get to manually.
patterns/specialized-agent-decomposition (extended) — fifth framing: skill-over-shared-tools composition. Agents specialise by skill bundle rather than by subject-matter domain (Storex) / sub-tool complexity (Dash) / refinement-loop role (DS-STAR) / offline-pipeline stage (Pre-Compute Engine). Meta's platform carries ≥7 specialist agents over the same tool layer.
patterns/closed-feedback-loop-ai-features (extended) — fourth Meta-domain instance after RCA + Kotlinator + Friend Bubbles. Fix-forward PR routing to root-cause author + offense candidate in engineer's editor are both human-in-the-loop closures; author accountability is preserved.
concepts/capacity-efficiency — new canonical concept. The discipline of reducing compute/power/capacity demand per unit of product value at hyperscale. Meta's program-level frame — at 3B+ user scale, 0.1% regressions cost significant power. Human-engineer-time is the named bottleneck; AI that multiplies per-engineer throughput is the lever.
concepts/offense-defense-performance-engineering — new canonical concept. The two-sided frame: proactively find optimizations (offense) + catch regressions (defense). Both problems share the same data shape — same tool layer, different skills — which is what made the unified platform economical. Generalisable to reliability / security / cost.
concepts/encoded-domain-expertise — new canonical concept. Meta's "skills" primitive: senior-engineer reasoning patterns expressed as markdown modules telling the LLM which tools to invoke + how to interpret results. Model-agnostic (markdown not embeddings); sibling to compass-shape context files from the 2026-04-06 Pre-Compute Engine post. Compass-shape is descriptive; skills are prescriptive.
concepts/context-engineering (extended) — third Meta 2026 instance alongside Pre-Compute Engine (offline compass-shape files) and implicit retrieval-augmented instances. Meta's Capacity Efficiency Platform is the runtime-composed skills + tools form; pairs with the offline-preloaded compass-shape form to cover both prescriptive-and-descriptive encoded-knowledge axes.

AI-agent pre-compute engine + tribal-knowledge extraction (2026-04-06 AI-pre-compute post)¶

systems/meta-ai-precompute-engine — Meta's Data Platform internal production system that produces and maintains a navigable tribal-knowledge layer over a 4-repo / 3-language / 4,100-file config-as-code pipeline. Three parts: (1) pre-compute swarm — 50+ specialised agents in one session producing 59 compass-shaped context files (25-35 lines / ~1,000 tokens each, 4 mandated sections); (2) runtime surface — the 59 files + cross-repo dependency index (30× compression on "what depends on X?") + data-flow maps + NL orchestration layer; (3) self-maintenance loop running "every few weeks" enforcing zero hallucinated file paths. Preliminary results on 6 tasks: 40% fewer tool calls per task, ~2 days → ~30 min workflow-guidance cycles, critic scores 3.65 → 4.20 / 5.0 across 3 rounds, 0 hallucinated paths. Model-agnostic — markdown not embeddings. Reuses Meta's operational-AI lineage (Meta RCA 2024-08-23) for the "is the pipeline healthy?" routing path via 85+ historical incident patterns.

AI-agent pre-compute engine + tribal-knowledge patterns + concepts (2026-04-06 AI-pre-compute post)¶

patterns/precomputed-agent-context-files — new canonical pattern. Extract module-level knowledge (purpose / modification patterns / build-failure patterns / deps / tribal knowledge) into compass-shaped markdown files via a one-session multi-agent orchestration pass; consume them at runtime opt-in per task. Pre-training-overlap-sensitive — helps on proprietary tribal-knowledge-heavy code, hurts on codebases LLMs already saw (2025 academic research on Django/matplotlib). Extraction pays once; consumption pays many times.
patterns/multi-round-critic-quality-gate — new canonical pattern. Gate AI-generated durable artifacts behind 3 rounds of independent critic-agent scoring + fixer-agent corrections with hard invariants (zero hallucinated paths) at the final round. Distinct from runtime concepts/llm-as-judge by applying pre-release to durable content with a separate fixer stage between rounds. Meta: 10+ critic passes × 3 rounds + 4 fixers took scored quality from 3.65 → 4.20 / 5.0.
patterns/five-questions-knowledge-extraction — new canonical pattern. Per-module analyst methodology: (1) what does this configure? (2) common modification patterns? (3) non-obvious build-failure patterns? (4) cross-module deps? (5) tribal knowledge in comments? Failure-first shape — questions 3 and 5 target silent-wrong-output mitigations. Q5 produced 50+ non-obvious patterns at Meta. Maps 1:1 to the four compass-shape sections of the output file.
patterns/self-maintaining-context-layer — new canonical pattern. Four-step automated refresh loop (validate paths / detect coverage gaps / re-run critics / auto-fix stale references) at a bounded cadence (Meta: "every few weeks"). Operational answer to concepts/context-file-freshness; applicable to runbooks / migration guides / API changelogs beyond context files.
concepts/tribal-knowledge — new canonical concept. Undocumented domain-specific conventions, invariants, and failure modes that live only in engineers' heads and scattered code comments. Pretraining-overlap-asymmetry argument: the knowledge LLMs don't have is precisely the knowledge that matters on proprietary codebases.
concepts/compass-not-encyclopedia — new canonical concept. Per-module context files at 25-35 lines / ~1,000 tokens with 4 mandated sections (Quick Commands / Key Files / Non-Obvious Patterns / See Also). Design principle: "No fluff, every line earns its place." Token-budget discipline at the per-file level; composes with context-engineering at the per-turn level.
concepts/config-as-code-pipeline — new canonical concept. Workload class whose behaviour is primarily driven by code-versioned config files across multiple subsystems; the highest-yield workload for the precompute-context-files pattern. Meta's example: 4 repos / 3 languages / 6 subsystems that synchronise on every data-field change.
concepts/context-file-freshness — new canonical concept. "Stale context is worse than no context" — stale context makes agents confidently wrong where no context makes them slowly explore. Asymmetry argument + automated-refresh operational discipline.
concepts/context-engineering (extended) — canonical offline-preloading instance added alongside the existing retrieval-augmented (Instacart) and runtime-budget (Fly.io / Dropbox / Datadog) instances.
patterns/specialized-agent-decomposition (extended) — fourth framing added: the offline-context-generation framing with 9 pipeline-stage roles. Sits alongside Storex (domain-based), Dash (sub-tool), DS-STAR (role-in-refinement-loop).

LLM-scale ads ranking (2026-03-31 Adaptive Ranking Model post)¶

systems/meta-adaptive-ranking-model — Meta Ads's LLM-scale ads-ranking serving stack. Three pillars: (1) request-centric computation (heavy user tower once per request, shared across ad candidates via in-kernel broadcast; long user sequences in centralised KV store joined to training data on the fly); (2) model-system co-design (selective FP8 guided by micro-benchmark, operator fusion, Grouped GEMM, horizontal fusion — 35% MFU across multiple hardware types); (3) multi-card embedding sharding (O(1T) parameter scale, parity with single-card setups). Launched Instagram Q4 2025 (+3% conversions, +5% CTR for targeted users); O(10 GFLOPs)/token complexity at O(100 ms) bounded latency. Accelerated model loading in <10 min via multi-stream + remote caching; SM-utilisation-based auto-scaling.
systems/wukong-turbo — optimised runtime evolution of Meta's internal Wukong ranking architecture used inside Adaptive Ranking Model. Adds no-bias for numerical stability, small-parameter delegation from FSDP to DDP to cut all-gather overhead, and sparsity-based linear-layer simplification — without increasing FLOPs or parameter counts.
systems/wukong-meta — stub for the 2024 Wukong paper (arXiv:2403.02545) architecture Wukong Turbo builds on: stackable factorisation machines, sequence learning, cross-layer attention.
systems/meta-instagram — extended with Q4 2025 Adaptive Ranking Model launch as the only named product deployment surface in the post (+3% conversions, +5% CTR for targeted users).

LLM-scale ads ranking patterns + concepts (2026-03-31 Adaptive Ranking Model post)¶

concepts/inference-trilemma-recsys — three-way tension between model complexity, sub-second latency, and cost efficiency at LLM-scale recsys serving. Meta's explicit design frame for Adaptive Ranking Model.
concepts/request-oriented-computation-sharing — heavy user-context computed once per request, broadcast to candidates in-kernel; transforms scaling from linear to sub-linear in candidate count. Extended by in-kernel broadcast for zero extra HBM traffic.
concepts/request-oriented-sequence-scaling — long user behaviour sequences processed once per request; centralised KV store of user logs joined with training data on the fly replaces per-row replication.
concepts/selective-fp8-quantization — post-training quantisation applying FP8 only to micro-benchmark-verified precision-tolerant layers; the alternative to quality-destroying blanket FP8 casts for ranking-sensitive domains.
concepts/multi-card-embedding-sharding — architectural primitive for embedding tables exceeding single-GPU memory; split across hardware-aware cluster with communication optimisations, achieving parity with single-card setups.
concepts/unified-embeddings — multiple features share one embedding table, reducing memory footprint.
concepts/hash-collision-embedding-tradeoff — core tension in embedding-table sizing (oversize overfits, undersize collides); motivates sparsity-aware allocation + pruning + unified embeddings.
concepts/hardware-aware-model-architecture — model-design discipline aligning structure with accelerator capabilities (dtype, memory hierarchy, Tensor Core shapes, kernel-launch overhead). Canonical wiki statement tied to 35% MFU outcome.
concepts/model-flops-utilization — extended with Meta's 35% MFU data point across heterogeneous hardware in recsys serving.
patterns/request-centric-inference-architecture — the overall pattern; restructures inference from per-(user, candidate) to per-request.
patterns/model-system-codesign-ranking — the four co-design levers (selective FP8, operator fusion for shared inputs, Grouped GEMM, horizontal fusion) that drive MFU.
patterns/multi-card-sharded-embedding-serving — serving-layer pattern for terabyte-scale embedding tables; decouples parameter count from single-GPU memory ceilings.
patterns/selective-mixed-precision-quantization — generalised wiki pattern for per-layer micro-benchmark-guided precision assignment; Adaptive Ranking Model is the canonical ranking-domain instance.

Foundational allocator stewardship (2026-03-02 jemalloc post)¶

systems/jemalloc — Meta's tier-0 foundational memory allocator, maintained upstream. Originally by Jason Evans (FreeBSD, 2005); Facebook-era standard allocator. GitHub repository was archived in 2024 during a stewardship drift; unarchived in 2026-03-02 alongside Meta's public reset. 2026 roadmap: technical debt reduction, HPA improvements for THP CPU efficiency, memory efficiency (packing/caching/purging), AArch64 out-of-box performance. Positioned "alongside the Linux kernel and the compilers" as the foundational-software class requiring "the highest rigor." Previously on the wiki only as the memory-profiler backend of Strobelight — this post promotes it to a first-class Meta page.

Video transcoding infrastructure (2026-03-09 FFmpeg post)¶

systems/ffmpeg — the open-source multimedia CLI that Meta runs tens of billions of times per day for DASH VOD + livestream transcoding. Meta co-developed upstream with FFlabs and VideoLAN the two features that let Meta fully retire its internal FFmpeg fork: threaded multi-lane transcoding (FFmpeg 6.0 → 8.0, "the most complex refactoring of FFmpeg in decades") and in-loop decoding (FFmpeg 7.0+) for real-time per-lane quality metrics. Both upstreaming efforts spanned years and multiple releases.
systems/ffprobe — FFmpeg's companion media-inspection CLI. Meta runs it at the same invocation scale as ffmpeg.
systems/meta-msvp — Meta Scalable Video Processor, Meta's custom video-transcoding ASIC. Integrated into Meta's FFmpeg via the same standard hardware-accel API that exposes NVDEC/NVENC/UVD/QSV. Patches kept internal because MSVP hardware is Meta-only and external FFmpeg developers cannot validate changes without it — canonical wiki instance of patterns/keep-infrastructure-specific-patches-internal. Meta accepts the reverse-rebase cost against each new upstream release.
systems/nvidia-nvenc-nvdec — NVIDIA's fixed-function hardware encode/decode engines. Pre-existing FFmpeg hardware-accel target; named alongside MSVP as a peer implementation of the shared API.
systems/intel-quick-sync-video — Intel's iGPU media engine. Pre-existing FFmpeg hardware-accel target; second peer implementation of the shared API named in the post.

Video transcoding patterns + concepts (2026-03-09 FFmpeg post)¶

concepts/video-transcoding — the general primitive; decode a source + re-encode to one or more target encodings. FFmpeg is the canonical toolchain.
concepts/adaptive-bitrate-streaming-dash — DASH (Dynamic Adaptive Streaming over HTTP); client dynamically selects among pre-encoded renditions. Forces a multi-lane encoding ladder at production time.
concepts/multi-lane-encoding-pipeline — architectural shape producing N encoded outputs from one source; the ladder for DASH. Single-process multi-output is the efficient form.
concepts/in-loop-quality-metrics — reference metrics (PSNR/SSIM/VMAF) computed during transcoding by inserting a decoder after each encoder; the unblock for livestream quality monitoring.
concepts/visual-quality-metric — the metric category itself. Reference metrics (PSNR/SSIM/VMAF) vs no-reference metrics. Post names PSNR, SSIM, VMAF as the production-relevant set.
concepts/hardware-accelerated-video-codec-api — FFmpeg's shared abstraction across NVENC/NVDEC/UVD/QSV/MSVP; the architectural primitive that keeps MSVP integration narrowly scoped and reverse-rebasable.
patterns/deduplicate-decode-across-encoder-lanes — one FFmpeg command, one decoder, N parallel encoders. Canonical Meta-driven upstream win (FFmpeg 6.0 → 8.0).
patterns/in-loop-decoder-for-realtime-quality-metrics — per-lane reference metrics during encoding, in a single command. Canonical Meta-driven upstream win (FFmpeg 7.0+).
patterns/keep-infrastructure-specific-patches-internal — the explicit complement to patterns/upstream-the-fix introduced by this post; MSVP is the canonical instance. Together the two patterns form Meta's decision framework for "upstream it / keep it internal".
patterns/upstream-the-fix — extended: Meta × FFmpeg (6.0 → 8.0, 7.0+) is the seventh canonical instance on the wiki and the highest-stakes outcome to date (multi-year, multi-release, fork retirement as the tangible result).

WhatsApp client-side Rust + media-attack-surface defense (2026-01-27 Rust-at-scale post)¶

systems/whatsapp-wamedia — WhatsApp's cross-platform media library, rewritten from 160,000 LoC C++ (without tests) → 90,000 LoC Rust (with tests) with performance + runtime-memory advantages. Deployed across Android / iOS / Mac / Web / Wearables; shipped monthly to "billions of phones, laptops, desktops, watches, and browsers" through WhatsApp + Messenger + Instagram. "The largest ever deployment of Rust code to a diverse set of end-user platforms and products that we are aware of."
systems/whatsapp-kaleidoscope — WhatsApp's ensemble of format-conformance / risk-indicator / file-type-spoof / dangerous-type checks that wamedia enables; the app-layer malware-defense layer that sits in front of OS-provided media parsers the app cannot patch. Canonical defense-in-depth instance on the client-side / media-processing axis.
systems/whatsapp — host product; 3B+ daily users; default E2EE; the cluster of security/privacy systems on this wiki now spans Private Processing (AI over E2EE via TEE), Advanced Chat Privacy (per-chat controls), wamedia + Kaleidoscope (client-side media malware defense), and the Research Proxy (bug-bounty-facing).
systems/messenger + systems/meta-instagram — the other two Meta products that ship the Rust-rewritten wamedia monthly. Same 2015-Stagefright forcing function applies; same rollout cadence.
systems/whatsapp-research-proxy — first canonical wiki instance of a bug-bounty research proxy: Meta publishes a tool "that makes research into WhatsApp's network protocol more effective" to expand the effective security-review headcount via bug-bounty incentive alignment.

Private AI inference on TEE (2025-04-30 WhatsApp Private Processing post)¶

systems/whatsapp-private-processing — Meta's confidential-computing infrastructure for running AI features over WhatsApp's end-to-end-encrypted messages without Meta, WhatsApp, or any third party ever seeing the plaintext. First use case: message summarisation + writing suggestions. Canonical wiki instance of the TEE-for-private-AI-inference architectural pattern. Stacks five foundational requirements (confidential processing, enforceable guarantees, verifiable transparency, non-targetability, stateless + forward security) + six wire-session phases (anonymous credentials → HPKE key fetch → OHTTP through third-party relay → RA-TLS with attestation-against-ledger → ephemeral E2EE request → response). Announcement voice; launch projected "in the coming weeks" from 2025-04-30; security engineering design paper promised at launch.
systems/cvm-confidential-virtual-machine — the CPU-based Confidential Virtual Machine + Confidential-Compute-mode GPU primitive Meta builds Private Processing on. Memory encrypted in hardware; host OS + hypervisor removed from the TCB; remote shell prohibited; CVM-to-CVM calls ride the same RA-TLS primitive. Neither CPU vendor (AMD SEV-SNP / Intel TDX / Arm CCA) nor GPU vendor/mode named in the post.
systems/meta-acs-anonymous-credentials — Meta's Anonymous Credential Service (ACS), open-sourced December 2022, now load-bearing in Private Processing's authentication phase. Mints credentials that prove "authentic WhatsApp client" without identifying the user — the prerequisite that makes OHTTP-relay-based non-targetability actually non-targetable (identity-bearing auth tokens inside the tunnel would defeat IP-stripping at the relay).
systems/whatsapp-advanced-chat-privacy — the pre-existing WhatsApp feature giving users a per-chat opt-out for AI features. Composes with Private Processing: Advanced Chat Privacy provides chat-granularity refusal; Private Processing provides request-granularity E2EE-preserving infrastructure.

Private AI inference on TEE patterns + concepts (2025-04-30 Private Processing post)¶

concepts/trusted-execution-environment — the generic TEE primitive class; CVMs are the VM-granularity realisation.
concepts/confidential-computing — the industry-wide posture of protecting data in use via TEEs; the third leg alongside at-rest + in-transit encryption.
concepts/remote-attestation — hardware-rooted proof that a specific binary is loaded in a genuine TEE; gated against a published ledger.
concepts/ra-tls — the TLS composition that binds session-key release to attestation verification.
concepts/oblivious-http — the two-party-trust-partitioned transport that strips client IP at a third-party relay.
concepts/hpke — the cryptographic primitive (RFC 9180) underneath OHTTP's inner encryption.
concepts/non-targetability — first-class security property: attack cost scales with fleet, not victim. Enabled by OHTTP + anonymous credentials + attestation-against-ledger.
concepts/stateless-processing — service-level discipline making later compromise unable to recover past sessions (there's nothing to find).
concepts/forward-security — ephemeral-key sibling of statelessness; TEE-non-extractable keys destroyed at session end.
concepts/verifiable-transparency-log — third-party-operated append-only ledger of acceptable binary digests; turns attestation from audit evidence into enforcement.
concepts/data-minimization — explicit design axis: requests to Private Processing carry only the data needed for the immediate step (summarisation request = only the messages being summarised).
concepts/end-to-end-encryption — the invariant Private Processing preserves across a server-side compute step.
concepts/anonymous-credential — extended with Meta's ACS as the second canonical industrial instance after Cloudflare / Privacy Pass.
concepts/threat-modeling — extended with the Meta Private-Processing instance: three actor classes, three named scenarios, first canonical wiki instance applied to confidential-computing + private-AI-inference.
concepts/defense-in-depth — extended with the TEE-substrate + transparency-log axis; canonical wiki instance of defence-in-depth for private-AI-inference where each layer relies on a different trust root (CPU vendor / ledger operator / relay operator / external researchers).
patterns/tee-for-private-ai-inference — the containing architectural pattern.
patterns/third-party-ohttp-relay-for-unlinkability — the routing-layer pattern closing the targeted-host attack path.
patterns/attestation-before-session-key-release — the session-gating realisation (RA-TLS).
patterns/publish-binary-digest-ledger — the transparency-side companion to attestation.

Fleet profiling + orchestrator + FDO pipeline (2025-03-07 Strobelight post)¶

systems/strobelight — Meta's fleet-wide profiling orchestrator — scheduler + coordinator + queuer + symbolization frontend over 42+ profilers (memory via jemalloc, function call count, event-based for C++/Python/Java/Erlang, AI/GPU, off-CPU, service-request-latency). Runs on every Meta production host. Three execution modes: on-demand, continuous (default-flight-recorder for every service), triggered. Canonical wiki instance of patterns/profiler-orchestrator. Partially open-sourced at github.com/facebookincubator/strobelight.
systems/meta-configerator — Meta's holistic configuration management system; the config substrate through which service owners commit continuous / triggered profile configs (e.g. add_continuous_override_for_offcpu_data(team, Type.SERVICE_ID, service, samples_per_hour)) that Strobelight picks up fleet-wide. First-class Meta primitive surfaced on the wiki.
systems/meta-bolt-binary-optimizer — Meta's open-source post-compile binary optimiser (CGO 2019, now an LLVM-upstream tool). Consumes LBR-derived FDO profiles alongside compile-time CSSPGO; delivers up to 20% CPU-cycle reduction on Meta's top-200 services. Later extended by the 2026-04-02 Redpanda ingest with the first canonical non-Meta brittleness datum.
systems/bpftrace — open-source high-level tracing DSL for eBPF (github.com/bpftrace/bpftrace); Strobelight's ad-hoc-profiler substrate — engineers ship new profilers in hours instead of weeks by committing a bpftrace script.
systems/tracery-meta — Meta's client-side-columnar-DB-in-JavaScript trace UI; consumed for the Strobelight Crochet profiler combining request spans + CPU-cycles stacks + off-CPU data on one timeline.
systems/blazesym — Meta's open-source multi-language symbolization library in Rust (github.com/libbpf/blazesym); part of the centralised symbolization service backing Strobelight.
systems/gsym — compact DWARF-derived symbolization format (github.com/YtnbFirewings/gsym); paired with blazesym inside the Strobelight symbolization service.
systems/jemalloc — already surfaced via 2026-03-02 stewardship-reset post; this post first-cites it as Strobelight's memory-profiler backend, extending the tier-0-foundational-software framing with its profiling-substrate role.
systems/scuba-meta — extended with Strobelight as a second canonical upstream producer alongside FBCrypto — two different upstream shapes (aggregated counts + symbolized profile samples) feeding the same warm-query surface.
systems/ebpf — extended with the fleet-profiling use-case family, joining the existing security + networking eBPF families on the wiki.

Fleet profiling patterns + concepts (2025-03-07 Strobelight post)¶

patterns/profiler-orchestrator — new canonical pattern. Centralised scheduler + queuer + safety-rules + symbolization-frontend over many specialised profilers. Safety centralised, economics compound, cross-profiler composition becomes feasible. Strobelight is the archetype.
patterns/ad-hoc-bpftrace-profiler — new canonical pattern. Let engineers ship user-authored eBPF scripts as first-class profilers through the orchestrator's safety/queue/symbolization pipeline. New-profiler-lead-time drops from weeks to hours. The escape hatch that makes a centralised orchestrator feel uncentralised from the user's side.
patterns/delayed-symbolization-service — new canonical pattern. Raw addresses + unwound stacks to disk on-host; central pre-indexed service resolves to (function, file, line, type, inlines) off-host. DWARF is parsed once per binary version (not once per sample) using DWARF + ELF + gsym + blazesym. Decouples producer rate from consumer symbolization cost; on-host workload is never squeezed by DWARF parsing.
patterns/default-continuous-profiling — canonical Meta instance (existing page extended). Run curated profilers on every host continuously-but-sparsely, tuned not to perturb workloads, so data is always there when an incident opens.
patterns/feedback-directed-optimization-fleet-pipeline — canonical Meta instance (existing page extended). LBR profiler → FDO profiles → CSSPGO (compile-time) + BOLT (post-compile) → 10-20% server reduction on top-200 services; the closed-loop economic engine that pays for fleet profiling.
concepts/ebpf-profiling — new canonical concept. Profiling via verifier-gated eBPF programs attached to kernel hooks. The "custom actions at sample time" property (reading TLS, cgroup IDs, request-context via eBPF) is what enables runtime-metadata-attach and makes eBPF uniquely suited to production profiling vs perf / ptrace / kernel modules.
concepts/dwarf-debug-info — new canonical concept. The standard debug-information format emitted into ELF; "many megabytes to gigabytes" per binary; too expensive to parse on-host — motivates delayed symbolization.
concepts/dynamic-sampling-rate-tuning — new canonical concept. Feedback-control loop adjusting per-service run probability daily to hit a samples-per-hour target, with sample weights emitted at capture so cross-host + cross-service aggregation remains valid. The mechanism that makes fleet-scale efficiency analysis tractable.
concepts/delayed-symbolization — new canonical concept. Defer address-to-symbol resolution off-host to prevent memory thrash on the profiled host + decouple producer from consumer (so a slow symbolizer can't cause dropped samples).
concepts/frame-pointer-unwinding — new canonical concept. Cheap stack-walk via the frame-pointer register chain; Meta pays the ~1-2% register-pressure tax on every user-space binary fleet-wide because it's the precondition for affordable eBPF-based stack sampling. Canonical wiki echo of Brendan Gregg's 2024 "Return of the Frame Pointers".
concepts/ad-hoc-profiler — new canonical concept. User-authored profiler written in a DSL the platform runs as first-class — the generalised shape the bpftrace-specific pattern instantiates.
concepts/stack-tag-enrichment — new canonical concept. Query-time DSL for tagging / filtering call stacks (inspired by Microsoft stack tags). Strobelight's Stack Schemas: add tags to whole stacks or frames; strip framework frames by regex; apply many schemas per-service or per-profile.
concepts/runtime-metadata-attach — new canonical concept. Attach request-context (request IDs, endpoint, latency buckets) to samples at sample time via thread-local storage + eBPF. Strobelight's Strobemeta: makes "stacks from p99-latency requests on endpoint /search only" queryable without post-hoc joins to trace data.

Flash media tiering + QLC storage (2025-03-04 QLC post)¶

systems/qlc-flash — Meta's new middle-tier NAND flash (4 bits per cell), positioned between HDD and TLC on the BW/TB / cost / endurance / density axes. Invented 2009; finally data-center-viable at Meta scale via 2 Tb dies + 32-die stacks. Deployed on read-BW-intensive + low-write-BW workloads where endurance matches and R/W asymmetry is manageable.
systems/tlc-flash — Meta's incumbent data-center flash tier. Continues for write-heavy + latency-sensitive mixed workloads; QLC sits below it, not instead of it.
systems/pure-storage-directflash-module — Pure Storage DFM + DirectFlash software (userspace FTL). Custom module scalable to 600 TB per drive; physically fits U.2-15mm slots alongside standard NVMe QLC. Canonical Meta × Pure Storage co-design instance on the flash-media axis.
systems/u2-15mm-form-factor — Meta's chosen QLC form factor: industry-standard, scales to 512 TB, accepts DFMs. Wins over E1.S (too small for QLC package count) and E3 (4-variant fragmentation).
systems/e1s-form-factor — Meta's incumbent TLC form factor. Great for TLC; rejected for QLC on volume grounds.

Format-aware compression (2025-10-06 OpenZL post)¶

systems/openzl — OpenZL, Meta's open-source format-aware lossless compression framework (released 2025-10-06). Takes structured data (tabular / columnar / numeric arrays / timeseries / ML tensors / database pages) + an explicit shape description, trains a Plan via a budgeted search over transform choices, resolves the Plan into a concrete per-frame Resolved Graph embedded in the frame, and decodes with a single universal decoder binary. Silesia sao headline: x2.06 ratio at 340 MB/s compression + 1200 MB/s decompression, beating xz -9 on both axes and beating zstd -3 on ratio at comparable speed.
systems/zstandard-zstd — Zstandard, Meta's 2016 general-purpose compressor. The lineage OpenZL is the architectural successor to for structured data, the baseline against which OpenZL is measured, the fallback engine when OpenZL can't find structure to exploit, and the original integration target for Managed Compression.
systems/openzl-sddl — Simple Data Description Language. The declarative way users tell OpenZL what shape their bytes have (rows / columns / enums / nested records). Parser-equivalent alternative is a registered parser function.
systems/managed-compression-meta — Managed Compression, Meta's internal runtime originally built (2018) to automate zstd dictionary training + rollout. In 2025 extended to OpenZL Plans: register use case → monitor → sample production data → periodically re-train → roll out new Plans "like any other config change." Old frames keep decoding; new frames benefit.

Open AI hardware + OCP 2024 (2024-10-15 OCP AI-hardware-vision post)¶

systems/catalina-rack — Meta's next-generation AI rack built on the ORv3 HPR chassis (up to 140 kW), liquid-cooled, hosting NVIDIA GB200 Grace Blackwell. Components: power shelf + compute tray + switch tray + ORv3 HPR + Wedge 400 fabric switch + management switch + BBU + rack management controller. Successor to the air-cooled Grand Teton H100 platform; OCP-contributed.
systems/orv3-rack — the Open Rack v3 high-power-rack variant introduced via Catalina, specified for up to 140 kW per rack — more than an order-of-magnitude beyond typical air-cooled OCP rack power envelopes.
systems/nvidia-gb200-grace-blackwell — NVIDIA's Blackwell-generation Grace+Blackwell Superchip; the silicon Catalina is designed around.
systems/amd-instinct-mi300x — AMD's flagship data-center GPU, now supported on Grand Teton via a new OCP-contributed platform variant positioned for "large-scale AI inference workloads."
systems/meta-dsf-disaggregated-scheduled-fabric — Meta's Disaggregated Scheduled Fabric (DSF): open vendor-agnostic AI networking backend, powered by OCP-SAI + FBOSS + Ethernet-RoCE, with multi-vendor endpoint support (NVIDIA + Broadcom + AMD named). Overcomes scale / component-supply / power-density limits of Meta's prior vertically-integrated switches.
systems/fboss-meta-network-os — FBOSS, Meta's open-source network operating system (2018) for controlling switches; control-plane substrate for DSF.
systems/ocp-sai — Switch Abstraction Interface, the vendor-agnostic switch-ASIC API standard Meta + Microsoft co-developed for OCP in 2018; foundational abstraction for DSF.
systems/fbnic — Meta's first in-house network ASIC (as a NIC module), shared with OCP; silicon response to the projected ~1 TB/s-per-accelerator injection-bandwidth regime.
systems/mount-diablo-power-rack — Meta × Microsoft disaggregated 400 VDC power rack contributed to OCP; allows more AI accelerators per IT rack.
systems/meta-wedge-400 — Meta's OCP-contributed fabric switch (2021), a component of the Catalina rack assembly.
systems/oam-open-accelerator-module — the OCP accelerator-module standard that makes Grand Teton's NVIDIA+AMD multi-vendor support feasible; part of Meta's Meta-Microsoft OCP lineage.
systems/llama-3-1 — Llama 3.1 405B training disclosed at > 16,000 H100 GPUs; re-anchors the 2024-06 two-24K-GPU-cluster substrate as the predecessor scale.

Privacy + information flow control (2024-08-31 PAI post)¶

systems/meta-privacy-aware-infrastructure — Meta's Privacy Aware Infrastructure (PAI) umbrella initiative embedding first-class privacy constructs into the software stack. Publicly announced January 2024; technically detailed at PEPR 2024 + the 2024-08-31 Meta Engineering post. PAI is the multi-year engineering + tooling investment; Policy Zones is the enforcement primitive underneath it.
systems/meta-policy-zones — Meta's information flow control (IFC) runtime technology. Encapsulates + evaluates + propagates privacy constraints for data in transit and at rest at runtime. Integrated across HHVM (function-based: call-tree zone propagation), Presto, and Spark (batch: per-SQL-job zones with table/column/row annotations). Implemented in PAI runtime libraries in Hack / C++ / Python. Performance engineered for "10× improvements in computational efficiency" via simplified policy-lattice representation + native language-level context propagation + canonicalized policy annotation structures. Rollout discipline: logging mode → enforcement mode.
systems/meta-policy-zone-manager — Policy Zone Manager (PZM), the UX + automation suite that makes Policy Zones adoption tractable across Meta's polyglot multi-thousand-engineer codebase. Four-step workflow: identify assets → discover downstream flows (via lineage) → remediate violations (three cases: safe / unsafe / reclassified) → continuously enforce + monitor via verifiers. The lesson "Build tools; they are required" is directly about PZM: it reduces engineering effort for purpose-limitation rollout "by orders of magnitude."
systems/meta-data-classifier — Meta's ML-based data classifier (first published 2020-07-21) that automatically identifies sensitive data assets at scale. Named in the 2024-08-31 post as the Step-1 auto-discovery input to PZM: "we heavily rely on various techniques such as our scalable ML-based classifier to automatically identify data assets."
systems/hhvm — Meta's HipHop Virtual Machine (runs Hack + PHP); Meta's canonical function-based system where Policy Zones is integrated at the call-tree / request-zone granularity. Hack named as one of the three host languages (alongside C++, Python) for PAI runtime libraries.

Incident response + investigation tooling (2024-08-23 RCA post)¶

systems/meta-rca-system — Meta's AI-assisted root-cause analysis system for web-monorepo investigations. Two-stage architecture: heuristic retriever (code + directory ownership + runtime code-graph) narrows thousands of recent changes to a few hundred; a fine-tuned Llama 2 (7B) ranks via ranking-via-election (B=20, K=5) to a top-5. 42% top-5 accuracy at investigation-creation time on backtested historical investigations. Built with CPT on internal wikis/Q&As/code + mixed SFT with a dedicated 5,000-example RCA-SFT set + a second SFT round teaching logprob-rankable ordered-list output.
systems/hawkeye-meta — Meta's predecessor AI investigation tool (December 2023) for end-to-end ML-workflow debugging. Stub on this wiki; named as the prior generation in Meta's investigation-tooling lineage.
systems/llama-2 — Meta's July-2023 open-weight foundation model family. The 7B variant is the base of the RCA ranker; predecessor of systems/llama-3 / systems/llama-3-1.

Hyperscale benchmarking + hardware co-design (2024-08-05 DCPerf post)¶

systems/dcperf — Meta's open-source benchmark suite for hyperscale compute workloads. Each benchmark anchored to a real Meta production application. Representativeness validated at microarchitectural level (IPC + core-frequency comparison vs production apps and SPEC CPU). Multi-ISA (x86 + ARM), extended for chiplet-based architectures and multi-tenancy / rising core counts. Used internally alongside SPEC CPU. github.com/facebookresearch/DCPerf.
systems/spec-cpu — the industry-standard incumbent DCPerf supplements. Meta's published IPC + frequency graphs are direct evidence of benchmark bias for hyperscale workloads — SPEC CPU under-represents the microarchitectural behaviour of Meta production applications. DCPerf does not replace SPEC CPU; they're used together.

GenAI training substrate (2024-06-12 post)¶

systems/grand-teton — Meta's open-sourced AI training server platform, modified in 2024 to host H100 at 700 W TDP with HBM3, retained air cooling because data-center cooling infrastructure could not change in time.
systems/meta-genai-cluster-roce — one of two 24,000-GPU H100 clusters. Uses RoCE as the inter-node fabric; optimised for fast build time. The largest Llama 3 model was trained on this cluster.
systems/meta-genai-cluster-infiniband — the sibling 24K-GPU H100 cluster on InfiniBand. Evolved from Meta's 16K-GPU AI Research SuperCluster into a production-integrated build; optimised for full-bisection bandwidth.
systems/roce-rdma-over-converged-ethernet — the Ethernet-native RDMA fabric Meta scaled 4K → 24K GPUs for production AI. SIGCOMM 2024 paper adds: DCQCN off at 400G, PFC+admission instead; E-ECMP+QP scaling gains 40% on AllReduce.
systems/ai-zone — Meta's two-stage Clos template (RTSW leaf + CTSW deep-buffer spine + optional ATSW aggregator); non-blocking inside the Zone, oversubscribed across Zones, topology-aware scheduler compensates.
systems/llama-3 — trained on both 24K-GPU clusters in parallel.
systems/llama-3-1 — Llama 3.1 (incl. 405B) trained on the same substrate; flagship workload of the 2024-08-05 SIGCOMM paper.

Fleet maintenance + production engineering (2024-06-16 post)¶

systems/opsplanner — Meta's unified disruptive-work orchestrator; single choke point for every maintenance operation across the fleet. Handles ~1 million operations per day. Owns overlap serialisation, safe drain/return to production, handover flow (avoiding overlaps and deadlocks), pre-return verification, and the planned-maintenance + failure buffers. Used for all Meta capacity — compute and storage — not only AI.

Disaster recovery + region-scale resilience (2026-06-03 PowerLoss Storm post)¶

systems/meta-instantaneous-powerloss-storm — new canonical wiki page for Meta's Instantaneous PowerLoss Storm testing paradigm that validates zero-notice complete power loss of an entire DC region. Follows an incremental blast-radius validation strategy: pre-production → shadow → smallest production → large production regions with critical workloads.
systems/meta-twine — new canonical wiki page for Meta's core container orchestrator (Scheduler, Allocator, Broker, Zelos). Central to region-bootstrap recovery; faces bootstrapping circular dependencies solved via Belljar + Twrko belt-and-braces approach.
systems/meta-power-loss-siren — new canonical wiki page for Meta's battery-backed data-persistence system enabling in-memory data flush during rack power loss.
systems/meta-belljar — new canonical wiki page for Meta's CI/CD framework continuously detecting critical startup dependencies among control-plane services before production deployment.

Data warehouse + query + scheduling¶

systems/presto — distributed SQL query engine, still actively operated at Meta-scale across "tens of thousands of machines" in
The open-source Presto/Trino split (2020) did not retire Meta's internal deployment.
systems/meta-presto-gateway — Meta's internal Gateway fronting every Presto query. Hardened with per-dimension throttling and autoscaling after early outages from unintended DDoS-style internal traffic.
systems/meta-data-warehouse — the multi-datacenter data lakehouse Presto serves; its hardware-provisioning pipeline drives automated Presto cluster standup/decommission.
systems/tupperware — Meta's container/cluster management system (named as an integration hook for Presto cluster turn-up automation).
systems/meta-ptp — Precision Time Protocol deployment at Meta, surfaced via the High Scalability Dec-2022 roundup.
systems/owl-content-distribution — Meta's 800 PB/day centralized-control peer-to-peer content distribution, surfaced via High Scalability.

Real-time communication + audio (2024-06-13 post)¶

systems/mlow-codec — MLow (Meta Low Bitrate), Meta's classical-DSP RTC audio codec. Built on CELP with split-band encoding and range-coded output. 2× POLQA MOS over Opus at 6 kbps, 10% lower compute, SuperWideBand at low bitrate, inband FEC viable at 14 kbps (Opus floor: 19 kbps). Fully shipped on Instagram + Messenger; rolling out on WhatsApp. End-to-end-encrypted. Development timeline: late 2021 → mid-2024.
systems/opus-codec — the 2012 open-source codec Meta used for all RTC before MLow. Reference benchmark point. NarrowBand at its 6 kbps floor.
systems/meta-encodec — Meta's ML-based audio codec (FAIR, October 2022). High quality at low bitrate but "only the very high-end (expensive) mobile handsets are able to run these codecs reliably". The canonical counterexample Meta chose not to ship for RTC in favour of MLow.

Source control + monorepo VCS (2024-09-10 Sapling post)¶

systems/sapling-scm — Meta's source control system (client open-sourced 2022-11-15, announcement post ingested 2024-09-10). Mercurial-lineage, 10-year internal development, scales to "tens of millions of files, tens of millions of commits, and tens of millions of branches." Not a Git fork, though the open-source client speaks Git. sapling-scm.com.
systems/sapling-smartlog — the sl default view. Load-bearing UX primitive; hides irrelevant history behind a dashed line, shows local commits + relevant remotes. Interactive web UI via sl web.
systems/meta-segmented-changelog — Sapling's commit-graph-shape index, downloaded in "just a few megabytes", enabling O(log n) log/blame via segment-graph bisection even on Git-backed repos. Paired with lazy history download for monorepo-scale VCS.
systems/sapling-virtual-fs — Sapling's virtual file system for working-copy scale. Not yet open-sourced as of 2022-11-15. Presents the full repo shape; fetches files lazily on first access; prefetches per-project.
systems/sapling-server — the Rust-implemented Sapling-compatible server. Not yet open-sourced. Substrate for the server-dependent scale features: lazy history download, per-file history graphs, VFS, Commit Cloud, and future Sapling-served Git repos.
systems/commit-cloud-meta — Meta's commit-cloud preview: all commits auto-uploaded; sharing via commit hash + sl goto HASH. Server-dependent, not yet open-sourced.
systems/reviewstack — demo stack-oriented code-review UI for GitHub pull requests at reviewstack.dev. Critique of GitHub's non-stack PR review model.
systems/mercurial — Sapling's open-source ancestor; Sapling started as a Mercurial extension and credits Mercurial's Evolve extension as direct inspiration for mutation history tracking.
systems/watchman — Meta's open-source file-system monitor. Used by Sapling's sl status to avoid full working-copy scans when the VFS isn't deployed.

Developer tools + code indexing (2024-12-19 Glean post)¶

systems/glean — Meta's open-source code-indexing system (open-sourced August 2021). Collects + stores + queries structured facts about source code; one shared, centralized, network-queryable index powers code browsing, code search, auto-generated docs, code review, IDE augmentation, dead-code detection, API-migration tracking, test selection, automated data removal, and RAG in AI coding assistants. Storage: RocksDB. Indexing + query service both distributed; databases replicated across query service machines + centrally backed up. Canonical wiki instance of centralized ahead-of-time indexing. glean.software · github.com/facebookincubator/Glean.
systems/angle-query-language — Glean's declarative logic-based query language (anagram of Glean, "to fish"). Predicates ≈ SQL tables; facts ≈ rows; derived predicates = SQL-view analogue. Schema-level derivation is the mechanism behind language-neutral schema abstraction — language-specific facts underneath, cross-language views projected over them in the schema itself. Prefix-indexed queries over predicate-declared field order; transitive-closure queries (e.g. C++ #include fanout) expressed as fixpoint queries. Published latencies: ~1 ms for name+namespace lookup; few-ms first-results on inheritance-chain queries with incremental streaming of the rest.
systems/glass-symbol-server — Glean's symbol server; one-API-call navigation surface (documentSymbols(repo, path, revision)) behind which language-specific schemas live. Owns per-language symbol IDs that stay stable under code motion. Drives Meta's code browser (embedded Monaco editor), Phabricator review-time navigation, Find References, Call Hierarchy, and symbol-URL-stable documentation rendering. Open source at glean/glass.
systems/rocksdb — Meta's open-source LSM-tree-based embedded key-value store (rocksdb.org); fact-storage substrate for Glean. Immutable SSTable grain aligns naturally with Glean's stacked-immutable-database layering.
systems/lsif — Language Server Index Format, the Microsoft-led LSP-ecosystem code-indexing incumbent Glean deliberately generalises past: Glean is not tied to any one language or any one use case, where LSIF's feature set is shaped by LSP operations.
systems/monaco-editor — Microsoft's VS Code editor as an embeddable component; Meta's internal code browser uses Monaco as its surface, calling Glass for outline + nav rendering.
systems/phabricator — Meta's code-review tool; review-time accurate go-to-definition + type-on-hover + documentation on the diff itself is driven by Glean diff sketches and Glass APIs, covering C++, Python, PHP, JavaScript, Rust, Erlang, Thrift, Haskell.

Anti-abuse rule engine (2015-06-26 Haskell post)¶

for proactively detecting spam, phishing, and malware on Facebook. Every user interaction is evaluated against a set of policies specific to that interaction type; "bad content detected by Sigma is removed automatically so that it doesn't show up in your News Feed." Post-2015-rewrite throughput: >1M rps. Architecture: Haskell sandwiched between two layers of C++ — C++ Thrift server above; C++ service clients below wrapped as Haxl data sources via FFI. Policies are continuously deployed from the repository; "the source code in the repository is the code running in Sigma." implicit concurrent data fetching. Automatically batches calls to the same data source and overlaps calls to distinct data sources without the programmer writing explicit concurrency constructs. GitHub: facebook/Haxl. ICFP 2014 paper: There is no fork: an abstraction for efficient, concurrent, and concise data access. Meta-contributed features: Applicative do-notation (the compiler half of Haxl's rearrangement); heap-management changes reducing GC frequency on multicore; per-thread allocation limits via asynchronous exceptions; GC changes to detect unreferenced old code for safe hot-swap unload; a finalizer fix for clean shutdowns; a GC crash fix for a bug "gone undetected in GHC for several years." policy authoring. Purely functional + strongly typed + mature optimising compiler (GHC) + interactive environment (GHCi) + rich ecosystem. Canonical wiki anchor. moved to after direct Cabal/Hackage use produced cascading version yak-shaves. Canonical wiki datum on curated-set ecosystems vs. SemVer-resolved ecosystems: author-declared SemVer constraints are not a substitute for an externally-curated compatibility matrix at large-dependency-graph scale. Haskell replaced at Sigma. Interpreted (slow); lacked user-defined types + modules. Cautionary datum: in-house DSLs stop paying their cost when complexity growth outruns expressivity and interpreter performance caps hardware utilisation** — both conditions together justified migration.

systems/facebook-reels — Meta's short-form vertical-video surface inside the Facebook app. The surface hosting Friend Bubbles. Engineered as a performance-sensitive surface with three nonnegotiable client constraints (smooth scrolling, no load-latency regression, low CPU overhead). Has an existing optimised video-prefetch window that new per-video metadata piggybacks on (concepts/prefetch-window-metadata-coattending). "First short-form-video recommendation-surface source" on the wiki.
systems/meta-friend-bubbles — the recommendation-system component surfacing Reels a viewer's friends interacted with, rendered as tappable avatar bubbles. Three architectural layers: viewer-friend closeness (two complementary ML models — survey-trained weekly-trillions + context-specific bubble-click), retrieval-ranking funnel modifications (friend-interacted content explicitly retrieved + new features/tasks in both early- and late-stage MTML rankers with a conditional-probability objective P(engage | bubble impression)), and client-side integration (prefetch-window co-attending + conditional animation + conservative prevalence gating). Continuous feedback loop re-trains on bubble-interaction data. "First recommendation-system canonical instance" on the wiki paired with the LLM-ranker sibling systems/meta-rca-system.

Key operational patterns surfaced at Meta¶

Foundational-OSS stewardship patterns (2026-03-02 jemalloc post)¶

patterns/stewardship-reset-for-foundational-oss — new canonical pattern. When a highly-leveraged foundational OSS project has effectively single-vendor stewardship, the stewarding org's short-term product incentives can corrode long-term engineering principles without external correction signal. Meta's 2026 jemalloc post is the canonical wiki instance: (1) admit the drift publicly, (2) re-engage the community on the record (including founder Jason Evans), (3) restore visible community infrastructure (unarchive the repository), (4) publish a forward-looking technical roadmap. Reset evaluated over years of shipped work, not at announcement time.
patterns/upstream-the-fix extended — eighth canonical instance, the upstream-steward-itself variant. The first seven instances (Cloudflare × 2, Datadog, Fly.io × 2, Cloudflare arm64, Meta FFmpeg) are downstream-consumer cases: "I found a bug in an ecosystem primitive; I fix it upstream." Meta's jemalloc reset is the dual: "I am the upstream and I have drifted; here is how I return to discipline." Consumer-side and steward-side together form the two-sided discipline of foundational-OSS maintenance.
concepts/huge-page-allocator — new concept. The allocator subsystem responsible for serving allocations backed by huge pages (2 MiB+) rather than 4 KiB base pages. Targets TLB-miss reduction at hyperscale. jemalloc's HPA is the canonical instance on the wiki.
concepts/transparent-huge-pages — new concept. Linux kernel feature (since 2.6.38) that transparently promotes contiguous 4 KiB pages to 2 MiB pages without requiring hugetlbfs. The zero-config path to huge-page CPU efficiency; jemalloc's HPA primarily targets THP.

WhatsApp Rust-at-scale + media-defense patterns (2026-01-27 Rust-at-scale post)¶

patterns/parallel-rewrite-with-differential-testing — canonical wiki pattern for large C/C++→Rust rewrites of well-specified cross-platform libraries. New implementation runs alongside the old; differential fuzzing enforces parity; staged cutover once parity + memory/performance advantages are demonstrated. Sibling to — but distinct from — patterns/ai-driven-framework-rewrite (Cloudflare vinext; external-API oracle), patterns/rust-replacement-of-dynamic-language-hot-path (Cloudflare FL1→FL2; dynamic-to-static axis), and patterns/language-rewrite-for-concurrency (Dropbox Feast → Go, DSQL JVM → Rust; concurrency-model axis).
patterns/memory-safe-language-for-untrusted-input — the design rule: any code path that processes untrusted input automatically (no user step) should be written in a memory-safe language. wamedia is the canonical client-side instance; Aurora DSQL extensions + Dropbox Nucleus are the server-side + sync-engine siblings.
patterns/format-aware-malware-check-before-os-handoff — canonical wiki pattern for Stagefright-class ungovernable-OS-library mitigations. App validates format + spoof + risk + dangerous-type before handing bytes to the OS parser it cannot patch.
patterns/bug-bounty-research-proxy — vendor publishes a research-facilitating tool (here: the WhatsApp Research Proxy) to lower the barrier for external protocol research while concentrating that research at a controlled endpoint.
concepts/memory-safety extended — the client-side cross-platform library instance, complementing the server-side (DSQL), managed-runtime (Datadog Go), and sync-engine (Nucleus) instances already on the wiki.
concepts/parser-differential extended — media-file / OS-library variant of the attack class, plus the canonical app-layer defensive posture ("one parser in front of many ungovernable parsers, reject divergent inputs"). Complements the ruby-saml "one parser for security boundaries" posture.
concepts/differential-fuzzing — canonical wiki first instance (new concept page).
concepts/attack-surface-minimization — Meta's first-of-three pillars (verbatim: "Design the product to minimize unnecessary attack surface exposure"), canonicalised as a new wiki concept.
concepts/os-library-vulnerability-ungovernable — new concept; the architectural forcing function (OS libraries are outside the app's patch authority).
concepts/patch-lag — new concept; user-side delay between fix release and installed-base update (measured in months for mobile OS libraries).
concepts/format-conformance-check — new concept; primitive check family inside Kaleidoscope.
concepts/file-type-spoofing — new concept; sibling check family inside Kaleidoscope.
concepts/cross-platform-client-library — new concept; wamedia's design axis. Captures the build-system + binary-size tax that cross-platform Rust on mobile pays.
concepts/defense-in-depth extended — the client-side / media-processing / ungovernable-downstream-parser variant.
concepts/binary-size-bloat extended — the mobile-distribution-channel-constraint variant. Counter-intuitive source-code result (LoC inverted) paired with the real binary-size tax of Rust's stdlib on mobile.

Flash media tiering + QLC software stack (2025-03-04 QLC post)¶

patterns/middle-tier-storage-media — canonical Meta instance: insert QLC between HDD and TLC when the lower tier's BW/TB has fallen below workload needs and the upper tier is overpaid for the gap band. Discipline: match target workload shape (read-BW-intensive, batch IO, low-write) to the new media's strengths; accept non-TCO-parity at launch if density + power-efficiency justify it.
patterns/userspace-ftl-via-io-uring — ublk + io_uring + userspace FTL stack. Pure Storage's DirectFlash software is the canonical 2025 instance; the pattern generalises to any vendor-specific flash management that benefits from host-side visibility + control. For standard NVMe QLC SSDs the stack simplifies to io_uring directly against the NVMe block device.
patterns/rate-controller-for-asymmetric-media — software-side arbitration for media with asymmetric R/W throughput. QLC's 4×+ read-vs-write gap forces host-side scheduling of writes so latency-sensitive reads don't queue behind them. Composes with userspace-FTL because that's where the scheduler has full pending-write visibility.
concepts/bandwidth-per-terabyte — the organising axis for Meta's three-tier HDD/QLC/TLC hierarchy. Canonical wiki first instance with explicit bands (HDD ~5-10, QLC 10-20, TLC 50+ MB/s/TB).
concepts/storage-media-tiering — the general tier-structure concept. Meta's QLC post is the canonical wiki instance of inserting a new media tier via density + endurance + workload-match.
concepts/qlc-read-write-asymmetry — the media-level property that makes rate-control load-bearing.
concepts/write-endurance-nand — the historical QLC blocker, now managed via workload-matching.
patterns/co-design-with-ocp-partners (extended) — Meta × Pure Storage is a new bilateral co-design partnership on flash media, extending the Meta × Microsoft OCP lineage to a third orthogonal hardware subsystem (power → networking → GPU → flash).

Format-aware compression patterns (2025-10-06 OpenZL post)¶

patterns/offline-train-online-resolve-compression — canonical wiki pattern for format-aware compression: offline trainer consumes shape description + sample data → emits a Plan (possibly a Pareto set) → encoder resolves Plan to concrete Resolved Graph per frame → universal decoder reads Resolved Graph and executes. Training cost amortised over many frames; Plans are first-class config objects.
patterns/embedded-decode-recipe-in-frame — ship the Resolved Graph inside the frame so each frame is self-contained and any decoder instance can decode it without out-of-band config. The substrate for the universal decoder property.
patterns/fallback-to-general-purpose-compressor — safety net for inputs with no exploitable structure (pure text / unknown formats). OpenZL's worst case is zstd-equivalent, not worse, because the trainer can always select a Plan that reduces to "just run zstd." Canonical wiki reference. Sibling at higher abstraction level: concepts/llm-cascade (cheap-specialised-first, generalist-as-fallback).
patterns/graceful-upgrade-via-monoversion-decoder — keep the decoder binary version-stable across Plan evolution. Old Plan frames decode on any decoder; new Plan frames decode on any decoder; a decoder update (SIMD / bounds / scheduling) improves every frame including historic ones. Rolls out new Plans without waiting on consumer fleets.
concepts/format-aware-compression — the architectural category OpenZL defines. Canonical wiki concept.
concepts/universal-decoder — the decoder-side invariant the entire OpenZL architecture is organised around. Four enumerated deployment benefits: single audit surface, fleet-wide improvements, operational clarity, continuous training. Canonical wiki concept.
concepts/compression-plan — the config object that flows through the train→resolve→decode pipeline; replaces the scalar "compression level" knob with a serialised compressor graph.
concepts/reversible-transform-sequence — the pre-entropy-coding pipeline of reversible transforms applied per the Plan.
concepts/structure-of-arrays-decomposition — the canonical first transform (AoS → SoA) that makes per-field strategies possible.
concepts/delta-encoding — transform picked for mostly-sorted numeric streams. Canonical wiki concept.
concepts/tokenize-transform — transform picked for low-cardinality streams; emits dictionary + index list, each routed to its own subgraph. Canonical wiki concept.
concepts/runtime-control-point-compression — per-frame branch points inside a Plan; read lightweight statistics, pick a subgraph, record the choice in the frame. Adapts within-Plan without re-training; zero decoder complexity.

Open AI hardware + OCP patterns (2024-10-15 OCP AI-hardware-vision post)¶

patterns/open-hardware-for-ai-scaling — Meta's thesis: AI scale requires the hardware layer to move at the pace of the software layer, which requires open-source contribution rather than vendor-locked designs. Canonical wiki instance.
patterns/modular-rack-for-multi-accelerator — Grand Teton extended to NVIDIA H100 + AMD MI300X; Catalina extending the same principle to GB200 rack-scale. Preserved "single monolithic system design with fully integrated power, control, compute, and fabric interfaces" across accelerator vendors.
patterns/co-design-with-ocp-partners — Meta × Microsoft OCP lineage: SAI (2018) → OAM → Mount Diablo (2024). Bilateral collaboration as the operating model for cross-industry hardware standardization.
concepts/network-fabric-disaggregation — DSF as canonical wiki instance of the architectural stance of splitting a vertically-integrated fabric into open vendor-replaceable layers (ASIC / NOS / control-API / endpoint protocol / NICs).
concepts/liquid-cooled-ai-rack — Catalina at 140 kW breaking Meta from the air-cooled constraint that defined Grand-Teton-H100 in 2024-06.
concepts/injection-bandwidth-ai-cluster — Meta's forward projection: ~1 TB/s per accelerator injection bandwidth (> 10× today's networks).
concepts/bisection-bandwidth — projected to scale in lockstep with injection bandwidth ("equal normalized").
concepts/400-vdc-rack-power — Mount Diablo canonical instance of high-voltage DC power delivery at the rack level.
concepts/rack-level-power-density — extended via Catalina's 140 kW envelope, at the hyperscaler-AI end of the wiki's power-density spectrum (opposite end from Dropbox's ~16 kW air-cooled storage rack).

Privacy + information flow control patterns (2024-08-31 PAI post)¶

patterns/runtime-information-flow-enforcement — Meta's canonical statement of runtime IFC as the enforcement primitive for privacy constraints at hyperscale. Replaces point-checking controls + data lineage audit-and-ACL combo with runtime label-propagation + flow-rule evaluation integrated directly into HHVM / Presto / Spark. Canonical wiki reference.
patterns/logging-mode-to-enforcement-mode-rollout — the rollout pattern Policy Zones codifies: detect and record violations in production without blocking, remediate, then flip to enforcement. A correctness-constraint rollout pattern distinct from traffic-axis staged rollout; sibling of patterns/data-driven-allowlist-monitoring-mode on the security axis.
patterns/end-to-end-use-case-first — Meta's first "lesson learned": ship one concrete end-to-end use case through all integration targets before generalising the platform. Function-based-system design was too abstract for its first real large-scale customer; "refining the APIs and building missing operational support" was what made it work end-to-end. Reusable platform-bring-up discipline.
patterns/separate-annotation-from-requirement — Meta's fourth "lesson learned": keep data annotations simple (labels only — BANANA_DATA) and express per-requirement flow rules separately. A monolithic annotation API broke under multi-requirement composition with "data annotation conflicts that were difficult to resolve." Reusable schema-design pattern for IFC / policy systems.
concepts/information-flow-control — the classical primitive (Denning 1976) Meta adopts as its hyperscale privacy-enforcement primitive. First canonical industrial-IFC instance on the wiki.
concepts/purpose-limitation — the privacy principle PAI currently enforces. "Data is only processed for explicitly stated purposes."
concepts/point-checking-controls — the approach Meta outgrew: if statements + ACLs at the point of data access, requiring human audits and physical data separation.
concepts/data-annotation — the metadata-label primitive that makes runtime IFC operational on real codebases and data systems; granularity from table/column/row/cell (batch) to parameter/variable/return-value (function-based).
concepts/data-flow-violation — the runtime event Policy Zones detects; three remediation cases (safe / unsafe / reclassified) exposed via PZM.
concepts/logging-vs-enforcement-mode — two-phase enforcement-severity primitive underneath the rollout pattern.
concepts/policy-lattice — Denning's 1976 lattice model of security labels. Meta's representation + evaluation simplification is one lever for the "10× improvements in computational efficiency."
concepts/shift-left-privacy — the named engineering stance: privacy enforced at code execution + developer workflow, not at audit.
concepts/data-lineage — extended framing: discovery primitive (retained inside PZM Step 2) but explicitly rejected as a sufficient enforcement primitive at Meta scale.

Incident-response + AI-assisted RCA patterns (2024-08-23 RCA post)¶

patterns/retrieve-then-rank-llm — canonical wiki pattern: cheap heuristic retriever narrows the population, LLM ranker produces the top-K. Meta's RCA variant: directory-ownership + runtime-code-graph retrieval → Llama-2 ranking-via-election → top-5. Canonical wiki reference.
concepts/heuristic-retrieval — stage-1 retrieval via domain-rule-encoded narrowing (ownership metadata + runtime code-graph traversal). Cheap, interpretable, reproducible; depends on the monorepo substrate's structured ownership.
concepts/llm-based-ranker — the architectural role a fine-tuned LLM plays in the two-stage cascade. Output modes: natural-language top-K (via ranking-via-election) + logprob-ranked list (via a dedicated SFT format).
concepts/ranking-via-election — the tournament-style prompt-structure primitive for candidate sets larger than context window. Meta's B=20, K=5 configuration collapses ~few-hundred → 5 in O(log N) rounds. Canonical wiki reference.
concepts/supervised-fine-tuning — the task-teaching stage of Meta's training pipeline. Two-round SFT: mixed-SFT (original SFT data + internal context + RCA SFT dataset) + a second SFT round producing logprob-ranked ordered-list output. First canonical wiki SFT page.
patterns/closed-feedback-loop-ai-features — Meta's explicit safety discipline for employee-facing AI: reproducibility + explainability + feedback channel. Canonical wiki reference. "Responders can independently reproduce the results generated by our systems to validate their results."
patterns/confidence-thresholded-ai-output — the precision-over-reach primitive that pairs with the closed-feedback-loop discipline. "Confidence measurement methodologies to detect low confidence answers and avoid recommending them to the users — sacrificing reach in favor of precision." Canonical wiki reference.
concepts/automated-root-cause-analysis — the capability class. The 2023 Presto-analyzer framing (multi-source aggregation + rule-encoded heuristics + auto-remediation) and this 2024 LLM-ranker system are sibling realisations — same closed-feedback-loop posture + multi-source retrieval, different stage-2 reasoning substrate.
concepts/continued-pretraining — extended with Meta's small-base-model / proprietary-corpus RCA variant. Complements eBay's 2025 e-Llama continued-pretraining at 1T tokens on Llama 3.1 70B.
concepts/monorepo — the substrate whose structural affordance (structured ownership + code graph) makes heuristic retrieval tractable at scale; directly enables the RCA system's stage-1.

Fleet maintenance + production engineering patterns (2024-06-16 post)¶

patterns/maintenance-train — Meta's canonical fleet-maintenance primitive: cyclic small-batch drains of a maintenance domain, contract "all capacity minus one maintenance domain" up 24/7, bounded full-visit cycle time. Used for all Meta capacity — compute and storage, not only AI. Canonical wiki reference.
patterns/gradual-rollout-layered-by-stack-depth — two-layer rollout discipline: pin the job-facing layer (CUDA library + container) consistent across the cluster; slide the lower-level layer (firmware/drivers/kernel/OS) gradually through sliding-window rollouts. Reboot-required lower-layer upgrades take hours and cannot realistically be lock-stepped at Meta scale; container-layer restarts are cheap.
concepts/maintenance-domain — the sized unit of capacity drained per maintenance action. Sizing trade-off: smaller domain = smaller buffer cost vs larger domain = fewer interruptions. Meta tunes AI-capacity domains toward lower interruption rate because synchronised training jobs pay whole-job cost for any interruption.
concepts/overlapping-rollouts — Meta's explicit acceptance that at hyperscale, rollouts cannot be serialised into single-version-state windows. Architectural response: enforce pairwise component compatibility rather than cluster-wide coherence. Structural flip from small-scale practice.
concepts/host-consistency-sliding-upgrade — the two-layer discipline that makes overlapping rollouts safe for synchronised AI jobs. Upper layer (CUDA + job container) pinned cluster-wide; lower layer allowed to drift during rollout; compatibility matrix engineered per (pinned-upper × in-flight-lower) pair; pre-return verification gate enforces consistency at return-to-service.
concepts/fleet-patching — the capability class. Meta's variant is internal-compute-fleet with capacity-guarantee contract — distinct from MongoDB Atlas's managed-database variant, which has customer-facing windows but no capacity floor.
concepts/maintenance-window — the service-to-service contract. Meta's "contract with services that allows them to avoid maintenance-train interruptions, if possible" is the internal-service-to-service variant of the MongoDB-Atlas customer-to-vendor variant.
concepts/bad-host-detection — "bad hosts are very bad" is one of the five demanding GPU-training properties Meta names. Synchronised-job cost model: one bad host damages the whole job, not just its own share (superlinear vs proportional cost in the stateless-serving variant).

Hyperscale benchmarking + hardware co-design patterns (2024-08-05 DCPerf post)¶

patterns/workload-representative-benchmark-from-production — Meta's load-bearing design rule for DCPerf: "Each benchmark within DCPerf is designed by referencing a large application within Meta's production server fleet." Canonical wiki reference. Validated publicly via IPC + core-frequency comparison graphs. Hardware-evaluation sibling of the application-layer custom harness at Figma.
patterns/pre-silicon-validation-partnership — Meta's two-year collaboration with "leading CPU vendors" running DCPerf on pre-silicon / early-silicon setups, yielding "optimizations in areas such as CPU core microarchitecture settings and SOC power management optimizations." Open-sourcing DCPerf expands the pattern beyond Meta↔vendor to industry↔vendor.
concepts/hyperscale-compute-workload — canonical statement that hyperscale is a distinct workload market segment from HPC / traditional enterprise, with distinct microarchitectural behaviour. First wiki canonicalisation.
concepts/benchmark-representativeness — measurable-property concept DCPerf optimises for: match production IPC + frequency distributions. Inverse of concepts/benchmark-methodology-bias — Meta's evidence that SPEC CPU exhibits workload-shape bias for hyperscale is the representativeness-comparison graph.
concepts/benchmark-methodology-bias — extended by this post into the hyperscale-hardware-procurement axis (complementing the Cloudflare Workers iteration-level-noise axis). SPEC CPU is biased for hyperscale not because of a sampling error but because of its workload-origin population (HPC / enterprise).

GenAI training patterns (2024-06-12 post)¶

patterns/build-both-fabric-alternatives — Meta's decision to simultaneously build a RoCE 24K-GPU cluster and an InfiniBand 24K-GPU cluster to learn operationally rather than forecast the fabric tradeoff. Canonical wiki reference.
patterns/dedicated-backend-training-fabric — physical FE/BE rack wiring + dedicated BE Clos so the training fabric can evolve on its own schedule. Canonical wiki reference from 2024-08-05.
patterns/collective-library-transport-codesign — NCCL + RoCE transport + switch QoS as a single designed system. Meta's DCQCN-off posture only works because CTS-based admission + CTS-prioritised switch queues + deep-buffer CTSW + PFC all work together.
patterns/minimum-cut-training-job-placement — topology-aware scheduler computes min-cut partition across AI Zones + recommends rank assignments so cross-Zone traffic (oversubscribed ATSW layer) is minimised.
patterns/data-center-density-optimization — evict non-compute services (readers) from the GPU data hall; pack GPU racks densely within a single network cluster; accept air-cooling constraints when cooling infrastructure can't be changed in time.
concepts/hardware-reliability-at-scale — failure rate scales with GPU count; Meta's operational response is monitoring + automation + spare capacity.
concepts/gpu-training-failure-modes — GPU-falls-off-PCIe, DRAM/SRAM uncorrectable errors, network-cable failures; early-life biased distribution.
concepts/training-checkpoint — efficient preservation of training state as a named first-class scaling property at 24K-GPU scale.
concepts/collective-communication-topology-awareness — replacing default ring-allreduce with recursive-doubling/halving algorithms for latency-sensitive collectives at 24K-GPU scale.
concepts/fat-flow-load-balancing — training produces long-lived, high-bandwidth flows that defeat ECMP hashing; explicit routing + load balancing investment required, especially for the RoCE cluster. The 2024-08-05 SIGCOMM paper adds the full evolution timeline: baseline ECMP → concepts/path-pinning (failed under partial rack allocation; 30%+ degradation) → E-ECMP + QP scaling (+40% AllReduce).
concepts/ecmp-equal-cost-multipath / concepts/rdma-queue-pair / concepts/enhanced-ecmp-qp-scaling / concepts/path-pinning — the concepts that form Meta's RoCE routing stack.
concepts/dcqcn / concepts/priority-flow-control / concepts/receiver-driven-traffic-admission — Meta's unconventional congestion-control posture: DCQCN off, PFC-only + NCCL CTS handshake as library-level admission.
concepts/backend-frontend-network-separation — the dual-network rack wiring Meta uses to let the BE training fabric evolve independently.

Query/warehouse patterns¶

Canary + shadow cluster rollout — dual-stage deployment pipeline for Presto engine releases.
Bad-host auto-drain — attribute query failures per-host, alert on anomalous rates, auto-drain from the fleet.
Automated cluster standup/decommission — full hardware → serving cluster pipeline with no manual steps.
Gateway throttling by dimension — per-user, per-source, per-IP, and global query admission control at the Presto Gateway.
Gateway autoscaling — horizontal elasticity for the query-gateway tier under adversarial traffic.

RTC audio-codec patterns (2024-06-13 MLow post)¶

patterns/classic-dsp-over-ml-for-compute-constrained — canonical Meta example: built MLow (classical CELP + refinements) rather than ship Encodec (ML) because >20% of Meta calls are on ARMv7 devices and 10s of millions of daily WhatsApp calls are on 10+-year-old phones. ML codec quality doesn't matter if the target devices can't run it.
patterns/aggressive-fec-at-low-bitrate — MLow's lower bitrate floor creates FEC headroom; Meta spends that headroom on redundancy rather than fidelity. Future work explicitly targets "pumping out more redundant audio, which MLow allows us to do efficiently."
patterns/bandwidth-adaptive-codec-mode — RTC codec operating point is driven by a bandwidth-estimation module; lower operating floor is a first-class codec feature.
concepts/low-end-device-inclusion — the Meta-specific framing of the constraint: codec/ML/rendering choices must serve the low-end device population, not just flagship handsets.

Source-control + monorepo-VCS patterns (2024-09-10 Sapling post)¶

patterns/usability-first-vcs-cli — Sapling's canonical instance: design the VCS CLI so every command does one thing, defaults are sensible, the default view (sl alone) shows smartlog, and recovery commands (undo/redo/uncommit/unamend/hide/unhide) are first-class. Explicitly sized by Meta as a support-headcount multiplier — "the Sapling development team is small, and in order to support our tens of thousands of internal developers, we needed to make it as easy as possible to solve your own issues and get unblocked."
patterns/vcs-undo-tooling — treat undo as a subsystem, not a recovery procedure. Named commands + interactive sl undo -i scroller through old smartlog views. Post-quote: "never again should you have to delete your repository and clone again to get things working." Substrate: mutation history (Mercurial-Evolve-inspired).
patterns/first-class-commit-stack-workflow — Meta's unit of code review is the stack of small commits, not the pull request. Sapling makes this ergonomic via sl goto + sl amend
sl restack + sl absorb + sl amend --to COMMIT + sl fold + sl split, all safe under mutation-history tracking. Canonical wiki instance. Pairs with ReviewStack on the review-UI side.
patterns/lazy-history-on-demand — scale-side pattern: clone downloads almost nothing; history data is fetched on demand; queries stay fast via segment- graph bisection on a megabyte-scale Segmented Changelog index. Pattern-family cousin of patterns/blobless-clone-lazy-hydrate (Cloudflare Artifacts / ArtifactFS).
patterns/organization-owned-sparse-profile — the architectural move that makes sparse checkout operationally viable at scale: check the sparse-checkout profile into the repo; product teams own profile; engineers opt in by name; dependency changes update the profile; every engineer picks up the new state on next checkout / rebase. Canonical wiki instance — "thousands of engineers to work on a constantly shifting subset of the repository without ever having to think about it."
concepts/vcs-usability — the canonical wiki concept: usability in a VCS is a first-class, independent design axis orthogonal to scale. Sapling's thesis is that a VCS can invest in both simultaneously.
concepts/commit-stack — the reviewable-unit primitive.
concepts/mutation-history-commit — the substrate for stack-editing and undo.
concepts/lazy-history-download / concepts/segmented-changelog / concepts/commit-graph-bisection — the history-scale primitive trio.
concepts/virtual-filesystem-for-monorepo / concepts/sparse-checkout / concepts/sparse-profile — the working-copy-scale primitives.
concepts/monorepo — extended with the Meta-scale upper-bound framing as the regime where even a tuned Git on a managed SaaS stops being viable.

Developer tools + code-indexing patterns (2024-12-19 Glean post)¶

patterns/centralized-ahead-of-time-indexing — Meta's load-bearing decision to run code indexing on a shared fleet ahead of time, replicate the databases across the query service, and expose the index to clients over the network instead of downloading it. Canonical wiki instance = Glean.
patterns/language-neutral-schema-abstraction — keep the detailed language-specific schemas underneath, define language-neutral views over them in the schema language itself (Glean's derived predicates in Angle; analogous to SQL views). Canonical wiki instance = Glean + Angle. Lets Glass expose one uniform navigation API without forcing the underlying data to be lowest-common-denominator.
patterns/diff-based-static-analysis — index each diff to produce a machine-readable diff sketch, then fan out to static analysis, semantic lint, rich notifications, commit-level semantic search (production stack-trace → recent-touching-commits), and review-time go-to-definition inside Phabricator. Canonical wiki instance = Glean + Phabricator across 8 languages.
concepts/incremental-indexing — process just the changes; target O(changes), realistic floor O(fanout). Glean's position: the index is perpetually out of date at monorepo scale unless you commit to incrementality. Fanout closure itself is an Angle query.
concepts/stacked-immutable-databases — representation substrate for incremental indexing: non-destructive layered adds/hides on top of a base database, queried as a single logical view per revision, delta-sized storage. Mechanism disclosed but details deferred.
concepts/symbol-id — per-language stable string handle for symbols; URLs to docs + references survive code-motion. Glass owns the ID assignment.
concepts/code-indexing — the general primitive; Glean is the canonical wiki instance of the IDE-local-to-centralized shift.

Anti-abuse rule-engine patterns (2015-06-26 Haskell post)¶

"source code in the repository is the code running in Sigma" operational posture. Minutes from commit to fleet. Requires: a purely functional strongly typed policy language (type-correct-or-rejected at repo ingress), hot-code swapping runtime, policy-language performance within the request-path budget (no perf-critical logic trapped in the slower-deploying C++ layer), and interactive testing against production data. Canonical wiki instance = Meta Sigma. Haskell-between-two-layers-of-C++ integration pattern: mature C++ Thrift server on top, existing C++ service-client libraries below wrapped as Haxl data sources via FFI (compile-time C++ name-demangler removes intermediate C shim for most calls). Get the managed-runtime properties (purity, type safety, implicit-concurrent fetching, hot-swap) exactly where they pay off (the request-evaluation layer) without rewriting transport or client libraries. Canonical wiki instance = Meta Sigma. live policy reload. Sigma's three enabling conditions: short-lived requests (no in-flight migration), persistent-state code is never hot-swapped, GHC's reachability- based garbage collector detects when old compiled code is no longer referenced and triggers safe unload. Applicative do-notation together. The programmer writes pure functional code that looks sequential; the framework + compiler-level do-block rearrangement together batch same-source fetches and overlap fetches on different sources, without explicit concurrency constructs. Canonical industrial instance. Compiler-language co-design in service of a production concurrency property — a library alone cannot rearrange statements without changing the language. by the runtime, with safe termination via asynchronous exceptions. Meta added this to GHC upstream specifically for Sigma's multi-tenant request isolation. The thread is the blast-radius unit — a pathological request is terminated without affecting peer requests. Cooperative-scheduler sibling of OS-level RLIMITs; thread-scoped variant of concepts/blast-radius containment. property set Meta explicitly ties to operational safety: policies cannot interact, cannot crash the engine, are isolable for unit testing; strong types eliminate many bugs pre-production. Canonical wiki statement of language-as-production-safety- primitive** for rule engines; pairs with type-correct-at-ingress as the repo-level safety gate.

patterns/survey-trained-closeness-model — new canonical pattern. Train an ML closeness model (or any latent-relationship-quantity model) against refreshed direct-survey labels rather than platform-activity proxies. Use platform activity as features, use survey answers as labels — the separation that breaks optimise-to-proxy failure modes. Meta's survey is "lightweight binary" (close vs not close) with proxy questions (e.g. "how often do you communicate") as additional signal. Weekly inference over trillions of pairs. Extends the wiki's ground-truth-labelling family alongside patterns/human-calibrated-llm-labeling and patterns/human-in-the-loop-quality-sampling.
patterns/conditional-probability-ranking-objective — new canonical pattern. Add a new ranking signal to an existing multi-objective ranker as a conditional-probability term P(outcome | new condition) with a tunable weight, not a new formula. Compatible with MTML as a new head. Meta's instance: P(video engagement | bubble impression) balances social-interaction with video-engagement objectives without abandoning existing tuning.
patterns/conditional-animation-for-scroll-performance — new canonical pattern. Gate UI animations on two conditions — interaction state (off during active scroll) + device class (off entirely on low-end devices) — rather than animating unconditionally. Treat animation as a budget item. Extends Meta's low-end-device inclusion posture (previously MLow-audio-codec-only) to client-UI-rendering; complements prefetch-window metadata co-attending as the fetch-side sibling discipline.
patterns/closed-feedback-loop-ai-features — extended with the Friend Bubbles recommendation-system instance. Bubble-impression + engagement data flows back into training so MTML models keep learning friend-content resonance. Now canonical across RCA + Kotlinator + Friend Bubbles — three distinct Meta product domains.
patterns/retrieve-then-rank-llm — extended: Friend Bubbles is the recommendation-system + MTML-ranker sibling instance to the RCA-system + LLM-ranker canonical instance. Both use heuristic stage-1 retrieval + a heavier stage-2 ranker; both demonstrate the retriever-recall-is-the-ceiling principle Meta states explicitly in Friend Bubbles ("high-quality friend content may never enter the ranking pipeline in the first place").
concepts/retrieval-ranking-funnel — new concept. The canonical two-stage recommendation-system architecture generalised from the LLM-specific pattern. Explicit top-of-funnel expansion when a new candidate class is missing.
concepts/viewer-friend-closeness — new concept. The ML-estimated social-relationship strength used as retrieval threshold + ranker feature. Weekly inference over trillions of pairs; canonical precomputed-feature pairing with concepts/feature-store.
concepts/multi-task-multi-label-ranking — new concept. The ranker-architecture class at both early- and late-stage Reels ranking, the natural host for new conditional-probability tasks.
concepts/prefetch-window-metadata-coattending — new concept. The client-side primitive piggybacking new per-video metadata on the existing video-prefetch path so scroll + playback don't regress.

Tier classification¶

Tier 1 — canonical hyperscale engineering source; systems and patterns surfaced from Meta posts cross-reference heavily with AWS, Google, Netflix, Cloudflare, LinkedIn.