Meta (Facebook)¶
Meta (formerly Facebook) is a hyperscale social / advertising / VR / AI company whose engineering blog (engineering.fb.com) is a Tier-1 source on the sysdesign-wiki. Distinctive Meta systems frequently cited elsewhere in the corpus: the Presto-fronted data warehouse, Tupperware (containers/cluster management), TAO (graph store), Haystack (photo storage), PTP (precision time), Owl (content distribution), and — as of 2024 — the Grand Teton H100 platform underpinning Meta's two paired 24K-GPU GenAI training clusters (one RoCE, one InfiniBand) on which Llama 3 was trained.
For this wiki, Meta pages primarily come through two channels:
- First-party Meta Engineering posts — architectural deep-dives, data warehouse / infra / ML / networking internals.
- Meta-authored pieces republished on secondary aggregators like High Scalability — operational retrospectives co-authored by Meta production engineers.
Recent articles¶
- 2026-04-16 — Post-Quantum Cryptography Migration at Meta: Framework, Lessons, and Takeaways (first-party Meta Engineering; Security) — Meta's Security team publishes a programme-level strategy paper on its multi-year PQC migration. Headline governance primitive: the PQC Migration Levels ladder — PQ-Unaware → PQ-Aware → PQ-Ready → PQ-Hardened → PQ-Enabled — organised around time to react to a relevant quantum event (shorter is better). Even PQ-Ready — "not a desirable end goal given the fact it is not yet protecting the use case against quantum attacks" — is valuable because it "reduces the time to react." PQ-Hardened exists for use cases where literature gaps (efficient PQ-OPRFs) prevent full enablement today. Three-tier prioritisation framework classifies applications by attack class, not asset value: High (offline-attackable via Shor — public-key encryption + key exchange — SNDL-vulnerable; split by external-dependency status), Medium (online-attackable via Shor — digital signatures — split by patching capability: hardware-baked keys vs software-upgradable), Low (Grover-only with inadequate parameters — symmetric). Cryptographic inventory via the canonical new patterns/automated-discovery-plus-developer-reporting pattern: Meta's 2024 Crypto Visibility service is the automated-discovery leg ("high-fidelity data on active usage within our primary libraries"), complemented by developer reporting for "edge cases or shadow dependencies" and "cryptographic intent for new architectures".
Three external dependencies the migration blocks on (canonical extension of patterns/third-party-dependency-quantum-assessment with the consumer-side angle): (1) community-vetted PQC standards — NIST FIPS 203 / 204 / 205 + HQC drafting; IETF RFC 8554 / 8391 + TLS drafts; Meta co-authored HQC (NIST-selected 2025), BIKE, and Classic McEliece; (2) PQC support in hardware — Meta working with HSM + CPU vendors; (3) production-level implementations — Meta is a Linux Foundation PQCA member and contributes to liboqs including bug fixes. Algorithm selection: stick to NIST-recommended — ML-KEM-768 default / ML-KEM-512 exception; ML-DSA-65 default / ML-DSA-44 exception; SPHINCS+ and Falcon "considerably harder" to deploy than ML-DSA; HQC is the non-lattice diversity hedge "if weaknesses are discovered in ML-KEM or its modular lattices approach." PQC guardrails (canonical new patterns/crypto-api-guardrails pattern) prevent new vulnerable-code creation via three layers: (1) update internal crypto guidelines; (2) friction on key-generation tooling for vulnerable primitives; (3) build-system rules in Buck that "warn teams during code review" on RSA / ECDH use — the code-review-gating posture applied to crypto APIs. Hybrid over replacement: Meta prioritises layering PQ on top of classical "designed so that the combined system should remain at least as secure as the current standard" — cites SIKE's 2022 cryptanalytic invalidation as precedent forcing caution during the transition period. Four migration principles: Effectiveness, Timeliness, Performance, Cost Efficiency — PQC as constrained-optimisation, not security-at-all-costs. Strategy-paper voice — no production deployment numbers (no fleet percentages, no latency data, no timelines beyond "multi-year"); no specific products named beyond "our internal infrastructure" (though acknowledgements enumerate Transport Security, WhatsApp, Facebook/Messenger, Infrastructure, Reality Labs, Hardware, Payments teams).
First canonical PQC-migration-strategy paper on the wiki — complements the inventory-side (sources/2024-12-02-meta-built-large-scale-cryptographic-monitoring|2024-12-02 monitoring), rollout-shape (GitHub / Cloudflare), and disclosure (Google) instances with the governance + program-management angle.
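The hybrid-over-replacement construction — PQ layered on top of classical so the combined system is at least as secure as either component alone — can be sketched minimally. All helper names below are hypothetical; the post does not disclose Meta's actual combiner, only the ML-KEM defaults and the design goal:

```python
import hashlib
import hmac
import os

def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32) -> bytes:
    """Minimal HKDF-SHA256 (RFC 5869, zero salt) used as the combiner KDF."""
    prk = hmac.new(b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, t, counter = b"", b"", 1
    while len(okm) < length:
        t = hmac.new(prk, t + info + bytes([counter]), hashlib.sha256).digest()
        okm += t
        counter += 1
    return okm[:length]

def hybrid_shared_secret(classical_ss: bytes, pq_ss: bytes) -> bytes:
    # Concatenate both shared secrets before the KDF: the derived key stays
    # secure as long as EITHER key-agreement survives cryptanalysis — the
    # property that motivated caution after SIKE's 2022 break.
    return hkdf_sha256(classical_ss + pq_ss, b"hybrid-kex-demo")

# Stand-ins for the two key-agreement outputs (e.g. X25519 and ML-KEM-768).
classical = os.urandom(32)
pq = os.urandom(32)
key = hybrid_shared_secret(classical, pq)
```

The design choice is that a failure of the PQ component degrades to classical security, not to nothing — which is why "hybrid" is preferred over outright replacement during the transition.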
- 2026-04-16 — Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale (first-party Meta Engineering; Developer Tools) — Meta's Capacity Efficiency team describes a unified AI-agent platform (Capacity Efficiency Platform) for hyperscale performance engineering, built on two layers: MCP Tools (profiling queries · experiment results · configuration history · code search · documentation) + Skills (domain-expertise modules encoding senior-engineer reasoning patterns). Canonical wiki instance of patterns/mcp-tools-plus-skills-unified-platform — "both problems share the same structure... we didn't need two separate AI systems. We needed one platform that could serve both." The platform unifies offense (proactively finding + shipping optimizations) and defense (catching + mitigating regressions) with the same tool layer and different skills. On defense: the AI Regression Solver — a new component of FBDetect (Meta's in-house regression-detection tool, SOSP 2024, catching 0.005% regressions in noisy production, thousands weekly) — fully automates the path from detected regression to fix-forward PR sent to the root-cause author for review. Canonical patterns/ai-generated-fix-forward-pr instance replacing the rollback-vs-ignore binary with automated mitigation. Three-phase pipeline (shared with offense): gather context with tools → apply skill (e.g. "regressions from logging can be mitigated by increasing sampling") → create resolution. On offense: Opportunity Resolver — engineer views candidate optimization → requests AI-generated PR → agent gathers opportunity metadata + pattern docs + examples + files + validation criteria → applies skill (e.g. "memoizing a given function to reduce CPU usage") → produces fix with guardrails (syntax/style/right-issue) → surfaces in engineer's editor for one-click apply.
Platform compounding: "within a year, the same foundation powered additional applications: conversational assistants for efficiency questions, capacity-planning agents, personalized opportunity recommendations, guided investigation workflows, and AI-assisted validation. Each new capability requires few to no new data integrations since they can just compose existing tools with new skills." Program-level impact: hundreds of megawatts of power recovered — "enough to power hundreds of thousands of American homes for a year"; investigation compression ~10 hours → ~30 minutes (~20×); AI handling the long tail of per-optimization work "engineers would never get to manually". The end goal is "a self-sustaining efficiency engine where AI handles the long tail." Third framing of capacity efficiency on the wiki alongside 2024-06-16 OpsPlanner (fleet-maintenance axis) and 2024-08-05 DCPerf (hardware-benchmarking axis) — this is the AI-assisted performance engineering axis. Fifth framing of patterns/specialized-agent-decomposition on the wiki — the skill-over-shared-tools composition model. First wiki ingest of FBDetect — Meta's regression detector previously unreferenced (SOSP 2024 paper at tangchq74.github.io/FBDetect-SOSP24.pdf). Sibling to 2026-04-06 Pre-Compute Engine post: both are 2026 Meta bets on markdown-level encoded knowledge as the model-agnostic substrate — compass-shape context files (offline-preloaded / descriptive) there, skills (runtime-invoked / prescriptive) here. Meta now has three complementary operational-AI systems on the wiki: RCA 2024-08-23 (ranker), Pre-Compute Engine 2026-04-06 (offline swarm), Capacity Efficiency Platform 2026-04-16 (tools+skills). Architecture-overview voice — no LLM/model identity, skill catalogue size, merge rate, revert rate, AI-vs-human offense/defense attribution, or guardrail decomposition disclosed.
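The tools + skills split can be illustrated with a minimal sketch — a shared tool layer consumed by interchangeable skill modules, so a new capability is new skill + existing tools. All names here are hypothetical; the post discloses the architecture, not an API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Shared MCP-style tool layer: each tool is a named context-gathering function.
Tools = Dict[str, Callable[[str], str]]

@dataclass
class Skill:
    """A domain-expertise module: which tools to consult and how to resolve."""
    name: str
    tool_names: List[str]
    resolve: Callable[[Dict[str, str]], str]

def run_skill(skill: Skill, tools: Tools, target: str) -> str:
    # Phase 1: gather context with the shared tools.
    context = {t: tools[t](target) for t in skill.tool_names}
    # Phases 2-3: apply the skill's reasoning and create a resolution.
    return skill.resolve(context)

# Hypothetical tools shared by offense and defense.
tools: Tools = {
    "profiling": lambda t: f"{t}: 12% CPU in logging",
    "config_history": lambda t: f"{t}: sampling rate lowered last week",
}

# Defense skill encoding the example rule from the post:
# "regressions from logging can be mitigated by increasing sampling."
defense = Skill(
    "regression-solver",
    ["profiling", "config_history"],
    lambda ctx: "PR: restore logging sampling rate"
    if "logging" in ctx["profiling"] else "escalate",
)

print(run_skill(defense, tools, "service-a"))
```

An offense skill (e.g. memoization opportunities) would reuse the same `tools` dict with a different `resolve` — which is the compounding claim: new capabilities need "few to no new data integrations."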
- 2026-04-09 — Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases (first-party Meta Engineering; Developer Tools) — Multi-year retrospective on retiring Meta's years-divergent internal WebRTC fork across 50+ RTC use cases — Messenger video calling, Instagram calling, Cloud Gaming, Meta Quest VR casting. Two orthogonal solutions: Solution 1 — dual-stack shim layer: build two WebRTC copies (legacy + upstream) inside a single statically-linked binary gated by a runtime flavor enum, exposing one unified API to callers, so 50+ apps can A/B-test upstream releases against the legacy baseline per-call. Required resolving thousands of C++ One-Definition-Rule violations via automated namespace rewriting (webrtc:: → webrtc_latest:: / webrtc_legacy::), plus C++ using declarations for backward compatibility, AST-based code generation lifting shim velocity from 1/day → 3–4/day, and a Buck-build target-duplication trick for injected components with deep WebRTC-internal dependencies (shimming would mean "proxying WebRTC against itself"). Binary-size choice matters: duplicating the higher call-orchestration library would have cost ~38 MB uncompressed; shimming at the lowest WebRTC layer cost ~5 MB — 87% reduction — canonical wiki datum for layer-placement-as-size-decision (extends concepts/binary-size-bloat). Shim scale: > 10,000 lines of new code; hundreds of thousands modified across thousands of files, "no major issues." Solution 2 — feature branches in an external Git repo: Meta's monorepo lacks the branching surface to track one-branch-per-internal-patch-per-upstream-release; resolution is an external Git repo based directly on libwebrtc's own tree, with tag-anchored branch naming (base/7499 anchors Chromium M143 = libwebrtc tag 7499; debug-tools/7499, hw-av1-fixes/7499; merge forward to <feature>/7559; combine into release candidate r7559). Four named benefits: parallelizable, preserves Git history, LLM-friendly for future auto-conflict-resolution, submit-ready upstream. Canonical wiki instance of the new patterns/external-feature-branch-repo-for-monorepo-patches pattern and the underlying concepts/feature-branch-patch-management concept. Outcomes: launched webrtc/latest at M120, now at M145 ("living at head"); up to 10% CPU drop; up to 3% crash-rate improvement; 100–200 KB compressed binary reduction per-app from upstream's own efficiency wins; deprecated insecure libraries (usrsctp). The shim remains in production as the ongoing-upgrade A/B substrate. Future work: AI agents for build-health fixes + auto-merge-conflict resolution across feature branches — the architecture is explicitly chosen for LLM-automation friendliness.
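The dual-stack shim's runtime flavor gate can be illustrated with a toy analogy — a Python stand-in for what the C++ shim does with the two namespaced copies; all names are hypothetical:

```python
import enum
import random

class Flavor(enum.Enum):
    LEGACY = "legacy"
    LATEST = "latest"

# Stand-ins for the two statically-linked WebRTC copies
# (webrtc_legacy:: and webrtc_latest:: after namespace rewriting).
def _legacy_create_call() -> str: return "call@legacy"
def _latest_create_call() -> str: return "call@latest"

def create_call(flavor: Flavor) -> str:
    """Unified shim API: callers see one surface; the flavor enum,
    set per-call by the A/B experiment, picks the backing copy."""
    if flavor is Flavor.LATEST:
        return _latest_create_call()
    return _legacy_create_call()

# Per-call A/B assignment of an upstream release against the legacy baseline.
flavor = Flavor.LATEST if random.random() < 0.5 else Flavor.LEGACY
print(create_call(flavor))
```

The point of the structure is that the 50+ calling apps never change: only the flavor assignment moves as upstream releases graduate through the experiment.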
Ninth canonical instance of patterns/upstream-the-fix on the wiki — its dual-stack-A/B-harness variant, orthogonal to the FFmpeg case's "upstream the features, then retire the fork" shape (seventh instance). Canonical pairing with the new pattern patterns/fork-retirement-via-ab-test + patterns/shim-for-dual-stack-ab-testing. Wiki's first canonical RTC / real-time-communication infrastructure source distinct from MLow (audio codec, not library substrate). Architecture-retrospective voice — team-size not disclosed beyond "a small team of engineers"; cost / duration figures absent; per-branch conflict counts + specific script internals abstracted; Buck-duplication path named but not walked-through in detail. Third Meta OSS-posture position disclosed in 2026 (alongside FFmpeg seventh-instance consumer-side upstream and jemalloc eighth-instance stewardship-reset upstream-side) — together the three posts canonicalise Meta's full-spectrum OSS discipline on the wiki.
- 2026-04-06 — How Meta used AI to map tribal knowledge in large-scale data pipelines (first-party Meta Engineering; Developer Tools) — Meta's Data Platform team describes the AI Pre-Compute Engine that lifts AI-agent code-navigation coverage on a large config-as-code data pipeline (4 repos / 3 languages — Python + C++ + Hack / 4,100+ files / 6 synchronised subsystems per data-field change) from ~5% (5 files) to 100% (59 files).
The mechanism: a single-session orchestration of 50+ specialised AI agents — 2 explorers → 11 module analysts → 2 writers → 10+ critics across 3 rounds → 4 fixers → 8 upgraders → 3 prompt testers (55+ queries × 5 personas) → 4 gap-fillers → 3 final critics — that reads every module, answers the five questions (what does this configure / common modification patterns / non-obvious build-failure patterns / cross-module deps / tribal knowledge in comments), and emits 59 compass-not-encyclopedia context files (25-35 lines · ~1,000 tokens each · 4 mandated sections: Quick Commands / Key Files / Non-Obvious Patterns / See Also). Entire knowledge layer < 0.1% of a modern model's context window. Canonical wiki instance of the new patterns/precomputed-agent-context-files pattern — extract tribal knowledge once, offline, via a multi-agent pass, consume it many times at request time. Quantitative outcomes (preliminary on 6 tasks): ~40% fewer tool calls and tokens per task; ~2 days → ~30 min for complex workflow guidance; 3.65 → 4.20 / 5.0 critic quality across 3 rounds; zero hallucinated file paths (enforced invariant); 55+ prompts at 100% pass; 50+ non-obvious patterns documented ("none of this had been written down before") — hidden intermediate naming conventions, append-only deprecated-enum rules, configuration-mode field-name mismatches. Cross-repo dependency index + data-flow maps turn "what depends on X?" from ~6,000 tokens of exploration into a ~200-token graph lookup (30× compression). Self-maintenance loop runs "every few weeks" (canonical wiki instance of patterns/self-maintaining-context-layer and the operational answer to context-file freshness — "context that goes stale causes more harm than no context"): validate file paths, detect coverage gaps, re-run critics, auto-fix stale references. 
Meta addresses the 2025 academic research that found AI-generated context files hurt agent success on Django / matplotlib: the pretraining-overlap asymmetry inverts on proprietary codebases — compass-shape + opt-in + quality-gated is the triple that avoids the academic pitfall. Multi-round critic quality gate canonical wiki instance (patterns/multi-round-critic-quality-gate) — distinct from runtime concepts/llm-as-judge by being applied pre-release on durable artifacts with a fixer stage between rounds. Orchestration layer routes engineers by natural language ("is the pipeline healthy?" → dashboard scanner + 85+ historical incident patterns from the Meta RCA 2024-08-23 lineage; "add a new data field" → multi-phase validation). Model-agnostic ("works with most leading models because the knowledge layer is model-agnostic") — markdown files, not a proprietary embedding; investment compounds across model upgrades. Fourth framing of patterns/specialized-agent-decomposition on the wiki alongside Storex (domain-based), Dash (sub-tool), DS-STAR (role-in-refinement-loop) — this is the offline-context-generation framing with nine pipeline-stage roles. Apply-it-yourself guidance (5 steps): identify tribal-knowledge gaps → use five-questions framework → compass-not-encyclopedia (25-35 lines) → quality gates via critic agents → automate freshness. Architecture-overview voice — no fleet-wide adoption numbers, total LLM-call count, wall-clock duration of the pre-compute pass, vendor / model version, or specific critic-score acceptance threshold disclosed. Results are one pipeline; Meta names expansion to additional pipelines in Future Work. First Meta ingest whose focus is offline context engineering for AI coding agents on the wiki — sibling to Glean (structured code-index facts) + diff-sketches as Meta's full-spectrum approach to making large proprietary code machine-navigable.
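One leg of the self-maintenance loop — validating that every file path named in a context file still exists (the zero-hallucinated-paths invariant) — can be sketched as follows. This is a hypothetical illustration, not Meta's implementation:

```python
import os
import re
import tempfile

def validate_context_file(text: str, repo_root: str):
    """Find paths mentioned in a context file (here: backtick-quoted)
    and return the ones that no longer exist — i.e. stale references."""
    paths = re.findall(r"`([\w/.\-]+\.\w+)`", text)
    return [p for p in paths if not os.path.exists(os.path.join(repo_root, p))]

# Demo against a throwaway repo containing one real file.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "pipeline.py"), "w").close()
    ctx = "Key Files: `pipeline.py`, `deleted/config.yaml`"
    stale = validate_context_file(ctx, root)
```

Run "every few weeks", a check like this is what turns "context that goes stale causes more harm than no context" from a risk into a routine: stale entries get flagged for the fixer/critic stages instead of silently misleading agents.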
- 2026-03-31 — Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads (first-party Meta Engineering; ML Applications) — Meta Ads's architecture post on Meta Adaptive Ranking Model, the serving stack that scales Ads ranking to LLM-scale model complexity (O(10 GFLOPs) per token — "equivalent to the [...] compute used by top-tier LLMs") under sub-second latency and O(100 ms) bounded per-request. The frame is the inference trilemma — model complexity vs. latency vs. cost, where scaling any one axis naively breaks the other two — and Meta's resolution is three pillars: (1) request-centric inference — shift the unit of inference from (user, ad-candidate) pairs to (request); heavy user-context computed once per request via request-oriented computation sharing + in-kernel broadcast (transforms scaling "from linear to sub-linear"); long user sequences handled via request-oriented sequence scaling + centralised KV store of user logs joined with training data on the fly; plus the Wukong Turbo runtime evolution of Meta Ads's 2024 Wukong architecture adding no-bias numerical stability, small-parameter FSDP→DDP delegation, and sparsity-based linear-layer simplification; (2) model-system co-design — selective FP8 (post-training quantisation applied only to micro-benchmark-verified precision-tolerant layers, canonical selective mixed-precision instance) + operator fusion for shared inputs + Grouped GEMM + horizontal fusion consolidating "thousands of small operations" into compute-dense kernels, driven by hardware-aware model architecture — outcome: 35% MFU across multiple hardware types (canonical recsys-serving instance extending MFU beyond the existing Voyage AI embedding-inference datum); (3) multi-card sharded embedding serving — splits embedding tables exceeding single-GPU memory across an optimised hardware cluster, achieving performance parity with single-card setups and unblocking O(1T) parameter scale; combined with unified 
embeddings (multiple features share one table) and sparsity-aware + pruning allocation to manage the underlying hash-collision embedding tradeoff. Also disclosed: feature preprocessing offloaded from client CPU to remote GPU hosts with GPU-native kernels reducing Top-K from O(N log N) to O(N); accelerated model loading via multi-stream downloading + remote caching loads trillion-parameter models in under 10 minutes; auto-scaling on streaming multiprocessor utilisation. Launch: Instagram Q4 2025 — +3% ad conversions and +5% CTR for targeted users (see systems/meta-instagram). Future roadmap: ultra-low precision quantisation beyond FP8, agentic optimisation frameworks auto-adapting kernel performance, near-instantaneous model freshness via incremental in-place weight updates. Architecture-overview voice — no absolute QPS, fleet size, GPU count, inference p-tail, vendor mix (H100/B200/MI300X/MTIA undisclosed), prior-system baseline, FP8-selection benchmark metric, or cross-shard lookup latency disclosed; "+3%/+5%" is framed for targeted-user populations, not overall fleet. First LLM-scale recsys-serving ingest on the wiki — complements the existing recsys ingests (Meta Friend Bubbles recommendation architecture 2026-03-18; Meta RCA 2024-08-23 LLM retrieve-then-rank; Instacart generative recommendations) by focusing on the serving-stack / inference / GPU-systems axis rather than the model-architecture or feedback-loop axis. Second Wukong-family ingest after the 2024 Wukong paper (arXiv:2403.02545, not ingested); extends the Meta GPU-serving corpus (systems/grand-teton-era 2024 training posts) from training into LLM-scale inference serving.
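The request-centric shift — moving the unit of inference from (user, ad-candidate) pairs to the request — can be illustrated by counting user-tower invocations. A toy sketch with hypothetical names and shapes; the real system does this inside fused GPU kernels via in-kernel broadcast:

```python
def score_pairwise(user_features, candidates, encode_user, score):
    # Naive (user, ad) unit of inference: heavy user tower recomputed
    # once per candidate — linear in candidate count.
    return [score(encode_user(user_features), c) for c in candidates]

def score_request_centric(user_features, candidates, encode_user, score):
    # Request as the unit of inference: user context computed once,
    # then broadcast across every candidate in the request.
    u = encode_user(user_features)
    return [score(u, c) for c in candidates]

calls = {"n": 0}
def encode_user(feats):
    calls["n"] += 1          # stands in for the expensive user tower
    return sum(feats)
score = lambda u, c: u * c   # stands in for the interaction head

user = [1.0, 2.0, 3.0]
ads = [0.1, 0.2, 0.3, 0.4]
a = score_pairwise(user, ads, encode_user, score)
pairwise_calls = calls["n"]           # one user-tower call per candidate
calls["n"] = 0
b = score_request_centric(user, ads, encode_user, score)
request_calls = calls["n"]            # one user-tower call per request
```

Scores are identical; only the heavy-computation count changes — the "linear to sub-linear" scaling claim in miniature.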
- 2026-03-18 — Friend Bubbles: Enhancing Social Discovery on Facebook Reels (first-party Meta Engineering; ML Applications) — Meta's architecture overview of the Friend Bubbles recommendation-system component in Facebook Reels — the avatar bubbles that show which of a viewer's friends interacted with a Reel, with a tap opening a one-on-one conversation. Three architectural layers: (1) viewer-friend closeness — two complementary ML models, a survey-trained closeness model running weekly inference over "trillions of person-to-person connections across Facebook friends" (canonical wiki instance of patterns/survey-trained-closeness-model; binary close-vs-not-close prediction asked to randomly-sampled users directly, proxy questions like "how often two people communicate" as additional signal; features: mutual friends, connection strength, interaction patterns, user-provided location, friend count, posts shared) plus a context-specific closeness model trained on on-platform bubble-click signals; (2) retrieval → ranking funnel modifications — friend-interacted candidates are explicitly retrieved ("expand the top of the funnel to ensure sufficient candidate volume for downstream ranking stages" — canonical wiki statement of retriever-recall-is-the-ceiling in a recommendation system) and new features + new tasks are added to early-stage and late-stage MTML ranking models, with the objective augmented by a conditional-probability term
P(video engagement | bubble impression) balanced by tunable weights against existing watch/like/comment objectives. A continuous feedback loop re-trains on bubble-interaction data; (3) client-side integration with Reels performance constraints — bubble metadata retrieval pinned to the existing video prefetch window (canonical wiki instance of prefetch-window metadata co-attending: bubble data arrives alongside video content, reusing the already optimised fetch path, cache, and CPU budget — "eliminating mid-playback UI updates and redraws"), animation conditional on interaction state + device class (disabled during active scroll, disabled entirely on low-end devices — canonical wiki instance of patterns/conditional-animation-for-scroll-performance, extending Meta's low-end-device-inclusion posture from the MLow audio codec to the client-UI-rendering axis), and conservative prevalence gating ("bubbles show up only when the relationship signal ... is strong" — prevalence is not the optimisation target; engagement quality is). Qualitative outcomes: "higher interest scores and more positive sentiment ratings", "more time actively watching", "growth concentrated in longer sessions"; expressive reactions (love/laughter) drive stronger downstream engagement than likes; engagement scales with bubble count per video. Future work named: expansion to additional surfaces + inventory, cold-start improvements for sparse friend graphs, refined ranking + feedback signals. First recommendation-system ingest from Meta on the wiki — the 2024-08-23 Meta RCA post introduced the patterns/retrieve-then-rank-llm pattern in an RCA context; Friend Bubbles is the recommendation-system canonical instance of the same funnel primitive in an MTML (not LLM) stage-2 ranker. Architecture-preview voice — no fleet size, QPS, latency, A/B lift, MTML topology, feature-vector dimension, prefetch-window duration, or low-end-device threshold disclosed.
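The shape of the augmented ranking objective — existing engagement terms plus a weighted conditional-probability bubble term — might look like the sketch below. Weights, probabilities, and the linear combination are illustrative only; the post names the P(video engagement | bubble impression) term and tunable weights but discloses no formula:

```python
def rank_score(p_watch, p_like, p_bubble_impression, p_engage_given_bubble,
               w_watch=1.0, w_like=0.5, w_bubble=0.3):
    """Hypothetical MTML-style combined objective: existing watch/like
    terms plus the bubble term P(engagement | bubble impression),
    balanced by tunable weights."""
    # P(bubble impression) * P(engagement | bubble impression)
    # = P(engagement AND bubble impression), the bubble-attributable part.
    bubble_term = p_bubble_impression * p_engage_given_bubble
    return w_watch * p_watch + w_like * p_like + w_bubble * bubble_term

s = rank_score(0.8, 0.2, 0.5, 0.6)
```

Raising `w_bubble` trades generic engagement for friend-mediated engagement — which is exactly the tuning surface the conservative prevalence gating operates on.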
Closeness-model composition (survey + context-specific) is named but not specified. Cold-start acknowledged as open problem.
- 2026-03-02 — Investing in Infrastructure: Meta's Renewed Commitment to jemalloc (first-party Meta Engineering; Data Infrastructure; 515 HN points) — Meta's short stewardship-reset statement for jemalloc, the high-performance memory allocator it maintains upstream. Two substantive disclosures: (1) Meta acknowledges that "in recent years, there has been a gradual shift away from the core engineering principles that have long guided jemalloc's development" and the resulting technical debt slowed progress; (2) Meta has met with the community including founder Jason Evans, unarchived the original GitHub repository (archived in 2024 — the visible signal of the drift), and committed to a four-axis roadmap: technical debt reduction, HPA improvements for transparent huge pages (THP) CPU efficiency, memory efficiency (packing / caching / purging), and AArch64 (ARM64) out-of-box performance. Framing: foundational software components "alongside the Linux kernel and the compilers" need "the highest rigor" and "strong self-discipline as an organization to resist [short-term-benefit] temptation and adhere to the core engineering principles." Canonical wiki instance of the new patterns/stewardship-reset-for-foundational-oss pattern — the upstream-steward-itself sibling of patterns/upstream-the-fix (which is the downstream-consumer case). The consumer pattern works only when the steward is functioning; this pattern is what the steward does when it recognises it has drifted. Announcement voice — no shipping dates, no scope-of-refactor detail, no perf baselines or targets disclosed; reset is evaluated over years of shipped-roadmap work, not at announcement time. "We know that trust is earned through action."
Cross-source framing: promotes the existing systems/jemalloc stub (previously only seen via the 2025-03-07 Strobelight memory-profiler-backend usage) to a first-class Meta foundational-software page. Extends patterns/upstream-the-fix with its eighth canonical instance as the upstream-steward-itself variant.
- 2026-03-09 — FFmpeg at Meta: Media Processing at Scale (first-party Meta Engineering; Video Engineering; 281 HN points) — Meta's Video Engineering team describes how Meta fully deprecated its internal FFmpeg fork for all DASH VOD + livestreaming pipelines by co-developing two load-bearing features upstream with FFlabs and VideoLAN over multiple releases: (1) threaded multi-lane transcoding — "the most complex refactoring of FFmpeg in decades" spanning FFmpeg 6.0 → 8.0 — that produces multiple DASH encodings from a single decode with per-frame encoder parallelism (see patterns/deduplicate-decode-across-encoder-lanes); (2) in-loop decoding (FFmpeg 7.0+) inserting a decoder after each encoder so reference quality metrics (PSNR/SSIM/VMAF — concepts/visual-quality-metric) can be computed in real time during livestreams (see concepts/in-loop-quality-metrics + patterns/in-loop-decoder-for-realtime-quality-metrics). Scale frame: Meta runs
ffmpeg + ffprobe tens of billions of times per day; > 1 billion video uploads per day, each requiring multiple FFmpeg executions — so per-process efficiency wins compound to fleet-level savings, and carrying an internal fork is a long-term liability worth spending years upstream to remove. The opposite decision in the same post: Meta keeps its MSVP (Meta Scalable Video Processor) ASIC FFmpeg patches internal because MSVP hardware is Meta-only and external FFmpeg developers cannot validate changes without it; Meta accepts the reverse-rebase cost against each new upstream release. This introduces the complement pattern patterns/keep-infrastructure-specific-patches-internal to the wiki alongside patterns/upstream-the-fix — together they form the decision framework the post makes explicit. MSVP integrates into FFmpeg via the same hardware-accelerated video codec API that exposes NVIDIA NVDEC/NVENC, AMD UVD, and Intel Quick Sync Video. Seventh canonical instance of patterns/upstream-the-fix on the wiki — and the highest-stakes outcome to date: not a single targeted PR but a multi-year / multi-release collaboration culminating in fork retirement. First video-transcoding-infrastructure source on the wiki — opens a new technical domain distinct from the prior MLow audio codec (RTC audio) and the storage/GenAI/privacy/source-control corpus. No fleet CPU-savings numbers, codec mix (H.264/H.265/AV1), or ladder depth disclosed; MSVP's own architecture is linked to the separate 2023 MSVP post (not ingested).
- 2026-01-28 — Rust at Scale: An Added Layer of Security for WhatsApp (first-party Meta Engineering; Security; 266 HN points) — WhatsApp security-team doctrine post disclosing the Rust rewrite of wamedia — the cross-platform media-consistency library that processes untrusted media automatically on download — and describing the four-family format-check ensemble Kaleidoscope that runs on top of it.
Headline datum: "We replaced 160,000 lines of C++ (excluding tests) with 90,000 lines of Rust (including tests). The Rust version showed performance and runtime memory usage advantages over the C++." Scale claim: "the largest ever deployment of Rust code to a diverse set of end-user platforms and products that we are aware of" — billions of devices, WhatsApp + Messenger + Instagram, shipping monthly across Android / iOS / Mac / Web / Wearables. Forcing function (2015 Stagefright): "The bug lay in the processing of media files by operating system-provided libraries, so WhatsApp and other applications could not patch the underlying vulnerability" — canonical concepts/os-library-vulnerability-ungovernable + concepts/patch-lag instance. Meta's response was the format-aware malware check before OS handoff pattern — modify wamedia to detect non-conformant MP4s + refuse to forward. The 2026 Rust rewrite is the follow-on investment to ensure the checker itself is memory-safe — "because media checks run automatically on download and process untrusted inputs, we identified early on that wamedia was a prime candidate for using a memory safe language." Canonical patterns/memory-safe-language-for-untrusted-input instance. Rewrite methodology: parallel rewrite, not incremental migration — "we developed the Rust version of wamedia in parallel with the original C++ version. We used differential fuzzing and extensive integration and unit tests to ensure compatibility between the two implementations." Canonical patterns/parallel-rewrite-with-differential-testing + concepts/differential-fuzzing instance. Kaleidoscope's four check families: (1) non-conformant-structure checks to defeat parser-differential exploits on downstream OS libraries; (2) risk-indicator checks inside higher-risk types (PDFs: embedded files + scripting); (3) file-type spoofing detection; (4) dangerous-type uniform flagging (executables/apps).
Meta's three-pillar strategy (verbatim): (1) design the product to minimize unnecessary attack surface; (2) invest in security assurance for remaining C/C++ code; (3) "default the choice of memory safe languages, and not C and C++, for new code." Canonical concepts/attack-surface-minimization + concepts/defense-in-depth instance on the client-side / media-processing axis. Two disclosed hurdles: initial binary-size increase from the Rust stdlib (concepts/binary-size-bloat); build-system support for diverse platforms (concepts/cross-platform-client-library tax) — Meta calls this "a long-term bet to build that support." Meta's adjacent security posture: default E2EE for 3B+ daily users, E2E-encrypted backups, key transparency, calling protections, published CVEs even without evidence of exploitation, external audits (NCC Group's public assessment), fuzzing, static analysis, supply-chain management, automated attack-surface analysis, and the new WhatsApp Research Proxy — Meta's bug-bounty research-proxy primitive introduced via the 15th-anniversary Bug Bounty expansion. C/C++ remaining-code hardening named: CFI, hardened memory allocators, safer buffer APIs, specialised security training, automated security analysis, strict SLAs. Positioned within Meta's Rust adoption: "Security teams at WhatsApp and Meta are highlighting opportunities for high impact adoption of Rust to interested teams, and we anticipate accelerating adoption of Rust over the coming years." Thirteenth Meta first-party ingest and first canonical client-side Rust-at-scale source on the wiki, complementing the server-side Rust corpus (Aurora DSQL, Dropbox Nucleus, Cloudflare FL2).
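The parallel-rewrite validation loop can be sketched as differential fuzzing over two implementations: feed both the same random inputs and fail on the first divergence. The parsers below are toy stand-ins; the real harness compares the C++ and Rust wamedia outputs on untrusted media:

```python
import random

def parse_cpp(data: bytes) -> int:
    """Stand-in for the legacy C++ implementation's observable output."""
    return sum(data) % 256

def parse_rust(data: bytes) -> int:
    """Stand-in for the Rust rewrite under test."""
    return sum(data) % 256

def differential_fuzz(iters: int = 1000, seed: int = 7):
    """Return a reproducing input if the implementations ever disagree,
    else None — agreement over the whole run is the compatibility signal."""
    rng = random.Random(seed)
    for _ in range(iters):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        if parse_cpp(blob) != parse_rust(blob):
            return blob
    return None

mismatch = differential_fuzz()
```

The technique needs no oracle for what the *correct* answer is — only that old and new agree — which is what makes it practical for a 160k-line behavioural surface.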
- 2025-10-06 — Introducing OpenZL: An Open Source Format-Aware Compression Framework (first-party Meta Engineering, 434 HN points) — Meta open-sources OpenZL, a new lossless format-aware compression framework that targets structured data (tabular, columnar, numeric arrays, timeseries, ML tensors, database pages) and claims ratios "comparable to specialized compressors" while preserving a single universal decoder binary across every file it produces. Architectural response to the decade of Zstandard experience (2016-2025): generic compressors leave ratio on the table for structured data, but hand-rolled per-format compressors multiply the surface to ship + audit + patch + trust. OpenZL's resolution: push format-awareness into an input parameter + learned Plan that resolves to a decode recipe embedded in the frame. Headline numbers on Silesia
sao (M1/clang-17): zstd -3 = 5.5 MB / x1.31 / 220 MB/s comp / 850 MB/s decomp; xz -9 = 4.4 MB / x1.64 / 3.5 MB/s comp / 45 MB/s decomp; OpenZL = 3.5 MB / x2.06 / 340 MB/s comp / 1200 MB/s decomp — both higher-ratio than xz and faster in both directions. Six load-bearing architectural primitives: (1) structure as explicit input via SDDL (declarative) or a registered parser function, rather than guessed by the compressor; (2) trained Plan emitted by the OpenZL trainer from a budgeted search over transform choices + parameters, with an internal cluster finder (groups like-behaving fields) + graph explorer (scores candidate subgraphs); can emit a speed/ratio Pareto-set or target-under-constraint; (3) reversible transform sequence before entropy coding — the sao walkthrough: split header → AoS → SoA → per-field transform pick (delta for mostly-sorted X-axis SRA0, transpose for bounded Y-axis SDEC0, tokenize for low-cardinality IS/MAG/XRPM/XDPM fields, each routed to its own subgraph); "The main work is to group data into homogeneous streams.
After that, one can count on openzl to take care of the rest."; (4) per-frame resolution into a Resolved Graph recorded in the frame, enabling the universal decoder property — "even when the compression configuration changes, the decoder does not" — whose four enumerated benefits are single audit surface, fleet-wide improvements (SIMD / bounds / scheduling benefit every historic frame), operational clarity (same CLI + metrics + dashboards), and continuous training; (5) runtime control points — per-frame branch points reading lightweight statistics (string repetition, run-length, histogram skew, delta variance) and picking a subgraph "with zero complexity added to the decoder" because the chosen branch is recorded; (6) Managed Compression as the operational runtime — Meta's existing service that automated zstd-dictionary compression (2018) extends to OpenZL Plans: register use case → monitor → sample → re-train → roll out "like any other config change". Fallback safety net: when no structure is available (pure text — enwik, dickens) OpenZL falls back to zstd and delivers essentially zstd performance; CSV is a structural ceiling (~64 MB/s parse cap vs zstd's ~1 GB/s). Four Pareto-curve dataset categories evaluated against generic compressors: Silesia sao (AoS), ERA5 Flux (columnar 64-bit numeric), Binance + TLC Green Trip (uncompressed Parquet — OpenZL parses Parquet + learns schema), PPMF Unit (CSV — parse-bound). Open-source at github.com/facebook/openzl · whitepaper arXiv:2510.03203. First compression-framework source on the wiki — prior compression coverage was Meta MLow audio codec, MongoDB WiredTiger page compression, Cloudflare shared-dictionary HTTP compression (all specialized); OpenZL is the first general framework for format-aware compression to land on the wiki.
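The transform pipeline in (3) and the universal-decoder property in (4) can be sketched in miniature. A hedged Python toy — not OpenZL's actual API or wire format; the field names, the JSON container, and zlib as the entropy stage are all stand-ins — showing the core idea: split an array-of-structs into per-field homogeneous streams, pick a per-field transform, and record the plan inside the frame so one generic decoder can invert any plan:

```python
import json
import zlib

def delta_encode(xs):
    # Mostly-sorted numeric fields compress far better as small deltas.
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    out = [ds[0]]
    for d in ds[1:]:
        out.append(out[-1] + d)
    return out

def compress_records(records):
    # AoS -> SoA: each field becomes its own homogeneous stream.
    xs = [r["x"] for r in records]        # mostly sorted  -> delta
    mags = [r["mag"] for r in records]    # low cardinality -> tokenize
    alphabet = sorted(set(mags))
    payload = {
        "plan": {"x": "delta", "mag": "tokenize"},  # plan travels in the frame
        "alphabet": alphabet,
        "x": delta_encode(xs),
        "mag": [alphabet.index(m) for m in mags],
    }
    return zlib.compress(json.dumps(payload).encode())

def decompress_records(frame):
    # Universal decoder: reads the recorded plan, inverts each transform.
    p = json.loads(zlib.decompress(frame))
    xs = delta_decode(p["x"])
    mags = [p["alphabet"][i] for i in p["mag"]]
    return [{"x": x, "mag": m} for x, m in zip(xs, mags)]

records = [{"x": 100 + i, "mag": "AB"[i % 2]} for i in range(1000)]
frame = compress_records(records)
assert decompress_records(frame) == records
```

On this shape of data the structured frame comes out much smaller than naively zlib-compressing the JSON records, because each stream is homogeneous; the decoder never needs to know how the plan was chosen, only how to invert the transforms it names.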
- 2025-03-04 — A case for QLC SSDs in the data center (first-party Meta Engineering; Data Center Engineering) — Meta makes the architectural case for QLC NAND flash as a new middle storage tier between HDD and TLC flash — the first first-party disclosure of Meta's flash-media strategy on the wiki. Substance across four axes: (1) HDD BW/TB is dropping as density climbs without IOPS improvements — "bandwidth per TB for HDDs has been dropping" — stranding hot workloads on the cold tier. Canonical BW/TB framing on the wiki, extending concepts/hard-drive-physics's flat-IOPS observation to the bandwidth axis. (2) QLC's historical blockers closed — 2 Tb NAND dies + 32-die stacks mainstream; endurance matched via workload-matching (read-BW-intensive, low-write-BW targets); 6× density target over densest TLC server. Canonical concepts/storage-media-tiering + patterns/middle-tier-storage-media instance. (3) Form-factor argument: U.2-15mm wins (scales to 512 TB; accepts both standard NVMe QLC SSDs and Pure Storage DFMs at 600 TB); E1.S rejected (too small for QLC NAND-package count); E3 rejected (4-variant fragmentation). (4) Storage-software adaptation: canonical ublk + io_uring + userspace FTL stack for DFMs; standard NVMe path uses io_uring against NVMe block device directly. The 4×+ read-vs-write throughput asymmetry of QLC forces rate controllers + I/O schedulers so latency-sensitive reads don't queue behind writes. Co-design extension: Meta × Pure Storage is a new partner in the OCP co-design lineage (joining Microsoft, NVIDIA, AMD). Honest caveat: Meta states QLC "is not yet price competitive enough for a broader deployment" — the deployment is justified today by power efficiency + density, not TCO parity.
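The read-vs-write asymmetry point can be made concrete. A toy Python sketch — not Meta's scheduler; the class name, the per-tick write budget, and the two-level priority scheme are invented for illustration — of a rate-controlled dispatcher that keeps latency-sensitive reads from queuing behind slow QLC writes:

```python
import heapq

READ, WRITE = 0, 1  # lower value = higher dispatch priority

class AsymmetricScheduler:
    """Toy scheduler for a QLC-like device where writes are ~4x slower
    than reads: pending reads always dispatch ahead of queued writes, and
    writes are budgeted per tick so they cannot monopolize the device."""
    def __init__(self, write_budget_per_tick=1):
        self._q = []
        self._seq = 0  # FIFO tiebreak within a priority class
        self.write_budget = write_budget_per_tick

    def submit(self, op, name):
        heapq.heappush(self._q, (op, self._seq, name))
        self._seq += 1

    def dispatch_tick(self):
        """Dispatch all pending reads, then at most write_budget writes."""
        done, writes = [], 0
        while self._q:
            op, _, name = self._q[0]
            if op == WRITE and writes >= self.write_budget:
                break
            heapq.heappop(self._q)
            done.append(name)
            writes += op == WRITE
        return done

sched = AsymmetricScheduler(write_budget_per_tick=1)
for op, name in [(WRITE, "w1"), (READ, "r1"), (WRITE, "w2"), (READ, "r2")]:
    sched.submit(op, name)
```

With this ordering, both reads drain first even though a write arrived earliest, and the second write waits for the next tick — the shape of behavior the post says rate controllers and I/O schedulers must enforce.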
- 2025-04-30 — Building Private Processing for AI tools on WhatsApp (first-party Meta Engineering; Security) — Meta previews Private Processing, the confidential-computing infrastructure WhatsApp will use to run AI features (message summarisation + writing suggestions) over end-to-end-encrypted conversations without Meta, WhatsApp, or any third party seeing the plaintext. Architecture preview — launch projected "in the coming weeks" from 2025-04-30; detailed security engineering design paper promised at launch. The core architectural move: build LLM inference inside a TEE (CVM + Confidential-Compute-mode GPUs) whose binary digest is attested against a third-party-operated transparency ledger before the client releases its ephemeral session key — turning "we promise we run safe code" into a mechanism the client can verify. Canonical wiki instance of TEE-for-private-AI-inference — a structural alternative to both on-device inference (too constrained) and normal server-side inference (breaks E2EE). Five foundational requirements stacked: (1) confidential processing (no system outside Private Processing sees data in transit or in use); (2) enforceable guarantees (tamper attempts fail closed or become publicly discoverable); (3) verifiable transparency (published CVM binary + ledger + open-source + expanded Bug Bounty); (4) non-targetability (attacker cannot target a specific user without attacking the whole system); (5) stateless + forward-secure (no retained access after session). 
Wire-session chain (phases 1-6): anonymous credentials via Meta's Anonymous Credential Service (ACS) (open-sourced December 2022, now load-bearing) → HPKE keys from a third-party CDN → OHTTP through a third-party relay (strips client IP before Meta's gateway sees the request — canonical wiki instance of patterns/third-party-ohttp-relay-for-unlinkability) → RA-TLS session with attestation-verified-against-ledger before session-key release (canonical wiki instance of patterns/attestation-before-session-key-release) → ephemeral E2EE device↔CVM request → response returned under the same ephemeral key. CVM-to-CVM traffic reuses the same RA-TLS primitive — inter-CVM trust boundary is also attested. Threat modeling is the structural spine of the post: three named assets (messages + CVM TCB + keys), three actor classes (insiders, third-party / supply-chain vendors, malicious end users), three named scenarios (product-surface exploitation including prompt injection; CVM observability-side-channel extraction; insider physical/remote tampering at boot or runtime). Named operational controls: containerised hardened binaries inside the CVM + log-filtering egress (observability vs confidentiality tension explicitly named) + restricted-environment multi-party-reviewed CVM build + encrypted DRAM + physical datacentre controls + enhanced host monitoring outside the CVM + remote shell access prohibited. Canonical wiki instance of publish-binary-digest-ledger: the transparency-side companion to attestation — every acceptable CVM binary digest on an append-only third-party ledger plus the CVM image binary published, so external researchers can detect "attested X was never on the published list". Data minimisation operationalised at the request-API layer: the summarisation call carries only the messages the user directed AI to summarise — less content in means less blast-radius for any bug anywhere in the stack. 
User-control composition: Private Processing is opt-in at request granularity + users can disable AI features per-chat via the separate Advanced Chat Privacy feature. Canonical wiki instance of defence-in-depth for private-AI-inference on top of a TEE substrate — each requirement closes a distinct adversary move (runtime vs boot-time vs targeted-host vs post-session), each with different trust roots (CPU vendor / third-party ledger operator / third-party relay operator / external researchers) so compromise of any single party does not break the guarantee. Caveats: architecture-preview voice — no production numbers (latency, throughput, fleet size); TEE vendor not named (AMD SEV-SNP / Intel TDX / Arm CCA all plausible); confidential-GPU vendor/mode not named (NVIDIA Hopper CC is the obvious candidate); third-party relay + CDN + ledger operators not named; attestation protocol details not specified beyond "RA-TLS"; prompt-injection-specific defences flagged but not detailed; open-source scope gestured but not manifest.
- 2024-12-19 — Indexing code at scale with Glean (first-party Meta Engineering, 132 HN points) — the canonical architecture overview of Glean, Meta's open-source code-indexing system (open-sourced August 2021). Canonical wiki instance of centralized ahead-of-time indexing — shared fleet indexes the monorepo, databases replicated across a widely-distributed query service, clients ask questions over the network instead of downloading the index.
Four load-bearing architectural bets: (1) generality over use-case fit — Glean "doesn't decide for you what data you can store", each language owns its schema, non-code data is supported; RocksDB-backed; paid off with post-launch extensions to dead-code detection, build-graph analysis, API-migration tracking, test coverage + selection, automated data removal, and RAG in AI coding assistants; (2) Angle as declarative logic-based query language — predicates ≈ SQL tables, facts ≈ rows, derived predicates = SQL-view analogue — the mechanism that implements language-neutral schema abstraction (keep language-specific facts underneath, define cross-language views in the schema itself); published latencies ~1 ms for name+namespace lookup, few-ms first-results on inheritance-chain queries; (3) incremental indexing targeting O(changes), realistic floor O(fanout) — the C++ header-modify case names fanout as the transitive
#include-ers closure, computed as a fixpoint Angle query; implemented via stacked immutable databases — non-destructive layered adds/hides, single-view semantics per revision, multi-revision concurrent queries, delta-sized storage; (4) diff sketches — indexer runs on each diff to produce a machine-readable change summary (classes/methods/fields/calls added/removed/modified); fans out to static analysis, semantic lint, rich notifications, commit-level semantic search (stack-trace → recent-touching-commits), and review-time go-to-definition in Phabricator across C++/Python/PHP/JavaScript/Rust/Erlang/Thrift/Haskell — canonical wiki instance of diff-based static analysis. Companion subsystem Glass (open-source) = the symbol server on top of Glean; one API call documentSymbols(repo, path, revision) renders outline + nav for the code browser (embedded Monaco); owns per-language symbol IDs that stay stable under code motion, so doc URLs survive refactors. IDE augmentation hybrid — Meta's VS Code C++ extension serves go-to-def / find-references / hovercards from Glean at startup, blends with clangd as files load (C++ was the launch target "due to the long compile times"). Contrasted against LSIF — Glean is deliberately more general than the LSP-ecosystem incumbent. No fleet-size, QPS, or index-size numbers disclosed — scale claims qualitative; latency datapoints illustrative not load-tested; stacked-database deep-dive deferred to glean.software/blog/incremental/ (not ingested).
- 2024-12-02 — How Meta built large-scale cryptographic monitoring (first-party Meta Engineering; CryptoEng team) — Meta's telemetry architecture underneath FBCrypto, Meta's managed cryptographic library. Logs every cryptographic operation fleet-wide — no sampling — so Meta can (a) detect key overuse + rotate proactively (concepts/key-overuse-detection), (b) inventory call-sites for deprecated-primitive + PQC migration scoping, (c) use call-volume / success-rate as a client-health proxy during rollouts.
Scale datum: "roughly 0.05 % of CPU cycles at Meta are spent on X25519 key exchange" — forces the architecture. Core primitive: an aggregating buffered logger inside FBCrypto on
folly::ConcurrentHashMap — increment a (key-name, method, algorithm, …) tuple's counter on every operation, periodic background flush through Scribe to Scuba (warm) + Hive (cold). Three disclosed optimisations: (1) per-host first-flush jitter smooths cohort-synchronised write spikes to Scribe from hosts that restart together; (2) derived-key aggregation counts KDF-derived child-key operations against the parent keyset to bound cardinality for features that mint millions of keys (pessimistic for overuse-detection alarms — safe direction); (3) folly::Singleton-backed synchronous shutdown flush drains the in-memory buffer on job exit despite shutdown-environment constraints. Canonical wiki instance of unified-library-for-fleet-telemetry + the unified-library leverage strategic posture: FBCrypto + Scribe being monoculture means instrumenting once inside FBCrypto yields fleet-wide observability for free. Canonical wiki instance of concepts/cryptographic-monitoring + concepts/telemetry-buffer-and-flush. Extends concepts/post-quantum-cryptography with the fleet-inventory-producer framing — knowing where classical asymmetric primitives are in use is the migration-scoping prerequisite the rollout-shape posts (GitHub / Cloudflare) depend on.
- 2024-10-15 — Meta's open AI hardware vision (first-party Meta Engineering, timed to OCP Global Summit 2024) — Meta contributes its next-generation AI-hardware stack to the Open Compute Project and projects forward.
Four headline disclosures: (1) Catalina — a new 140 kW liquid-cooled AI rack on the ORv3 HPR chassis, hosting NVIDIA GB200 Grace Blackwell; modular, flexible, OCP-contributed; (2) Grand Teton extended to AMD Instinct MI300X — Meta's monolithic AI platform gains a second accelerator vendor, positioned for "large-scale AI inference workloads"; (3) Disaggregated Scheduled Fabric (DSF) — Meta's open vendor-agnostic AI backend, built on OCP-SAI + FBOSS + Ethernet-RoCE, with multi-vendor endpoint support (NVIDIA + Broadcom + AMD); companion silicon: new 51T fabric switches on Broadcom + Cisco ASICs and FBNIC, Meta's first in-house network ASIC; (4) Mount Diablo — Meta × Microsoft disaggregated 400 VDC power rack, continuing the long-standing Meta × Microsoft OCP co-design lineage (SAI 2018 → OAM → Mount Diablo 2024). Forward projection: ~1 TB/s per-accelerator injection bandwidth and matching bisection bandwidth — "more than an order of magnitude" growth over today's networks. Positions H100-based Grand Teton (2024-06 two-24K-GPU-cluster training substrate) as the predecessor generation; Catalina is the successor platform. Canonical wiki instance of open-hardware-for-AI-scaling thesis + modular-rack-for-multi-accelerator + co-design-with-OCP-partners patterns. Llama 3.1 405B training disclosed at > 16,000 H100 GPUs.
- 2024-09-10 — Sapling: Source control that's user-friendly and scalable (first-party Meta Engineering; post dated 2022-11-15, fetched into the wiki's raw corpus 2024-09-10) — Meta open-sources the Sapling client after "10 years" of internal development. Positioned as two orthogonal design axes (usability + scale) combined in one product. Scale claim: Sapling serves Meta's monorepo at "tens of millions of files, tens of millions of commits, and tens of millions of branches", a regime "public source control systems were not, and still are not, capable of handling." Mercurial lineage, not Git fork — started as Mercurial extension, diverged into its own storage / wire / algorithms; Git interop at the client-surface layer only. Key scale primitives: Segmented Changelog (megabyte-scale commit-graph index + O(log n) bisection for
log/blame, "even in Git repositories"); lazy history download (clone ≈ free, fetch on demand); virtual file system (not yet open-sourced); sparse checkout + organization-owned sparse profiles checked into the repo (thousands of engineers on shifting subsets with zero per-engineer config); Watchman for sl status scan-avoidance. Key UX primitives: smartlog (default view, sl invokes it); undo subsystem (sl undo -i interactive scroller, sl hide/unhide, "never again should you have to delete your repository and clone again"); first-class commit-stack workflow built on mutation history tracking (inspired by Mercurial Evolve) — sl restack, sl absorb, sl amend --to COMMIT, sl fold, sl split. Companion system: ReviewStack — demo stack-oriented code-review UI for GitHub pull requests at reviewstack.dev. Open-source scope: client only; the Sapling-compatible server (Rust), VFS, Commit Cloud, and per-file history graphs are "we hope to open-source" — no commitment. Canonical wiki counterpoint to Dropbox's 87 GB GHEC monorepo (where tuned server-side Git repack is still viable) — at Meta's scale, a dedicated VCS is the only path.
- 2024-08-31 — How Meta enforces purpose limitation via Privacy Aware Infrastructure at scale (first-party Meta Engineering; PEPR 2024 companion) — Meta's technical explainer of Privacy Aware Infrastructure (PAI) — the initiative embedding first-class privacy constructs into Meta's software stack — and its anchor technology Policy Zones, an information flow control (IFC)-based runtime enforcement of purpose limitation integrated across HHVM (web/middle/backend) + Presto + Spark.
Canonical wiki canonicalisation of IFC-at-hyperscale: rejects point-checking + ACLs + data lineage combo as insufficient for scaling across "dozens of our systems" and "millions of data assets"; references Denning's 1976 lattice model as the theoretical foundation; annotations on assets at table/column/row/cell granularity or parameter/variable/return-value granularity; logging → enforcement rollout via Policy Zone Manager (PZM) four-step workflow (identify assets → discover flows → remediate violations → continuously enforce/monitor); 10× improvements in computational efficiency disclosed through lattice-representation simplification + language-level context-propagation features in Hack / C++ / Python. Five named adoption lessons including focus on one end-to-end use case first and separate annotation from requirement. Meta's ML-based data classifier named as Step-1 auto-discovery input to PZM. Canonical runtime information flow enforcement + logging-mode-to-enforcement-mode rollout pattern instances.
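The IFC mechanics behind Policy Zones can be sketched minimally. A hedged Python toy — not Meta's implementation; `Labeled`, `combine`, and the sink table are invented names — showing the two load-bearing moves the post describes: labels propagate through derivations via a lattice join (here, set union), and a sink check can run in logging mode before being flipped to enforcement:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Labeled:
    """A value tagged with the purpose-restriction labels it carries."""
    value: object
    labels: frozenset = frozenset()

def combine(f, *args):
    # Derived data joins (unions) the labels of all inputs -- the lattice move.
    out = f(*(a.value for a in args))
    labels = frozenset().union(*(a.labels for a in args))
    return Labeled(out, labels)

# Which labels each sink is permitted to receive (empty = nothing restricted).
ALLOWED_AT_SINK = {"ads-pipeline": frozenset()}

def write_to_sink(sink, item, enforce=True):
    violation = not item.labels <= ALLOWED_AT_SINK[sink]
    if violation and enforce:
        raise PermissionError(f"{sink}: restricted labels {set(item.labels)}")
    return violation  # logging mode reports the flow instead of blocking it

msgs = Labeled(["hi"], labels=frozenset({"restricted:messaging"}))
count = combine(len, msgs)  # even an aggregate inherits the label
assert write_to_sink("ads-pipeline", count, enforce=False)  # logged, not blocked
```

The `enforce` flag mirrors the logging-mode-to-enforcement-mode rollout: run with `enforce=False` while discovering flows and remediating violations, then flip to `enforce=True` for continuous enforcement.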
- 2024-08-23 — Leveraging AI for efficient incident response (first-party Meta Engineering) — Meta's AI-assisted root-cause analysis (RCA) system for web monorepo incident investigations. Two-stage retrieve-then-rank-LLM: a heuristic retriever (code + directory ownership + runtime code graph) narrows "thousands of changes to a few hundred"; a fine-tuned Llama 2 (7B) ranks via ranking-via-election (B=20, K=5) to produce a top-5 list. 42% top-5 accuracy at investigation-creation time on backtested historical investigations. Pipeline: CPT on internal wikis/Q&As/code → mixed SFT with a dedicated RCA-SFT dataset of ~5,000 instruction-tuning examples (2-20 candidates each) → a second SFT round producing logprob-rankable ordered lists. Safety discipline named explicitly: closed feedback loops + explainability + confidence thresholding ("sacrificing reach in favor of precision"). Positioned as successor to Hawkeye (December 2023, ML-workflow debugging); future work: autonomous full-workflow execution + pre-push incident prediction.
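The ranking-via-election step (B=20, K=5) can be sketched as follows — a hedged Python toy in which a noisy stub stands in for the fine-tuned Llama 2 ranker, and a Borda-style tally is one plausible reading of the aggregation (the post does not specify the exact election rule):

```python
import random
from collections import Counter

def rank_via_election(candidates, ballot_fn, B=20, K=5):
    """Sample the (stochastic) ranker B times for an independent top-K
    ballot each, then tally Borda-style: rank 0 on a ballot earns K
    points, rank K-1 earns 1. Return the aggregated top-K."""
    scores = Counter()
    for _ in range(B):
        for rank, c in enumerate(ballot_fn(candidates)[:K]):
            scores[c] += K - rank
    return [c for c, _ in scores.most_common(K)]

# Stand-in for the fine-tuned ranker: a noisy preference for
# lower-numbered changes (lower = more likely to be the root cause).
def noisy_ballot(candidates, rng=random.Random(0)):
    return sorted(candidates, key=lambda c: c + rng.gauss(0, 5))

changes = list(range(200))  # "a few hundred" candidates from the retriever
top5 = rank_via_election(changes, noisy_ballot)
```

Aggregating many noisy ballots is what lets a small model produce a stable top-5: individual samples disagree, but the election concentrates score on consistently high-ranked candidates.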
- 2024-08-05 — DCPerf: An open source benchmark suite for hyperscale compute applications (first-party Meta Engineering) — Meta open-sources DCPerf, a benchmark suite where each benchmark is anchored to a real Meta production application and validated against production at microarchitectural level (IPC + core-frequency comparison graphs) vs SPEC CPU. Canonical statement of hyperscale compute workload as a distinct market segment requiring its own benchmark suite; canonical wiki instance of patterns/workload-representative-benchmark-from-production and patterns/pre-silicon-validation-partnership (two-year vendor-collaboration on pre-silicon / early-silicon bring-up, multiple core-microarchitecture + SoC-power-management optimizations identified). x86 + ARM, chiplet-architecture evaluation, multi-tenancy for rising core counts.
- 2024-06-16 — Maintaining large-scale AI capacity at Meta (first-party Meta Engineering) — how Meta patches, upgrades, and verifies the GPU training fleet (target: 600,000 GPUs) while guaranteeing "all capacity minus one maintenance domain" up 24/7. Names >30 maintenance operations, >50 component classes, 3 host-verification tasks, "thousands of disruptive AI host tasks per day", and OpsPlanner — Meta's unified disruptive-work orchestrator at ~1M ops/day. Introduces the maintenance train + maintenance domain + two-layer sliding upgrade architecture.
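The "all capacity minus one maintenance domain" invariant can be illustrated with a toy planner — names and the data shapes are invented; OpsPlanner's real scheduling at ~1M ops/day is far richer — where the maintenance train visits one domain at a time and batches that domain's pending disruptive work:

```python
def plan_maintenance(hosts_by_domain, pending):
    """Toy 'maintenance train': visit domains in a fixed order and, at
    each stop, batch every pending disruptive op for that domain's hosts.
    At any moment at most one domain is down, so N-1 domains stay up."""
    schedule = []
    for domain in sorted(hosts_by_domain):
        batch = [(host, op)
                 for host in hosts_by_domain[domain]
                 for op in pending.get(host, [])]
        if batch:
            schedule.append((domain, batch))  # one train stop per domain
    return schedule

fleet = {"dom-a": ["h1", "h2"], "dom-b": ["h3"]}
pending = {"h1": ["kernel-upgrade"], "h3": ["firmware-flash", "gpu-reset"]}
train = plan_maintenance(fleet, pending)
```

Each schedule entry is one train stop: only that domain's capacity is out while its batch runs, which is what makes the 24/7 availability guarantee composable with thousands of disruptive host tasks per day.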
- 2024-06-13 — MLow: Meta's low bitrate audio codec (first-party Meta Engineering) — Meta's new RTC audio codec: CELP + split-band + range-encoded, 2× POLQA MOS over Opus at 6 kbps with 10% lower compute, enabling inband FEC at 14 kbps where Opus needs ≥19 kbps. Shipped on Instagram + Messenger + (rolling) WhatsApp. Canonical Meta example of the classical-DSP-over-ML-on-compute-constrained-targets decision: Encodec was available but >20% of Meta calls are on ARMv7 and 10s of millions of daily WhatsApp calls are on 10+-year-old devices.
- 2024-08-05 — A RoCE network for distributed AI training at scale (first-party Meta Engineering; SIGCOMM 2024 paper summary) — engineering deep-dive on the RoCE fabric underneath the 24K-GPU RoCE GenAI cluster, supporting Llama 3.1 405B training. Introduces the AI Zone two-stage Clos template (RTSW + CTSW + ATSW aggregator), the full routing evolution (baseline ECMP → concepts/path-pinning → E-ECMP + QP scaling with +40% AllReduce), and the counterintuitive congestion-control end state: DCQCN off at 400G for a year+, PFC + NCCL receiver-driven admission instead. Canonical wiki instance of patterns/collective-library-transport-codesign.
- 2024-06-12 — How Meta trains large language models at scale (first-party Meta Engineering) — the canonical Meta statement on the Llama 3 training substrate: two 24,000-GPU H100 clusters built in parallel on RoCE + InfiniBand, modified Grand Teton at 700 W air-cooled, with three stack-level network optimisations and an enumerated GPU-failure taxonomy.
- 2023-07-16 — Lessons Learned Running Presto at Meta Scale (republished on High Scalability; Meta-authored)
- 2015-06-26 — Fighting spam with Haskell (first-party Meta Engineering; Simon Marlow, GHC core developer) — the earliest-dated Meta Engineering post currently on the wiki. Describes Meta's two-year rewrite of Sigma — the anti-abuse rule engine proactively detecting spam / phishing / malware across Facebook — from the in-house DSL FXL to Haskell. Post-rewrite throughput: >1M rps. Canonical Meta disclosure of: (1) the five language-selection criteria for a policy-authoring language — purely functional + strongly typed, automatic fetch batching + overlapping, minutes-to-production deploys, C++-competitive performance, interactive development; (2) Haxl — Meta's open-source Haskell framework for implicit concurrent data fetching (ICFP 2014 paper "There is no fork"; github.com/facebook/Haxl); (3) GHC upstream contributions including Applicative do-notation, per-thread allocation limits, heap-management changes (Meta ≥ 64 MB allocation area per core), garbage-collector changes to detect unreferenced old code for safe hot-swap unload, a finalizer fix, and a GC crash fix for a bug "gone undetected in GHC for several years" exposed by Meta's workload; (4) Haskell-sandwiched-between-two-layers-of-C++ integration pattern — mature C++ Thrift server on top, existing C++ service-client libraries below wrapped as Haxl data sources via FFI (with a compile-time C++ name-demangler to avoid intermediate C shims); (5) "source code in the repo is the code running in Sigma" operational posture, with type-correct-at-ingress as the safety gate; (6) Stackage as a curated-package-set response to Cabal/Hackage version-dependency yak-shaves. Performance outcome: 20–30% overall throughput improvement over FXL; up to 3× on specific request types. This post is the language/runtime axis of Meta's stack, orthogonal to the 2023–2024 data-warehouse / GenAI-training / privacy / source-control / fleet-maintenance corpus already ingested.
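Haxl itself is Haskell, but the implicit-batching idea it embodies is compact enough to render in a few lines of Python — a hedged sketch in which `Fetch`, `round_trip`, and the stub data source are invented names: a round gathers every distinct pending request, issues one batched call, then fills in all the leaves:

```python
class Fetch:
    """Leaf of the computation: a data-source request, not yet executed."""
    def __init__(self, key):
        self.key, self.result = key, None

def round_trip(wanted, data_source):
    """One Haxl-style round: collect every distinct pending key, issue a
    single batched request, then resolve all the leaves at once."""
    keys = sorted({f.key for f in wanted if f.result is None})
    batch = data_source(keys)          # ONE call for N fetches
    for f in wanted:
        f.result = batch[f.key]
    return [f.result for f in wanted]

calls = []
def friend_count_source(keys):
    calls.append(keys)                 # record batching behavior
    return {k: len(k) for k in keys}   # stub: "friend count" = name length

a, b, c = Fetch("alice"), Fetch("bob"), Fetch("alice")
results = round_trip([a, b, c], friend_count_source)
assert results == [5, 3, 5]
assert calls == [["alice", "bob"]]     # duplicates deduped, single batch
```

The point of the Haskell version is that rule authors never write this plumbing: the applicative structure of the computation exposes the independent fetches, and the framework batches and overlaps them automatically.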
Key systems surfaced via Meta¶
PQC migration framework (2026-04-16 PQC migration-strategy post)¶
- systems/ml-kem — new canonical system page. NIST FIPS 203 module-lattice PQ KEM (Kyber). Meta's recommended PQ KEM: ML-KEM-768 default, ML-KEM-512 exception (per NIST PQC FAQ endorsement) for performance-constrained cases.
- systems/ml-dsa-signature (extended) — NIST FIPS 204 module-lattice PQ signature (Dilithium). Meta's recommended PQ signature: ML-DSA-65 default, ML-DSA-44 exception. Meta positions ML-DSA as preferable to SPHINCS+ (larger signatures) and Falcon (floating-point arithmetic requirement).
- systems/hqc — new canonical system page. NIST-selected 2025 (fifth PQC algorithm). Code-based KEM; non-lattice mathematical foundation. Meta cryptographers co-authored HQC (plus BIKE and Classic McEliece). Role: algorithmic-diversity hedge against potential weaknesses in ML-KEM's lattice approach.
- systems/liboqs — new canonical system page. Open Quantum Safe library under Linux Foundation PQCA. Meta is a PQCA member and contributes (supports + fixes bugs including issue #1548).
- systems/fbcrypto (extended) — re-positioned as the library PQC migration flows through — the substrate crypto-inventory's automated-discovery side instruments, and the surface that receives the PQC guardrails (guideline updates + key-generation friction + build-system API rules via Buck).
- systems/buck2 — extended with the crypto-API-guardrails use case: build-system policy-enforcement point warning teams during code review on RSA / ECDH API usage.
PQC migration framework patterns + concepts (2026-04-16 PQC migration-strategy post)¶
- concepts/pqc-migration-levels — new canonical concept. Five-rung maturity ladder (PQ-Unaware → PQ-Aware → PQ-Ready → PQ-Hardened → PQ-Enabled) organised around time-to-react-to-quantum-event. Even PQ-Ready is valuable — it reduces reaction time without yet enabling protection. PQ-Hardened exists for use cases where literature gaps (e.g. efficient PQ-OPRFs) prevent full enablement today.
- concepts/pqc-prioritization-framework — new canonical concept. Three-tier classification by attack class (offline / online / Grover-only) rather than asset value. High (offline-attackable via Shor on public-key encryption + key exchange) split by external-dependency status; Medium (online-attackable via Shor on signatures) split by patching capability; Low (Grover-only on symmetric with inadequate parameters).
- concepts/time-to-react-to-quantum-event — new canonical concept. The urgency metric organising the PQC Migration Levels ladder. Each rung permanently reduces the time needed to respond to a "relevant quantum event" (CRQC breakthrough / standards publication / new attack).
- concepts/crypto-inventory — new canonical concept. The organisation-wide mapping of where cryptographic primitives are used. Prerequisite for any migration. Meta's two-strategy approach: automated discovery + developer reporting.
- concepts/hybrid-vs-replacement-pqc-deployment — new canonical concept. The deployment-path decision axis. Meta's position: hybrid by default during transition, because SIKE's 2022 cryptanalytic invalidation demonstrates newer-standards risk.
- patterns/pqc-migration-ladder — new canonical pattern. Structure the PQC migration programme as a laddered set of reachable milestones with independently-budgetable rungs.
- patterns/crypto-api-guardrails — new canonical pattern. Three-layer prevent-new-vulnerable-usages discipline: (1) updated crypto guidelines; (2) friction on key-generation tooling; (3) build-system rules warning on RSA / ECDH API use during code review.
- patterns/automated-discovery-plus-developer-reporting — new canonical pattern. Inventory-building by combining two complementary mechanisms with disjoint failure modes: runtime monitoring (captures active usage in primary libraries) + developer reporting (captures shadow dependencies, new architectures, intent).
- patterns/third-party-dependency-quantum-assessment (extended) — Meta's three-class external-dependency enumeration (community-vetted standards / hardware / production implementations) + the contribute-upstream-to-unblock-your-own-migration consumer posture.
- concepts/post-quantum-cryptography (extended) — the migration-framework companion to the 2024-12-02 inventory framing. Canonical Meta statement of migration-strategy principles (Effectiveness / Timeliness / Performance / Cost Efficiency).
- concepts/cryptographic-monitoring (extended) — explicit "Crypto Visibility" positioning as the automated-discovery leg of the complementary-inventory pattern.
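The prioritization framework maps cleanly onto a small decision function. A hedged Python sketch — field names and the return shape are invented; the tiers and their splits follow the post's description — of the attack-class-based triage:

```python
def classify_pqc_priority(use):
    """Toy three-tier triage by attack class, not asset value:
    offline-attackable (SNDL-vulnerable) encryption/key exchange outranks
    online-only signatures, which outrank Grover-limited symmetric uses."""
    if use["kind"] in {"public-key-encryption", "key-exchange"}:
        # High: Shor-breakable offline -- split by external-dependency status.
        return ("High", "external" if use.get("external_dependency") else "internal")
    if use["kind"] == "signature":
        # Medium: Shor-breakable but only online -- split by patching capability.
        return ("Medium", "hardware-baked" if use.get("hardware_key") else "software-upgradable")
    if use["kind"] == "symmetric" and use.get("bits", 0) < 256:
        # Low: only Grover applies, and only with inadequate parameters.
        return ("Low", "inadequate-parameters")
    return ("None", "adequate")

assert classify_pqc_priority({"kind": "key-exchange"}) == ("High", "internal")
assert classify_pqc_priority({"kind": "signature", "hardware_key": True}) == ("Medium", "hardware-baked")
assert classify_pqc_priority({"kind": "symmetric", "bits": 128}) == ("Low", "inadequate-parameters")
```

In practice a crypto-inventory record (from automated discovery or developer reporting) would be the input, and the output tier would drive the order in which PQC migration rungs are funded.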
Fork modernization + dual-stack WebRTC (2026-04-09 WebRTC-escape-the-fork post)¶
- systems/meta-webrtc-shim — Meta's dual-stack proxy library sitting between application code and two coexisting WebRTC implementations (legacy + upstream), exposing a single version-agnostic API and dispatching per-call via a runtime flavor enum. Statically linked into every app binary so 50+ RTC use cases (Messenger, Instagram, Cloud Gaming, Meta Quest VR casting) can A/B test each new upstream release against the legacy baseline. Load-bearing techniques: scripted namespace rewriting (
webrtc:: → webrtc_latest:: / webrtc_legacy::) resolving thousands of ODR collisions; bulk using declarations for backward compatibility; AST-based codegen (1/day → 3–4/day); Buck build-graph duplication for injected components. 10,000+ new lines of shim code; hundreds of thousands of lines modified across thousands of files; no major issues. Binary-size cost: 5 MB uncompressed at the WebRTC layer vs 38 MB at the call-orchestration layer (87% reduction from layer choice). Outcome: launched webrtc/latest at M120, now at M145 ("living at head").
- systems/libwebrtc — the upstream Google/Chromium-maintained WebRTC library Meta used to fork. Each Chromium release has an anchor Git tag (M143 = tag 7499, M145 ≈ 7559). Meta is now "living at head," ingesting new upstream releases continuously.
- systems/meta-quest — one of the 50+ RTC use cases migrated onto the shim; the VR-casting surface. Named here with the other 49+ surfaces as an illustration of the shim's reach across Meta's client surfaces.
- systems/buck2 — Meta's open-source build system; load-bearing for the WebRTC shim's injected-component duplication path ("dynamically changing namespaces at build time, duplicating the high-level build target, and exposing symbols for both flavors through a single header"). The alternative to source-level shimming when components plug deeply into WebRTC internals.
- systems/chromium-git — the Chromium Git + tooling context libwebrtc lives in. Meta's external patch-tracking Git repo is based directly on libwebrtc's tree so Meta can reuse the Chromium tools (
gn, gclient, git cl) without building an internal parallel.
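The dual-stack dispatch idea is simple enough to sketch. A hedged Python toy — the real shim is C++ statically linked into app binaries, and every name here (`Flavor`, `PeerConnectionShim`, the two impl classes) is invented for illustration — of a version-agnostic facade dispatching per-call on a runtime flavor enum:

```python
from enum import Enum

class Flavor(Enum):
    LEGACY = "legacy"
    LATEST = "latest"

# Stand-ins for the two statically linked library copies
# (think webrtc_legacy:: vs webrtc_latest:: after renamespacing).
class LegacyPeerConnection:
    def version(self):
        return "legacy-fork"

class LatestPeerConnection:
    def version(self):
        return "M145"

class PeerConnectionShim:
    """Version-agnostic facade: one API for application code, per-call
    dispatch on a runtime flavor enum, so each of the 50+ RTC surfaces
    can A/B test a new upstream release against the legacy baseline."""
    _impls = {Flavor.LEGACY: LegacyPeerConnection,
              Flavor.LATEST: LatestPeerConnection}

    def __init__(self, flavor):
        self._impl = self._impls[flavor]()

    def version(self):
        return self._impl.version()

assert PeerConnectionShim(Flavor.LEGACY).version() == "legacy-fork"
assert PeerConnectionShim(Flavor.LATEST).version() == "M145"
```

Because the flavor is chosen at runtime (in production, by an experiment framework), shipping a new upstream release is just flipping the enum for a cohort — no rebuild of application code against a new API.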
Fork modernization patterns + concepts (2026-04-09 WebRTC post)¶
- patterns/shim-for-dual-stack-ab-testing — new canonical pattern. Interpose a shim at the lowest practical layer to statically link two versions of a library in one binary, dispatch per-call at runtime, A/B test the upgrade across consumers. The load-bearing architectural move of Solution 1.
- patterns/ast-codegen-for-boilerplate-shim — new canonical pattern. AST-parse the library's headers to emit adapter/converter/unit-test scaffolding for each class/struct/enum/constant. Velocity lift: 1 shim/day → 3–4 shims/day. Extends concepts/abstract-syntax-tree into the C++ codegen axis distinct from the query-language axis.
- patterns/bulk-namespace-import-for-backcompat — new canonical pattern. Instead of hand-forward-declaring every symbol from `webrtc_latest::` back into `webrtc::`, use bulk C++ `using` declarations: zero binary-size cost, new symbols automatic, single-flavor builds naturally supported, external engineers write unchanged code.
- patterns/external-feature-branch-repo-for-monorepo-patches — new canonical pattern. Resolution to Solution 2's constraint: the monorepo lacks branching, so maintain OSS patches as tag-anchored Git feature branches in a separate repo based on the upstream's own tree. Four named benefits: parallelizable, preserves Git history, LLM-friendly for auto-conflict-resolution, submit-ready upstream.
- patterns/fork-retirement-via-ab-test — new canonical pattern. The migration strategy enabled by the shim: A/B test each use case app-by-app, ship, delete the legacy code, keep the shim for ongoing upstream-release A/Bs. Sibling to patterns/upstream-the-fix for cases where upstreaming the fork's features isn't possible immediately.
- concepts/internal-fork-divergence — new canonical concept. The "forking trap": internal fork + upstream evolution + local patches → merge cost becomes prohibitive. The failure mode that the escape-the-fork architecture exists to escape.
- concepts/shim-layer — new canonical concept. A thin proxy between consumers and one or more underlying implementations. Where you shim has order-of-magnitude cost consequences (5 MB vs 38 MB in Meta's case).
- concepts/odr-violation — new canonical concept. C++ One-Definition-Rule violation. Forced by statically linking two copies of a library. Fixed by namespace renaming.
- concepts/symbol-renamespacing — new canonical concept. Scripted rewrite of every namespace in a library copy to a flavor-specific prefix so both copies can coexist. The mechanical enabler of dual-stack.
- concepts/feature-branch-patch-management — new canonical concept. Tracking patches against upstream OSS as Git feature branches per upstream tag, rather than stored patch files.
- concepts/runtime-flavor-dispatch — new canonical concept. Template-specialization-based dispatch between implementations chosen by a runtime flag. The code-organization pattern that makes a dual-stack shim maintainable.
- patterns/upstream-the-fix (extended) — ninth canonical instance and the dual-stack-A/B-harness variant, orthogonal to the seventh-instance FFmpeg "upstream the features, then delete the fork" shape. Both end at fork retirement via different mechanisms.
- concepts/binary-size-bloat (extended) — the shim-layer-placement variant: Meta's 5 MB vs 38 MB datum canonicalises shim-layer-selection as a binary-size decision.
- concepts/abstract-syntax-tree (extended) — the C++-codegen variant: AST of library headers → baseline adapter/converter/test scaffolding.
- concepts/monorepo (extended) — the missing-branches-force-external-repo variant: Meta's monorepo constraint is what drives Solution 2's external Git repo.
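Meta implements runtime flavor dispatch with C++ template specialization; a Python analogue of the shape (all class and enum names are assumptions) shows how one version-agnostic API fronts two coexisting implementations:

```python
from enum import Enum

class Flavor(Enum):
    LEGACY = "legacy"
    LATEST = "latest"

# The two statically linked library copies, stubbed as classes here.
class LegacyPeerConnection:
    def version(self) -> str:
        return "legacy"

class LatestPeerConnection:
    def version(self) -> str:
        return "latest"

_IMPLS = {Flavor.LEGACY: LegacyPeerConnection, Flavor.LATEST: LatestPeerConnection}

class ShimPeerConnection:
    """Version-agnostic API; the implementation is chosen by a runtime flag,
    so an A/B test flips the flavor without touching application code."""
    def __init__(self, flavor: Flavor):
        self._impl = _IMPLS[flavor]()   # runtime flavor dispatch
    def version(self) -> str:
        return self._impl.version()
```

The shim class is the only surface the 50+ RTC use cases see; retiring the legacy flavor later means deleting one `_IMPLS` entry, not rewriting callers.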
Capacity efficiency AI platform (2026-04-16 Capacity Efficiency post)¶
- systems/meta-capacity-efficiency-platform — Meta's unified AI-agent platform for hyperscale performance engineering. Two layers: MCP Tools (profiling queries · experiment results · configuration history · code search · documentation) + Skills (senior-engineer reasoning patterns encoded as domain-expertise modules telling the LLM which tools to use + how to interpret results). Same tools across offense + defense; skills differ per use case. Compounding platform-leverage: "each new capability requires few to no new data integrations since they can just compose existing tools with new skills." Program outcomes: hundreds of megawatts of power recovered program-wide; ~10-hour manual investigations compressed to ~30 minutes (~20×). Canonical wiki instance of patterns/mcp-tools-plus-skills-unified-platform.
- systems/fbdetect — Meta's in-house performance-regression-detection tool (SOSP 2024). Catches regressions as small as 0.005% in noisy production environments; surfaces "thousands of regressions weekly"; correlates each regression to a root-cause PR using traditional techniques. The defense-side detector the AI Regression Solver acts on. First wiki ingest; SOSP 2024 paper at tangchq74.github.io/FBDetect-SOSP24.pdf.
- systems/meta-ai-regression-solver — Meta's defensive AI agent producing fix-forward PRs for FBDetect-detected regressions, sent to the original root-cause PR author for review. Replaces the rollback-vs-ignore binary with a third option: automated mitigation. Three-phase pipeline: context (regressed functions + root-cause PR + changed files) → skill (e.g. logging → sampling) → resolution (PR to root-cause author).
- systems/model-context-protocol (extended) — Meta's tool layer speaks MCP. Adds the hyperscaler-internal-infrastructure-tools canonical instance to the MCP corpus alongside the existing SaaS-facing instances (Datadog, Dropbox Dash, Cloudflare, Fly.io).
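The three-phase fix-forward pipeline (context → skill → resolution) can be sketched as follows; every name, field, and skill recipe below is an illustrative assumption, not Meta's code:

```python
from dataclasses import dataclass

@dataclass
class Regression:
    metric: str
    delta_pct: float
    root_cause_pr: str   # PR the detector attributed the regression to
    author: str

# Skill catalogue: regression class -> mitigation recipe (illustrative).
SKILLS = {
    "logging": "switch hot-path logging to sampled logging",
}

def propose_fix(reg: Regression, regression_class: str):
    """Emit a fix-forward PR routed to the original author for review,
    rather than rolling the root-cause PR back or ignoring the regression."""
    if regression_class not in SKILLS:
        return None  # outside the skill catalogue: fall back to human triage
    return {
        "base_pr": reg.root_cause_pr,
        "reviewer": reg.author,          # author accountability preserved
        "mitigation": SKILLS[regression_class],
    }

pr = propose_fix(Regression("cpu_cycles", 0.02, "D123456", "alice"), "logging")
```

The `None` branch matters: the pattern dominates rollback/ignore only where the skill catalogue covers the regression class.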
Capacity efficiency patterns + concepts (2026-04-16 Capacity Efficiency post)¶
- patterns/mcp-tools-plus-skills-unified-platform — new canonical pattern. Two-layer platform: shared MCP tool surface + pluggable skill bundles + per-use-case agents. Same tools across specialties; specialties differ only in skills. Tool-reuse amortizes data-integration cost; skill authoring becomes the senior-engineer leverage mechanism. Adding a new AI use case = writing a new skill bundle, not a new pipeline.
- patterns/ai-generated-fix-forward-pr — new canonical pattern. Detector attributes regression to root-cause PR → AI agent generates mitigation PR → routes to root-cause PR author for review. Dominates both baselines (rollback pays velocity; ignore pays capacity) when the skill catalogue covers the regression class. Canonical instance: Meta AI Regression Solver on top of FBDetect.
- patterns/opportunity-to-pr-ai-pipeline — new canonical pattern. Offense sibling to ai-generated-fix-forward-pr. Opportunity library + pattern docs + examples + validation criteria → AI agent generates candidate fix with guardrails (syntax/style/right-issue) → lands in engineer's editor for one-click apply. Compresses per-candidate investigation from hours to review-minutes; handles the long tail that engineers would never get to manually.
- PQC-post entries (2026-04-16 Post-Quantum Cryptography post): systems/ml-kem, systems/hqc, systems/liboqs, systems/ml-dsa-signature, concepts/pqc-migration-levels, concepts/pqc-prioritization-framework, concepts/time-to-react-to-quantum-event, concepts/hybrid-vs-replacement-pqc-deployment, concepts/crypto-inventory, patterns/pqc-migration-ladder, patterns/crypto-api-guardrails, patterns/automated-discovery-plus-developer-reporting, patterns/third-party-dependency-quantum-assessment.
- patterns/specialized-agent-decomposition (extended) — fifth framing: skill-over-shared-tools composition. Agents specialise by skill bundle rather than by subject-matter domain (Storex) / sub-tool complexity (Dash) / refinement-loop role (DS-STAR) / offline-pipeline stage (Pre-Compute Engine). Meta's platform carries ≥7 specialist agents over the same tool layer.
- patterns/closed-feedback-loop-ai-features (extended) — fourth Meta-domain instance after RCA + Kotlinator + Friend Bubbles. Fix-forward PR routing to root-cause author + offense candidate in engineer's editor are both human-in-the-loop closures; author accountability is preserved.
- concepts/capacity-efficiency — new canonical concept. The discipline of reducing compute/power/capacity demand per unit of product value at hyperscale. Meta's program-level frame — at 3B+ user scale, 0.1% regressions cost significant power. Human-engineer-time is the named bottleneck; AI that multiplies per-engineer throughput is the lever.
- concepts/offense-defense-performance-engineering — new canonical concept. The two-sided frame: proactively find optimizations (offense) + catch regressions (defense). Both problems share the same data shape — same tool layer, different skills — which is what made the unified platform economical. Generalisable to reliability / security / cost.
- concepts/encoded-domain-expertise — new canonical concept. Meta's "skills" primitive: senior-engineer reasoning patterns expressed as markdown modules telling the LLM which tools to invoke + how to interpret results. Model-agnostic (markdown not embeddings); sibling to compass-shape context files from the 2026-04-06 Pre-Compute Engine post. Compass-shape is descriptive; skills are prescriptive.
- concepts/context-engineering (extended) — third Meta 2026 instance alongside Pre-Compute Engine (offline compass-shape files) and implicit retrieval-augmented instances. Meta's Capacity Efficiency Platform is the runtime-composed skills + tools form; pairs with the offline-preloaded compass-shape form to cover both prescriptive-and-descriptive encoded-knowledge axes.
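The tools-plus-skills composition reduces to a few lines: a shared tool surface, per-use-case skill bundles, and an agent that is nothing but the pairing. Tool names and return values here are invented for illustration:

```python
# Shared MCP-style tool layer: every agent sees the same tools.
TOOLS = {
    "profiling_query": lambda target: {"function": target, "cpu_pct": 1.2},
    "code_search": lambda target: [f"www/{target}.php"],
}

# Skills: per-use-case reasoning modules naming which tools to invoke
# (Meta encodes these as markdown modules; plain lists stand in here).
SKILLS = {
    "regression_triage": ["profiling_query", "code_search"],
    "optimization_hunt": ["profiling_query"],
}

def make_agent(skill: str):
    """A new AI use case = a new skill bundle over existing tools,
    not a new data-integration pipeline."""
    tool_names = SKILLS[skill]
    def agent(target: str):
        return {name: TOOLS[name](target) for name in tool_names}
    return agent

triage = make_agent("regression_triage")
```

Adding an eighth specialist agent touches only `SKILLS`; the tool layer (and its data integrations) is amortized across all of them, which is the economic core of the pattern.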
AI-agent pre-compute engine + tribal-knowledge extraction (2026-04-06 AI-pre-compute post)¶
- systems/meta-ai-precompute-engine — Meta's Data Platform internal production system that produces and maintains a navigable tribal-knowledge layer over a 4-repo / 3-language / 4,100-file config-as-code pipeline. Three parts: (1) pre-compute swarm — 50+ specialised agents in one session producing 59 compass-shaped context files (25-35 lines / ~1,000 tokens each, 4 mandated sections); (2) runtime surface — the 59 files + cross-repo dependency index (30× compression on "what depends on X?") + data-flow maps + NL orchestration layer; (3) self-maintenance loop running "every few weeks" enforcing zero hallucinated file paths. Preliminary results on 6 tasks: 40% fewer tool calls per task, ~2 days → ~30 min workflow-guidance cycles, critic scores 3.65 → 4.20 / 5.0 across 3 rounds, 0 hallucinated paths. Model-agnostic — markdown not embeddings. Reuses Meta's operational-AI lineage (Meta RCA 2024-08-23) for the "is the pipeline healthy?" routing path via 85+ historical incident patterns.
AI-agent pre-compute engine + tribal-knowledge patterns + concepts (2026-04-06 AI-pre-compute post)¶
- patterns/precomputed-agent-context-files — new canonical pattern. Extract module-level knowledge (purpose / modification patterns / build-failure patterns / deps / tribal knowledge) into compass-shaped markdown files via a one-session multi-agent orchestration pass; consume them at runtime opt-in per task. Pre-training-overlap-sensitive — helps on proprietary tribal-knowledge-heavy code, hurts on codebases LLMs already saw (2025 academic research on Django/matplotlib). Extraction pays once; consumption pays many times.
- patterns/multi-round-critic-quality-gate — new canonical pattern. Gate AI-generated durable artifacts behind 3 rounds of independent critic-agent scoring + fixer-agent corrections with hard invariants (zero hallucinated paths) at the final round. Distinct from runtime concepts/llm-as-judge by applying pre-release to durable content with a separate fixer stage between rounds. Meta: 10+ critic passes × 3 rounds + 4 fixers took scored quality from 3.65 → 4.20 / 5.0.
- patterns/five-questions-knowledge-extraction — new canonical pattern. Per-module analyst methodology: (1) what does this configure? (2) common modification patterns? (3) non-obvious build-failure patterns? (4) cross-module deps? (5) tribal knowledge in comments? Failure-first shape — questions 3 and 5 target silent-wrong-output mitigations. Q5 produced 50+ non-obvious patterns at Meta. Maps 1:1 to the four compass-shape sections of the output file.
- patterns/self-maintaining-context-layer — new canonical pattern. Four-step automated refresh loop (validate paths / detect coverage gaps / re-run critics / auto-fix stale references) at a bounded cadence (Meta: "every few weeks"). Operational answer to concepts/context-file-freshness; applicable to runbooks / migration guides / API changelogs beyond context files.
- concepts/tribal-knowledge — new canonical concept. Undocumented domain-specific conventions, invariants, and failure modes that live only in engineers' heads and scattered code comments. Pretraining-overlap-asymmetry argument: the knowledge LLMs don't have is precisely the knowledge that matters on proprietary codebases.
- concepts/compass-not-encyclopedia — new canonical concept. Per-module context files at 25-35 lines / ~1,000 tokens with 4 mandated sections (Quick Commands / Key Files / Non-Obvious Patterns / See Also). Design principle: "No fluff, every line earns its place." Token-budget discipline at the per-file level; composes with context-engineering at the per-turn level.
- concepts/config-as-code-pipeline — new canonical concept. Workload class whose behaviour is primarily driven by code-versioned config files across multiple subsystems; the highest-yield workload for the precompute-context-files pattern. Meta's example: 4 repos / 3 languages / 6 subsystems that synchronise on every data-field change.
- concepts/context-file-freshness — new canonical concept. "Stale context is worse than no context" — stale context makes agents confidently wrong where no context makes them slowly explore. Asymmetry argument + automated-refresh operational discipline.
- concepts/context-engineering (extended) — canonical offline-preloading instance added alongside the existing retrieval-augmented (Instacart) and runtime-budget (Fly.io / Dropbox / Datadog) instances.
- patterns/specialized-agent-decomposition (extended) — fourth framing added: the offline-context-generation framing with 9 pipeline-stage roles. Sits alongside Storex (domain-based), Dash (sub-tool), DS-STAR (role-in-refinement-loop).
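A critic-agent quality gate over compass-shaped files might combine the three checks the post mandates: the four sections, the line budget, and the zero-hallucinated-paths invariant. A hedged sketch; the section names come from the post, the heuristics are assumptions:

```python
REQUIRED_SECTIONS = ["Quick Commands", "Key Files", "Non-Obvious Patterns", "See Also"]
MAX_LINES = 35   # compass, not encyclopedia

def validate_compass_file(text: str, repo_paths: set) -> list:
    """Return a list of problems; an empty list means the file passes the gate."""
    problems = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"too long: {len(lines)} > {MAX_LINES} lines")
    for section in REQUIRED_SECTIONS:
        if not any(section in ln for ln in lines):
            problems.append(f"missing section: {section}")
    # Hard invariant: every referenced file path must exist in the repo.
    for ln in lines:
        for token in ln.split():
            if token.endswith((".py", ".cpp", ".conf")) and token not in repo_paths:
                problems.append(f"hallucinated path: {token}")
    return problems
```

The same function slots into both the multi-round critic gate (pre-release) and the self-maintenance loop (periodic re-validation against the live repo).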
LLM-scale ads ranking (2026-03-31 Adaptive Ranking Model post)¶
- systems/meta-adaptive-ranking-model — Meta Ads's LLM-scale ads-ranking serving stack. Three pillars: (1) request-centric computation (heavy user tower once per request, shared across ad candidates via in-kernel broadcast; long user sequences in centralised KV store joined to training data on the fly); (2) model-system co-design (selective FP8 guided by micro-benchmark, operator fusion, Grouped GEMM, horizontal fusion — 35% MFU across multiple hardware types); (3) multi-card embedding sharding (O(1T) parameter scale, parity with single-card setups). Launched Instagram Q4 2025 (+3% conversions, +5% CTR for targeted users); O(10 GFLOPs)/token complexity at O(100 ms) bounded latency. Accelerated model loading in <10 min via multi-stream + remote caching; SM-utilisation-based auto-scaling.
- systems/wukong-turbo — optimised runtime evolution of Meta's internal Wukong ranking architecture used inside Adaptive Ranking Model. Adds no-bias for numerical stability, small-parameter delegation from FSDP to DDP to cut all-gather overhead, and sparsity-based linear-layer simplification — without increasing FLOPs or parameter counts.
- systems/wukong-meta — stub for the 2024 Wukong paper (arXiv:2403.02545) architecture Wukong Turbo builds on: stackable factorisation machines, sequence learning, cross-layer attention.
- systems/meta-instagram — extended with Q4 2025 Adaptive Ranking Model launch as the only named product deployment surface in the post (+3% conversions, +5% CTR for targeted users).
LLM-scale ads ranking patterns + concepts (2026-03-31 Adaptive Ranking Model post)¶
- concepts/inference-trilemma-recsys — three-way tension between model complexity, sub-second latency, and cost efficiency at LLM-scale recsys serving. Meta's explicit design frame for Adaptive Ranking Model.
- concepts/request-oriented-computation-sharing — heavy user-context computed once per request, broadcast to candidates in-kernel; transforms scaling from linear to sub-linear in candidate count. Extended by in-kernel broadcast for zero extra HBM traffic.
- concepts/request-oriented-sequence-scaling — long user behaviour sequences processed once per request; centralised KV store of user logs joined with training data on the fly replaces per-row replication.
- concepts/selective-fp8-quantization — post-training quantisation applying FP8 only to micro-benchmark-verified precision-tolerant layers; the alternative to quality-destroying blanket FP8 casts for ranking-sensitive domains.
- concepts/multi-card-embedding-sharding — architectural primitive for embedding tables exceeding single-GPU memory; split across hardware-aware cluster with communication optimisations, achieving parity with single-card setups.
- concepts/unified-embeddings — multiple features share one embedding table, reducing memory footprint.
- concepts/hash-collision-embedding-tradeoff — core tension in embedding-table sizing (oversize overfits, undersize collides); motivates sparsity-aware allocation + pruning + unified embeddings.
- concepts/hardware-aware-model-architecture — model-design discipline aligning structure with accelerator capabilities (dtype, memory hierarchy, Tensor Core shapes, kernel-launch overhead). Canonical wiki statement tied to 35% MFU outcome.
- concepts/model-flops-utilization — extended with Meta's 35% MFU data point across heterogeneous hardware in recsys serving.
- patterns/request-centric-inference-architecture — the overall pattern; restructures inference from per-(user, candidate) to per-request.
- patterns/model-system-codesign-ranking — the four co-design levers (selective FP8, operator fusion for shared inputs, Grouped GEMM, horizontal fusion) that drive MFU.
- patterns/multi-card-sharded-embedding-serving — serving-layer pattern for terabyte-scale embedding tables; decouples parameter count from single-GPU memory ceilings.
- patterns/selective-mixed-precision-quantization — generalised wiki pattern for per-layer micro-benchmark-guided precision assignment; Adaptive Ranking Model is the canonical ranking-domain instance.
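The request-centric restructuring can be illustrated with toy shapes: the expensive user tower runs once per request and its output is broadcast across all candidates. Dimensions and weights below are invented; Meta performs the broadcast in-kernel to avoid extra HBM traffic:

```python
import numpy as np

rng = np.random.default_rng(0)
d_user, d_ad, n_ads = 64, 32, 500            # toy dimensions

W_user = rng.normal(size=(128, d_user))      # heavy "user tower"
W_joint = rng.normal(size=(1, 128 + d_ad))   # light per-candidate head

def score_request(user_feats, ad_feats):
    """Per-(user, candidate) -> per-request: the user tower is computed ONCE
    and shared, so its cost no longer scales linearly with candidate count."""
    u = W_user @ user_feats                            # (128,) once per request
    u_b = np.broadcast_to(u, (len(ad_feats), 128))     # shared across candidates
    joint = np.concatenate([u_b, ad_feats], axis=1)    # (n_ads, 160)
    return (joint @ W_joint.T).ravel()                 # one score per candidate

scores = score_request(rng.normal(size=d_user), rng.normal(size=(n_ads, d_ad)))
```

With 500 candidates the naive layout runs the tower 500 times; here it runs once, which is the linear-to-sub-linear shift the post claims.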
Foundational allocator stewardship (2026-03-02 jemalloc post)¶
- systems/jemalloc — Meta's tier-0 foundational memory allocator, maintained upstream. Originally by Jason Evans (FreeBSD, 2005); Facebook-era standard allocator. GitHub repository was archived in 2024 during a stewardship drift; unarchived on 2026-03-02 alongside Meta's public reset. 2026 roadmap: technical debt reduction, HPA improvements for THP CPU efficiency, memory efficiency (packing/caching/purging), AArch64 out-of-box performance. Positioned "alongside the Linux kernel and the compilers" as the foundational-software class requiring "the highest rigor." Previously on the wiki only as the memory-profiler backend of Strobelight — this post promotes it to a first-class Meta page.
Video transcoding infrastructure (2026-03-09 FFmpeg post)¶
- systems/ffmpeg — the open-source multimedia CLI that Meta runs tens of billions of times per day for DASH VOD + livestream transcoding. Working upstream with FFlabs and VideoLAN, Meta co-developed the two features that let it fully retire its internal FFmpeg fork: threaded multi-lane transcoding (FFmpeg 6.0 → 8.0, "the most complex refactoring of FFmpeg in decades") and in-loop decoding (FFmpeg 7.0+) for real-time per-lane quality metrics. Both upstreaming efforts spanned years and multiple releases.
- systems/ffprobe — FFmpeg's companion media-inspection CLI. Meta runs it at the same invocation scale as `ffmpeg`.
- systems/meta-msvp — Meta Scalable Video Processor, Meta's custom video-transcoding ASIC. Integrated into Meta's FFmpeg via the same standard hardware-accel API that exposes NVDEC/NVENC/UVD/QSV. Patches kept internal because MSVP hardware is Meta-only and external FFmpeg developers cannot validate changes without it — canonical wiki instance of patterns/keep-infrastructure-specific-patches-internal. Meta accepts the reverse-rebase cost against each new upstream release.
- systems/nvidia-nvenc-nvdec — NVIDIA's fixed-function hardware encode/decode engines. Pre-existing FFmpeg hardware-accel target; named alongside MSVP as a peer implementation of the shared API.
- systems/intel-quick-sync-video — Intel's iGPU media engine. Pre-existing FFmpeg hardware-accel target; second peer implementation of the shared API named in the post.
Video transcoding patterns + concepts (2026-03-09 FFmpeg post)¶
- concepts/video-transcoding — the general primitive; decode a source + re-encode to one or more target encodings. FFmpeg is the canonical toolchain.
- concepts/adaptive-bitrate-streaming-dash — DASH (Dynamic Adaptive Streaming over HTTP); client dynamically selects among pre-encoded renditions. Forces a multi-lane encoding ladder at production time.
- concepts/multi-lane-encoding-pipeline — architectural shape producing N encoded outputs from one source; the ladder for DASH. Single-process multi-output is the efficient form.
- concepts/in-loop-quality-metrics — reference metrics (PSNR/SSIM/VMAF) computed during transcoding by inserting a decoder after each encoder; the unblock for livestream quality monitoring.
- concepts/visual-quality-metric — the metric category itself. Reference metrics (PSNR/SSIM/VMAF) vs no-reference metrics. Post names PSNR, SSIM, VMAF as the production-relevant set.
- concepts/hardware-accelerated-video-codec-api — FFmpeg's shared abstraction across NVENC/NVDEC/UVD/QSV/MSVP; the architectural primitive that keeps MSVP integration narrowly scoped and reverse-rebasable.
- patterns/deduplicate-decode-across-encoder-lanes — one FFmpeg command, one decoder, N parallel encoders. Canonical Meta-driven upstream win (FFmpeg 6.0 → 8.0).
- patterns/in-loop-decoder-for-realtime-quality-metrics — per-lane reference metrics during encoding, in a single command. Canonical Meta-driven upstream win (FFmpeg 7.0+).
- patterns/keep-infrastructure-specific-patches-internal — the explicit complement to patterns/upstream-the-fix introduced by this post; MSVP is the canonical instance. Together the two patterns form Meta's decision framework for "upstream it / keep it internal".
- patterns/upstream-the-fix — extended: Meta × FFmpeg (6.0 → 8.0, 7.0+) is the seventh canonical instance on the wiki and the highest-stakes outcome to date (multi-year, multi-release, fork retirement as the tangible result).
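In-loop quality monitoring hinges on reference metrics such as PSNR, computed between the source frame and the frame decoded back from each encoder lane. A minimal numpy version of the standard PSNR formula:

```python
import numpy as np

def psnr(reference: np.ndarray, decoded: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a source frame and the frame the
    in-loop decoder recovers from one encoder lane's output."""
    mse = np.mean((reference.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # lossless lane: identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

src = np.full((4, 4), 100, dtype=np.uint8)
lossy = src.copy()
lossy[0, 0] = 110             # one perturbed pixel
quality_db = psnr(src, lossy)
```

Running this per lane inside a single transcoding command (rather than in a separate decode pass afterwards) is exactly what the FFmpeg 7.0+ in-loop-decoder work unblocked for livestreams.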
WhatsApp client-side Rust + media-attack-surface defense (2026-01-27 Rust-at-scale post)¶
- systems/whatsapp-wamedia — WhatsApp's cross-platform media library, rewritten from 160,000 LoC C++ (without tests) → 90,000 LoC Rust (with tests) with performance + runtime-memory advantages. Deployed across Android / iOS / Mac / Web / Wearables; shipped monthly to "billions of phones, laptops, desktops, watches, and browsers" through WhatsApp + Messenger + Instagram. "The largest ever deployment of Rust code to a diverse set of end-user platforms and products that we are aware of."
- systems/whatsapp-kaleidoscope — WhatsApp's ensemble of format-conformance / risk-indicator / file-type-spoof / dangerous-type checks that wamedia enables; the app-layer malware-defense layer that sits in front of OS-provided media parsers the app cannot patch. Canonical defense-in-depth instance on the client-side / media-processing axis.
- systems/whatsapp — host product; 3B+ daily users; default E2EE; the cluster of security/privacy systems on this wiki now spans Private Processing (AI over E2EE via TEE), Advanced Chat Privacy (per-chat controls), wamedia + Kaleidoscope (client-side media malware defense), and the Research Proxy (bug-bounty-facing).
- systems/messenger + systems/meta-instagram — the other two Meta products that ship the Rust-rewritten wamedia monthly. Same 2015-Stagefright forcing function applies; same rollout cadence.
- systems/whatsapp-research-proxy — first canonical wiki instance of a bug-bounty research proxy: Meta publishes a tool "that makes research into WhatsApp's network protocol more effective" to expand the effective security-review headcount via bug-bounty incentive alignment.
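One Kaleidoscope-style check, file-type spoofing, reduces to comparing a file's magic bytes against its claimed type. A toy sketch; the signatures below are the standard ones, but the check shape is an assumption, not WhatsApp's code:

```python
# Well-known magic numbers for a few media/document types.
MAGIC = {
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "pdf": b"%PDF",
}

def is_type_spoofed(claimed_ext: str, payload: bytes) -> bool:
    """Flag media whose leading bytes don't match the claimed type's
    signature: a classic vector against OS media parsers the app itself
    cannot patch, hence the app-layer check in front of them."""
    sig = MAGIC.get(claimed_ext.lower())
    if sig is None:
        return False          # unknown type: defer to other ensemble checks
    return not payload.startswith(sig)
```

The production ensemble layers this with format-conformance, risk-indicator, and dangerous-type checks, which is what makes it a defense-in-depth instance rather than a single filter.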
Private AI inference on TEE (2025-04-30 WhatsApp Private Processing post)¶
- systems/whatsapp-private-processing — Meta's confidential-computing infrastructure for running AI features over WhatsApp's end-to-end-encrypted messages without Meta, WhatsApp, or any third party ever seeing the plaintext. First use case: message summarisation + writing suggestions. Canonical wiki instance of the TEE-for-private-AI-inference architectural pattern. Stacks five foundational requirements (confidential processing, enforceable guarantees, verifiable transparency, non-targetability, stateless + forward security) + six wire-session phases (anonymous credentials → HPKE key fetch → OHTTP through third-party relay → RA-TLS with attestation-against-ledger → ephemeral E2EE request → response). Announcement voice; launch projected "in the coming weeks" from 2025-04-30; security engineering design paper promised at launch.
- systems/cvm-confidential-virtual-machine — the CPU-based Confidential Virtual Machine + Confidential-Compute-mode GPU primitive Meta builds Private Processing on. Memory encrypted in hardware; host OS + hypervisor removed from the TCB; remote shell prohibited; CVM-to-CVM calls ride the same RA-TLS primitive. Neither CPU vendor (AMD SEV-SNP / Intel TDX / Arm CCA) nor GPU vendor/mode named in the post.
- systems/meta-acs-anonymous-credentials — Meta's Anonymous Credential Service (ACS), open-sourced December 2022, now load-bearing in Private Processing's authentication phase. Mints credentials that prove "authentic WhatsApp client" without identifying the user — the prerequisite that makes OHTTP-relay-based non-targetability actually non-targetable (identity-bearing auth tokens inside the tunnel would defeat IP-stripping at the relay).
- systems/whatsapp-advanced-chat-privacy — the pre-existing WhatsApp feature giving users a per-chat opt-out for AI features. Composes with Private Processing: Advanced Chat Privacy provides chat-granularity refusal; Private Processing provides request-granularity E2EE-preserving infrastructure.
Private AI inference on TEE patterns + concepts (2025-04-30 Private Processing post)¶
- concepts/trusted-execution-environment — the generic TEE primitive class; CVMs are the VM-granularity realisation.
- concepts/confidential-computing — the industry-wide posture of protecting data in use via TEEs; the third leg alongside at-rest + in-transit encryption.
- concepts/remote-attestation — hardware-rooted proof that a specific binary is loaded in a genuine TEE; gated against a published ledger.
- concepts/ra-tls — the TLS composition that binds session-key release to attestation verification.
- concepts/oblivious-http — the two-party-trust-partitioned transport that strips client IP at a third-party relay.
- concepts/hpke — the cryptographic primitive (RFC 9180) underneath OHTTP's inner encryption.
- concepts/non-targetability — first-class security property: attack cost scales with fleet, not victim. Enabled by OHTTP + anonymous credentials + attestation-against-ledger.
- concepts/stateless-processing — service-level discipline making later compromise unable to recover past sessions (there's nothing to find).
- concepts/forward-security — ephemeral-key sibling of statelessness; TEE-non-extractable keys destroyed at session end.
- concepts/verifiable-transparency-log — third-party-operated append-only ledger of acceptable binary digests; turns attestation from audit evidence into enforcement.
- concepts/data-minimization — explicit design axis: requests to Private Processing carry only the data needed for the immediate step (summarisation request = only the messages being summarised).
- concepts/end-to-end-encryption — the invariant Private Processing preserves across a server-side compute step.
- concepts/anonymous-credential — extended with Meta's ACS as the second canonical industrial instance after Cloudflare / Privacy Pass.
- concepts/threat-modeling — extended with the Meta Private-Processing instance: three actor classes, three named scenarios, first canonical wiki instance applied to confidential-computing + private-AI-inference.
- concepts/defense-in-depth — extended with the TEE-substrate + transparency-log axis; canonical wiki instance of defence-in-depth for private-AI-inference where each layer relies on a different trust root (CPU vendor / ledger operator / relay operator / external researchers).
- patterns/tee-for-private-ai-inference — the containing architectural pattern.
- patterns/third-party-ohttp-relay-for-unlinkability — the routing-layer pattern closing the targeted-host attack path.
- patterns/attestation-before-session-key-release — the session-gating realisation (RA-TLS).
- patterns/publish-binary-digest-ledger — the transparency-side companion to attestation.
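The attestation-before-session-key-release gate can be sketched as a ledger-membership check. Function names are hypothetical; in the real RA-TLS handshake this decision is bound into the TLS key schedule rather than made by application code:

```python
import hashlib

# Third-party-operated append-only ledger of acceptable binary digests.
LEDGER = set()

def publish(binary: bytes):
    """Transparency side: record a released binary's digest in the ledger."""
    LEDGER.add(hashlib.sha256(binary).hexdigest())

def release_session_key(attested_digest: str, key: bytes) -> bytes:
    """Client side: release the session key only if the attested digest is
    in the ledger, turning the log from audit evidence into enforcement."""
    if attested_digest not in LEDGER:
        raise PermissionError("attestation does not match ledger; abort handshake")
    return key

publish(b"approved-enclave-image")
good_digest = hashlib.sha256(b"approved-enclave-image").hexdigest()
session_key = release_session_key(good_digest, b"ephemeral-key")
```

A TEE running an unpublished binary can produce a valid hardware attestation, but the ledger check still refuses it the key, which is the enforcement step the pattern adds.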
Flash media tiering + QLC storage (2025-03-04 QLC post)¶
- systems/qlc-flash — Meta's new middle-tier NAND flash (4 bits per cell), positioned between HDD and TLC on the BW/TB / cost / endurance / density axes. Invented 2009; finally data-center-viable at Meta scale via 2 Tb dies + 32-die stacks. Deployed on read-BW-intensive + low-write-BW workloads where endurance matches and R/W asymmetry is manageable.
- systems/tlc-flash — Meta's incumbent data-center flash tier. Continues for write-heavy + latency-sensitive mixed workloads; QLC sits below it, not instead of it.
- systems/pure-storage-directflash-module — Pure Storage DFM + DirectFlash software (userspace FTL). Custom module scalable to 600 TB per drive; physically fits U.2-15mm slots alongside standard NVMe QLC. Canonical Meta × Pure Storage co-design instance on the flash-media axis.
- systems/u2-15mm-form-factor — Meta's chosen QLC form factor: industry-standard, scales to 512 TB, accepts DFMs. Wins over E1.S (too small for QLC package count) and E3 (4-variant fragmentation).
- systems/e1s-form-factor — Meta's incumbent TLC form factor. Great for TLC; rejected for QLC on volume grounds.
Format-aware compression (2025-10-06 OpenZL post)¶
- systems/openzl — OpenZL, Meta's open-source format-aware lossless compression framework (released 2025-10-06). Takes structured data (tabular / columnar / numeric arrays / timeseries / ML tensors / database pages) + an explicit shape description, trains a Plan via a budgeted search over transform choices, resolves the Plan into a concrete per-frame Resolved Graph embedded in the frame, and decodes with a single universal decoder binary. Silesia `sao` headline: ×2.06 ratio at 340 MB/s compression + 1,200 MB/s decompression, beating xz -9 on both axes and beating zstd -3 on ratio at comparable speed.
- systems/zstandard-zstd — Zstandard, Meta's 2016 general-purpose compressor. The lineage OpenZL is the architectural successor to for structured data, the baseline against which OpenZL is measured, the fallback engine when OpenZL can't find structure to exploit, and the original integration target for Managed Compression.
- systems/openzl-sddl — Simple Data Description Language. The declarative way users tell OpenZL what shape their bytes have (rows / columns / enums / nested records). Parser-equivalent alternative is a registered parser function.
- systems/managed-compression-meta — Managed Compression, Meta's internal runtime originally built (2018) to automate zstd dictionary training + rollout. In 2025 extended to OpenZL Plans: register use case → monitor → sample production data → periodically re-train → roll out new Plans "like any other config change." Old frames keep decoding; new frames benefit.
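The value of format-awareness can be demonstrated with a toy plan: tell the compressor the bytes are an int64 timeseries, delta-transform, then hand the residue to a generic backend. Here zlib stands in for the backend; OpenZL's real transform search and frame format are far richer:

```python
import struct
import zlib

# 1,000 timestamps at a fixed cadence: structured, highly regular data.
timestamps = [1_700_000_000 + 5 * i for i in range(1000)]
raw = struct.pack("<1000q", *timestamps)

# Generic compressor sees opaque bytes.
generic = zlib.compress(raw, 9)

# Format-aware plan: knowing the shape ("int64 timeseries"), delta-encode
# first; the residue collapses to a near-constant stream.
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
shaped = zlib.compress(struct.pack("<1000q", *deltas), 9)
```

The delta transform is invertible (a prefix sum restores the original values), so the scheme stays lossless; the per-frame Resolved Graph is what lets OpenZL's single universal decoder know which inverse transforms to run.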
Open AI hardware + OCP 2024 (2024-10-15 OCP AI-hardware-vision post)¶
- systems/catalina-rack — Meta's next-generation AI rack built on the ORv3 HPR chassis (up to 140 kW), liquid-cooled, hosting NVIDIA GB200 Grace Blackwell. Components: power shelf + compute tray + switch tray + ORv3 HPR + Wedge 400 fabric switch + management switch + BBU + rack management controller. Successor to the air-cooled Grand Teton H100 platform; OCP-contributed.
- systems/orv3-rack — the Open Rack v3 high-power-rack variant introduced via Catalina, specified for up to 140 kW per rack — more than an order of magnitude beyond typical air-cooled OCP rack power envelopes.
- systems/nvidia-gb200-grace-blackwell — NVIDIA's Blackwell-generation Grace+Blackwell Superchip; the silicon Catalina is designed around.
- systems/amd-instinct-mi300x — AMD's flagship data-center GPU, now supported on Grand Teton via a new OCP-contributed platform variant positioned for "large-scale AI inference workloads."
- systems/meta-dsf-disaggregated-scheduled-fabric — Meta's Disaggregated Scheduled Fabric (DSF): open vendor-agnostic AI networking backend, powered by OCP-SAI + FBOSS + Ethernet-RoCE, with multi-vendor endpoint support (NVIDIA + Broadcom + AMD named). Overcomes scale / component-supply / power-density limits of Meta's prior vertically-integrated switches.
- systems/fboss-meta-network-os — FBOSS, Meta's open-source network operating system (2018) for controlling switches; control-plane substrate for DSF.
- systems/ocp-sai — Switch Abstraction Interface, the vendor-agnostic switch-ASIC API standard Meta + Microsoft co-developed for OCP in 2018; foundational abstraction for DSF.
- systems/fbnic — Meta's first in-house network ASIC (as a NIC module), shared with OCP; silicon response to the projected ~1 TB/s-per-accelerator injection-bandwidth regime.
- systems/mount-diablo-power-rack — Meta × Microsoft disaggregated 400 VDC power rack contributed to OCP; allows more AI accelerators per IT rack.
- systems/meta-wedge-400 — Meta's OCP-contributed fabric switch (2021), a component of the Catalina rack assembly.
- systems/oam-open-accelerator-module — the OCP accelerator-module standard that makes Grand Teton's NVIDIA+AMD multi-vendor support feasible; part of the Meta-Microsoft OCP lineage.
- systems/llama-3-1 — Llama 3.1 405B training disclosed at > 16,000 H100 GPUs; re-anchors the 2024-06 two-24K-GPU-cluster substrate as the predecessor scale.
Privacy + information flow control (2024-08-31 PAI post)¶
- systems/meta-privacy-aware-infrastructure — Meta's Privacy Aware Infrastructure (PAI) umbrella initiative embedding first-class privacy constructs into the software stack. Publicly announced January 2024; technically detailed at PEPR 2024 + the 2024-08-31 Meta Engineering post. PAI is the multi-year engineering + tooling investment; Policy Zones is the enforcement primitive underneath it.
- systems/meta-policy-zones — Meta's information flow control (IFC) runtime technology. Encapsulates + evaluates + propagates privacy constraints for data in transit and at rest at runtime. Integrated across HHVM (function-based: call-tree zone propagation), Presto, and Spark (batch: per-SQL-job zones with table/column/row annotations). Implemented in PAI runtime libraries in Hack / C++ / Python. Performance engineered for "10× improvements in computational efficiency" via simplified policy-lattice representation + native language-level context propagation + canonicalized policy annotation structures. Rollout discipline: logging mode → enforcement mode.
- systems/meta-policy-zone-manager — Policy Zone Manager (PZM), the UX + automation suite that makes Policy Zones adoption tractable across Meta's polyglot multi-thousand-engineer codebase. Four-step workflow: identify assets → discover downstream flows (via lineage) → remediate violations (three cases: safe / unsafe / reclassified) → continuously enforce + monitor via verifiers. The lesson "Build tools; they are required" is directly about PZM: it reduces engineering effort for purpose-limitation rollout "by orders of magnitude."
- systems/meta-data-classifier — Meta's ML-based data classifier (first published 2020-07-21) that automatically identifies sensitive data assets at scale. Named in the 2024-08-31 post as the Step-1 auto-discovery input to PZM: "we heavily rely on various techniques such as our scalable ML-based classifier to automatically identify data assets."
- systems/hhvm — Meta's HipHop Virtual Machine (runs Hack + PHP); Meta's canonical function-based system where Policy Zones is integrated at the call-tree / request-zone granularity. Hack named as one of the three host languages (alongside C++, Python) for PAI runtime libraries.
Incident response + investigation tooling (2024-08-23 RCA post)¶
- systems/meta-rca-system — Meta's AI-assisted root-cause analysis system for web-monorepo investigations. Two-stage architecture: heuristic retriever (code + directory ownership + runtime code-graph) narrows thousands of recent changes to a few hundred; a fine-tuned Llama 2 (7B) ranks via ranking-via-election (B=20, K=5) to a top-5. 42% top-5 accuracy at investigation-creation time on backtested historical investigations. Built with CPT on internal wikis/Q&As/code + mixed SFT with a dedicated 5,000-example RCA-SFT set + a second SFT round teaching logprob-rankable ordered-list output.
- systems/hawkeye-meta — Meta's predecessor AI investigation tool (December 2023) for end-to-end ML-workflow debugging. Stub on this wiki; named as the prior generation in Meta's investigation-tooling lineage.
- systems/llama-2 — Meta's July-2023 open-weight foundation model family. The 7B variant is the base of the RCA ranker; predecessor of systems/llama-3 / systems/llama-3-1.
Hyperscale benchmarking + hardware co-design (2024-08-05 DCPerf post)¶
- systems/dcperf — Meta's open-source benchmark suite for hyperscale compute workloads. Each benchmark anchored to a real Meta production application. Representativeness validated at microarchitectural level (IPC + core-frequency comparison vs production apps and SPEC CPU). Multi-ISA (x86 + ARM), extended for chiplet-based architectures and multi-tenancy / rising core counts. Used internally alongside SPEC CPU. github.com/facebookresearch/DCPerf.
- systems/spec-cpu — the industry-standard incumbent DCPerf supplements. Meta's published IPC + frequency graphs are direct evidence of benchmark bias for hyperscale workloads — SPEC CPU under-represents the microarchitectural behaviour of Meta production applications. DCPerf does not replace SPEC CPU; they're used together.
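The IPC comparison DCPerf uses as its representativeness check is plain counter arithmetic. A minimal sketch with illustrative counter readings (not Meta's published data): a benchmark is a credible proxy when its instructions-per-cycle tracks the production application's.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: the microarchitectural signature compared
    between a benchmark and the production workload it is meant to model."""
    return instructions / cycles

# Illustrative numbers only: close IPC suggests the benchmark exercises the
# machine the way the production app does; a large gap is benchmark bias.
prod_ipc = ipc(instructions=1_200_000_000, cycles=1_500_000_000)   # 0.80
bench_ipc = ipc(instructions=790_000_000, cycles=1_000_000_000)    # 0.79
relative_error = abs(bench_ipc - prod_ipc) / prod_ipc
```

This is the shape of the published IPC + frequency graphs: per-workload counter ratios, compared side by side against SPEC CPU's.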
GenAI training substrate (2024-06-12 post)¶
- systems/grand-teton — Meta's open-sourced AI training server platform, modified in 2024 to host H100 at 700 W TDP with HBM3, retained air cooling because data-center cooling infrastructure could not change in time.
- systems/meta-genai-cluster-roce — one of two 24,000-GPU H100 clusters. Uses RoCE as the inter-node fabric; optimised for fast build time. The largest Llama 3 model was trained on this cluster.
- systems/meta-genai-cluster-infiniband — the sibling 24K-GPU H100 cluster on InfiniBand. Evolved from Meta's 16K-GPU AI Research SuperCluster into a production-integrated build; optimised for full-bisection bandwidth.
- systems/roce-rdma-over-converged-ethernet — the Ethernet-native RDMA fabric Meta scaled 4K → 24K GPUs for production AI. SIGCOMM 2024 paper adds: DCQCN off at 400G, PFC+admission instead; E-ECMP+QP scaling gains 40% on AllReduce.
- systems/ai-zone — Meta's two-stage Clos template (RTSW leaf + CTSW deep-buffer spine + optional ATSW aggregator); non-blocking inside the Zone, oversubscribed across Zones, topology-aware scheduler compensates.
- systems/llama-3 — trained on both 24K-GPU clusters in parallel.
- systems/llama-3-1 — Llama 3.1 (incl. 405B) trained on the same substrate; flagship workload of the 2024-08-05 SIGCOMM paper.
Fleet maintenance + production engineering (2024-06-16 post)¶
- systems/opsplanner — Meta's unified disruptive-work orchestrator; single choke point for every maintenance operation across the fleet. Handles ~1 million operations per day. Owns overlap serialisation, safe drain/return to production, handover flow (avoiding overlaps and deadlocks), pre-return verification, and the planned-maintenance + failure buffers. Used for all Meta capacity — compute and storage — not only AI.
Data warehouse + query + scheduling¶
- systems/presto — distributed SQL query engine, still actively operated at Meta-scale across "tens of thousands of machines". The open-source Presto/Trino split (2020) did not retire Meta's internal deployment.
- systems/meta-presto-gateway — Meta's internal Gateway fronting every Presto query. Hardened with per-dimension throttling and autoscaling after early outages from unintended DDoS-style internal traffic.
- systems/meta-data-warehouse — the multi-datacenter data lakehouse Presto serves; its hardware-provisioning pipeline drives automated Presto cluster standup/decommission.
- systems/tupperware — Meta's container/cluster management system (named as an integration hook for Presto cluster turn-up automation).
- systems/meta-ptp — Precision Time Protocol deployment at Meta, surfaced via the High Scalability Dec-2022 roundup.
- systems/owl-content-distribution — Meta's 800 PB/day centralized-control peer-to-peer content distribution, surfaced via High Scalability.
Real-time communication + audio (2024-06-13 post)¶
- systems/mlow-codec — MLow (Meta Low Bitrate), Meta's classical-DSP RTC audio codec. Built on CELP with split-band encoding and range-coded output. 2× POLQA MOS over Opus at 6 kbps, 10% lower compute, SuperWideBand at low bitrate, inband FEC viable at 14 kbps (Opus floor: 19 kbps). Fully shipped on Instagram + Messenger; rolling out on WhatsApp. End-to-end-encrypted. Development timeline: late 2021 → mid-2024.
- systems/opus-codec — the 2012 open-source codec Meta used for all RTC before MLow. Reference benchmark point. NarrowBand at its 6 kbps floor.
- systems/meta-encodec — Meta's ML-based audio codec (FAIR, October 2022). High quality at low bitrate but "only the very high-end (expensive) mobile handsets are able to run these codecs reliably". The canonical counterexample Meta chose not to ship for RTC in favour of MLow.
Source control + monorepo VCS (2024-09-10 Sapling post)¶
- systems/sapling-scm — Meta's source control system (client open-sourced 2022-11-15, announcement post ingested 2024-09-10). Mercurial-lineage, 10-year internal development, scales to "tens of millions of files, tens of millions of commits, and tens of millions of branches." Not a Git fork, though the open-source client speaks Git. sapling-scm.com.
- systems/sapling-smartlog — the `sl` default view. Load-bearing UX primitive; hides irrelevant history behind a dashed line, shows local commits + relevant remotes. Interactive web UI via `sl web`.
- systems/meta-segmented-changelog — Sapling's commit-graph-shape index, downloaded in "just a few megabytes", enabling O(log n) `log`/`blame` via segment-graph bisection even on Git-backed repos. Paired with lazy history download for monorepo-scale VCS.
- systems/sapling-virtual-fs — Sapling's virtual file system for working-copy scale. Not yet open-sourced as of 2022-11-15. Presents the full repo shape; fetches files lazily on first access; prefetches per-project.
- systems/sapling-server — the Rust-implemented Sapling-compatible server. Not yet open-sourced. Substrate for the server-dependent scale features: lazy history download, per-file history graphs, VFS, Commit Cloud, and future Sapling-served Git repos.
- systems/commit-cloud-meta — Meta's commit-cloud preview: all commits auto-uploaded; sharing via commit hash + `sl goto HASH`. Server-dependent, not yet open-sourced.
- systems/reviewstack — demo stack-oriented code-review UI for GitHub pull requests at reviewstack.dev. Critique of GitHub's non-stack PR review model.
- systems/mercurial — Sapling's open-source ancestor; Sapling started as a Mercurial extension and credits Mercurial's Evolve extension as direct inspiration for mutation history tracking.
- systems/watchman — Meta's open-source file-system monitor. Used by Sapling's `sl status` to avoid full working-copy scans when the VFS isn't deployed.
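The segmented-changelog idea above, indexing graph shape so history queries touch O(log n) segments instead of every commit, can be gestured at with a toy sketch. This is illustrative Python, not Sapling's actual data structure (the real index handles general DAGs, not a linear chain): cover a linear chain of commits with aligned power-of-two segments, so a million commits collapse to a handful of index entries.

```python
def segments(n: int):
    """Cover commits [0, n) with aligned power-of-two segments (toy model)."""
    out, start = [], 0
    while start < n:
        size = 1
        # grow the segment while it stays inside [0, n) and stays aligned
        while size * 2 <= n - start and start % (size * 2) == 0:
            size *= 2
        out.append((start, start + size - 1))
        start += size
    return out

segs = segments(1_000_000)
# 1M commits -> 7 segments: why a shape index can be "just a few megabytes"
# and why log/blame can bisect segments rather than walk commits.
```

The point of the toy: queries over ranges of history inspect segment boundaries, not individual commits, which is the O(log n) claim in the entry above.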
Developer tools + code indexing (2024-12-19 Glean post)¶
- systems/glean — Meta's open-source code-indexing system (open-sourced August 2021). Collects + stores + queries structured facts about source code; one shared, centralized, network-queryable index powers code browsing, code search, auto-generated docs, code review, IDE augmentation, dead-code detection, API-migration tracking, test selection, automated data removal, and RAG in AI coding assistants. Storage: RocksDB. Indexing + query service both distributed; databases replicated across query service machines + centrally backed up. Canonical wiki instance of centralized ahead-of-time indexing. glean.software · github.com/facebookincubator/Glean.
- systems/angle-query-language — Glean's declarative logic-based query language (anagram of Glean, "to fish"). Predicates ≈ SQL tables; facts ≈ rows; derived predicates = SQL-view analogue. Schema-level derivation is the mechanism behind language-neutral schema abstraction — language-specific facts underneath, cross-language views projected over them in the schema itself. Prefix-indexed queries over predicate-declared field order; transitive-closure queries (e.g. C++ `#include` fanout) expressed as fixpoint queries. Published latencies: ~1 ms for name+namespace lookup; few-ms first-results on inheritance-chain queries with incremental streaming of the rest.
- systems/glass-symbol-server — Glean's symbol server; one-API-call navigation surface (`documentSymbols(repo, path, revision)`) behind which language-specific schemas live. Owns per-language symbol IDs that stay stable under code motion. Drives Meta's code browser (embedded Monaco editor), Phabricator review-time navigation, Find References, Call Hierarchy, and symbol-URL-stable documentation rendering. Open source at glean/glass.
- systems/rocksdb — Meta's open-source LSM-tree-based embedded key-value store (rocksdb.org); fact-storage substrate for Glean. Immutable SSTable grain aligns naturally with Glean's stacked-immutable-database layering.
- systems/lsif — Language Server Index Format, the Microsoft-led LSP-ecosystem code-indexing incumbent Glean deliberately generalises past: Glean is not tied to any one language or any one use case, where LSIF's feature set is shaped by LSP operations.
- systems/monaco-editor — Microsoft's VS Code editor as an embeddable component; Meta's internal code browser uses Monaco as its surface, calling Glass for outline + nav rendering.
- systems/phabricator — Meta's code-review tool; review-time accurate go-to-definition + type-on-hover + documentation on the diff itself is driven by Glean diff sketches and Glass APIs, covering C++, Python, PHP, JavaScript, Rust, Erlang, Thrift, Haskell.
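The predicates-as-tables / derived-predicates-as-views analogy above can be made concrete with a toy Python model. Hypothetical data and function names (loosely echoing Glean's cross-language schema idea, not its API): language-specific fact tables underneath, a language-neutral "view" projected over them.

```python
# Facts ≈ rows grouped under per-language predicates (toy data).
cxx_declarations = [
    {"name": "open", "file": "io.cpp", "line": 10},
    {"name": "close", "file": "io.cpp", "line": 42},
]
python_declarations = [
    {"name": "open", "file": "io.py", "line": 3},
]

def entity_by_name(name: str):
    """Derived predicate: a language-neutral view computed over the
    language-specific fact tables, the SQL-view analogue above."""
    for fact in cxx_declarations + python_declarations:
        if fact["name"] == name:
            yield fact

hits = list(entity_by_name("open"))   # one query, answers across languages
```

In real Glean the derivation lives in the schema itself and queries are prefix-indexed over declared field order; the toy only shows the projection idea.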
Anti-abuse rule engine (2015-06-26 Haskell post)¶
- systems/sigma-meta — Sigma, Meta's in-path rule engine for proactively detecting spam, phishing, and malware on Facebook. Every user interaction is evaluated against a set of policies specific to that interaction type; "bad content detected by Sigma is removed automatically so that it doesn't show up in your News Feed." Post-2015-rewrite throughput: >1M rps. Architecture: Haskell sandwiched between two layers of C++ — C++ Thrift server above; C++ service clients below wrapped as Haxl data sources via FFI. Policies are continuously deployed from the repository; "the source code in the repository is the code running in Sigma."
- systems/haxl — Meta's open-source Haskell framework for implicit concurrent data fetching. Automatically batches calls to the same data source and overlaps calls to distinct data sources without the programmer writing explicit concurrency constructs. GitHub: facebook/Haxl. ICFP 2014 paper: There is no fork: an abstraction for efficient, concurrent, and concise data access.
- systems/ghc — the Glasgow Haskell Compiler + runtime, with Meta-contributed features: Applicative do-notation (the compiler half of Haxl's rearrangement); heap-management changes reducing GC frequency on multicore; per-thread allocation limits via asynchronous exceptions; GC changes to detect unreferenced old code for safe hot-swap unload; a finalizer fix for clean shutdowns; a GC crash fix for a bug "gone undetected in GHC for several years."
- systems/haskell — the language Meta standardised on for Sigma policy authoring. Purely functional + strongly typed + mature optimising compiler (GHC) + interactive environment (GHCi) + rich ecosystem. Canonical wiki anchor.
- systems/stackage — the curated Haskell package set Meta moved to after direct Cabal/Hackage use produced cascading version yak-shaves. Canonical wiki datum on curated-set ecosystems vs. SemVer-resolved ecosystems: author-declared SemVer constraints are not a substitute for an externally-curated compatibility matrix at large-dependency-graph scale.
- systems/fxl-meta — FXL, the retired in-house Facebook DSL Haskell replaced at Sigma. Interpreted (slow); lacked user-defined types + modules. Cautionary datum: in-house DSLs stop paying their cost when complexity growth outruns expressivity and interpreter performance caps hardware utilisation — both conditions together justified migration.
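Haxl's core trick, batching same-source requests issued in the same round without explicit concurrency code, can be sketched outside Haskell. A minimal Python model with a hypothetical data source (not the Haxl API, which achieves this implicitly through its Applicative structure):

```python
class BatchedSource:
    """Collects keys requested in a round, then issues one bulk fetch."""
    def __init__(self, bulk_fetch):
        self.bulk_fetch = bulk_fetch   # one backend call serving many keys
        self.pending = set()
        self.cache = {}

    def want(self, key):
        if key not in self.cache:
            self.pending.add(key)      # straight-line code, no futures

    def flush(self):
        if self.pending:               # one batched round-trip, not N
            self.cache.update(self.bulk_fetch(sorted(self.pending)))
            self.pending.clear()

calls = []
def bulk_friend_count(keys):           # stand-in data source (hypothetical)
    calls.append(list(keys))           # record physical backend calls
    return {k: len(k) for k in keys}

src = BatchedSource(bulk_friend_count)
for user in ["alice", "bob", "alice"]:
    src.want(user)                     # three logical requests
src.flush()                            # exactly one physical fetch
```

Haxl additionally overlaps rounds across distinct sources and rewrites `do`-blocks applicatively so the programmer never writes `want`/`flush` at all; the sketch shows only the batching invariant.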
Recommendation + social-discovery systems (2026-03-18 Friend Bubbles post)¶
- systems/facebook-reels — Meta's short-form vertical-video surface inside the Facebook app. The surface hosting Friend Bubbles. Engineered as a performance-sensitive surface with three nonnegotiable client constraints (smooth scrolling, no load-latency regression, low CPU overhead). Has an existing optimised video-prefetch window that new per-video metadata piggybacks on (concepts/prefetch-window-metadata-coattending). "First short-form-video recommendation-surface source" on the wiki.
- systems/meta-friend-bubbles — the recommendation-system component surfacing Reels a viewer's friends interacted with, rendered as tappable avatar bubbles. Three architectural layers: viewer-friend closeness (two complementary ML models — survey-trained weekly-trillions + context-specific bubble-click), retrieval-ranking funnel modifications (friend-interacted content explicitly retrieved + new features/tasks in both early- and late-stage MTML rankers with a conditional-probability objective `P(engage | bubble impression)`), and client-side integration (prefetch-window co-attending + conditional animation + conservative prevalence gating). Continuous feedback loop re-trains on bubble-interaction data. "First recommendation-system canonical instance" on the wiki paired with the LLM-ranker sibling systems/meta-rca-system.
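The conditional-probability objective and the prevalence gate above reduce to simple arithmetic. A toy sketch with hypothetical counts and a hypothetical threshold (not Meta's numbers): the ranker scores engagement given that a bubble was actually shown, and a gate decides whether to render at all.

```python
# Hypothetical telemetry: impressions where a bubble was rendered, and
# engagements observed conditional on that impression.
bubble_impressions = 10_000
engagements_given_bubble = 900

# The objective is P(engage | bubble impression), not unconditional
# engagement, so the model is trained/evaluated on the shown-bubble slice.
p_engage_given_bubble = engagements_given_bubble / bubble_impressions

SHOW_THRESHOLD = 0.05          # hypothetical "conservative prevalence" gate
show_bubble = p_engage_given_bubble >= SHOW_THRESHOLD
```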
Key operational patterns surfaced at Meta¶
Foundational-OSS stewardship patterns (2026-03-02 jemalloc post)¶
- patterns/stewardship-reset-for-foundational-oss — new canonical pattern. When a highly-leveraged foundational OSS project has effectively single-vendor stewardship, the stewarding org's short-term product incentives can corrode long-term engineering principles without external correction signal. Meta's 2026 jemalloc post is the canonical wiki instance: (1) admit the drift publicly, (2) re-engage the community on the record (including founder Jason Evans), (3) restore visible community infrastructure (unarchive the repository), (4) publish a forward-looking technical roadmap. Reset evaluated over years of shipped work, not at announcement time.
- patterns/upstream-the-fix extended — eighth canonical instance, the upstream-steward-itself variant. The first seven instances (Cloudflare × 2, Datadog, Fly.io × 2, Cloudflare arm64, Meta FFmpeg) are downstream-consumer cases: "I found a bug in an ecosystem primitive; I fix it upstream." Meta's jemalloc reset is the dual: "I am the upstream and I have drifted; here is how I return to discipline." Consumer-side and steward-side together form the two-sided discipline of foundational-OSS maintenance.
- concepts/huge-page-allocator — new concept. The allocator subsystem responsible for serving allocations backed by huge pages (2 MiB+) rather than 4 KiB base pages. Targets TLB-miss reduction at hyperscale. jemalloc's HPA is the canonical instance on the wiki.
- concepts/transparent-huge-pages — new concept. Linux kernel feature (since 2.6.38) that transparently promotes contiguous 4 KiB pages to 2 MiB pages without requiring `hugetlbfs`. The zero-config path to huge-page CPU efficiency; jemalloc's HPA primarily targets THP.
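An allocator opts a region into THP with a `madvise` hint; the same hint is reachable from Python's `mmap` module since 3.8. A minimal sketch: `mmap.MADV_HUGEPAGE` exists only on Linux, so the hint is guarded and the code degrades to base 4 KiB pages elsewhere.

```python
import mmap

length = 4 * 1024 * 1024                 # two 2 MiB huge pages' worth
buf = mmap.mmap(-1, length)              # anonymous private mapping
if hasattr(mmap, "MADV_HUGEPAGE"):       # Linux-only THP hint
    buf.madvise(mmap.MADV_HUGEPAGE)      # ask the kernel for 2 MiB backing
buf[:5] = b"hello"                       # first touch faults pages in
roundtrip = bytes(buf[:5])
buf.close()
```

A huge-page allocator like jemalloc's HPA does much more (alignment, hugification pacing, demotion), but this is the kernel interface underneath the concept.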
WhatsApp Rust-at-scale + media-defense patterns (2026-01-27 Rust-at-scale post)¶
- patterns/parallel-rewrite-with-differential-testing — canonical wiki pattern for large C/C++→Rust rewrites of well-specified cross-platform libraries. New implementation runs alongside the old; differential fuzzing enforces parity; staged cutover once parity + memory/performance advantages are demonstrated. Sibling to — but distinct from — patterns/ai-driven-framework-rewrite (Cloudflare vinext; external-API oracle), patterns/rust-replacement-of-dynamic-language-hot-path (Cloudflare FL1→FL2; dynamic-to-static axis), and patterns/language-rewrite-for-concurrency (Dropbox Feast → Go, DSQL JVM → Rust; concurrency-model axis).
- patterns/memory-safe-language-for-untrusted-input — the design rule: any code path that processes untrusted input automatically (no user step) should be written in a memory-safe language. wamedia is the canonical client-side instance; Aurora DSQL extensions + Dropbox Nucleus are the server-side + sync-engine siblings.
- patterns/format-aware-malware-check-before-os-handoff — canonical wiki pattern for Stagefright-class ungovernable-OS-library mitigations. App validates format + spoof + risk + dangerous-type before handing bytes to the OS parser it cannot patch.
- patterns/bug-bounty-research-proxy — vendor publishes a research-facilitating tool (here: the WhatsApp Research Proxy) to lower the barrier for external protocol research while concentrating that research at a controlled endpoint.
- concepts/memory-safety extended — the client-side cross-platform library instance, complementing the server-side (DSQL), managed-runtime (Datadog Go), and sync-engine (Nucleus) instances already on the wiki.
- concepts/parser-differential extended — media-file / OS-library variant of the attack class, plus the canonical app-layer defensive posture ("one parser in front of many ungovernable parsers, reject divergent inputs"). Complements the ruby-saml "one parser for security boundaries" posture.
- concepts/differential-fuzzing — canonical wiki first instance (new concept page).
- concepts/attack-surface-minimization — Meta's first-of-three pillars (verbatim: "Design the product to minimize unnecessary attack surface exposure"), canonicalised as a new wiki concept.
- concepts/os-library-vulnerability-ungovernable — new concept; the architectural forcing function (OS libraries are outside the app's patch authority).
- concepts/patch-lag — new concept; user-side delay between fix release and installed-base update (measured in months for mobile OS libraries).
- concepts/format-conformance-check — new concept; primitive check family inside Kaleidoscope.
- concepts/file-type-spoofing — new concept; sibling check family inside Kaleidoscope.
- concepts/cross-platform-client-library — new concept; wamedia's design axis. Captures the build-system + binary-size tax that cross-platform Rust on mobile pays.
- concepts/defense-in-depth extended — the client-side / media-processing / ungovernable-downstream-parser variant.
- concepts/binary-size-bloat extended — the mobile-distribution-channel-constraint variant. Counter-intuitive source-code result (LoC inverted) paired with the real binary-size tax of Rust's stdlib on mobile.
Flash media tiering + QLC software stack (2025-03-04 QLC post)¶
- patterns/middle-tier-storage-media — canonical Meta instance: insert QLC between HDD and TLC when the lower tier's BW/TB has fallen below workload needs and the upper tier is overpaid for the gap band. Discipline: match target workload shape (read-BW-intensive, batch IO, low-write) to the new media's strengths; accept non-TCO-parity at launch if density + power-efficiency justify it.
- patterns/userspace-ftl-via-io-uring — ublk + io_uring + userspace FTL stack. Pure Storage's DirectFlash software is the canonical 2025 instance; the pattern generalises to any vendor-specific flash management that benefits from host-side visibility + control. For standard NVMe QLC SSDs the stack simplifies to io_uring directly against the NVMe block device.
- patterns/rate-controller-for-asymmetric-media — software-side arbitration for media with asymmetric R/W throughput. QLC's 4×+ read-vs-write gap forces host-side scheduling of writes so latency-sensitive reads don't queue behind them. Composes with userspace-FTL because that's where the scheduler has full pending-write visibility.
- concepts/bandwidth-per-terabyte — the organising axis for Meta's three-tier HDD/QLC/TLC hierarchy. Canonical wiki first instance with explicit bands (HDD ~5-10, QLC 10-20, TLC 50+ MB/s/TB).
- concepts/storage-media-tiering — the general tier-structure concept. Meta's QLC post is the canonical wiki instance of inserting a new media tier via density + endurance + workload-match.
- concepts/qlc-read-write-asymmetry — the media-level property that makes rate-control load-bearing.
- concepts/write-endurance-nand — the historical QLC blocker, now managed via workload-matching.
- patterns/co-design-with-ocp-partners (extended) — Meta × Pure Storage is a new bilateral co-design partnership on flash media, extending the Meta × Microsoft OCP lineage to a third orthogonal hardware subsystem (power → networking → GPU → flash).
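The bandwidth-per-terabyte axis above is a one-line ratio; the tier bands are the interesting part. A sketch using the post's published band edges, with illustrative device numbers (not Meta's fleet data):

```python
def mb_per_s_per_tb(bandwidth_mb_s: float, capacity_tb: float) -> float:
    """The organising axis: deliverable read bandwidth per stored terabyte."""
    return bandwidth_mb_s / capacity_tb

def tier_for(bw_tb: float) -> str:
    """Map a workload's needed BW/TB onto the three-tier hierarchy."""
    if bw_tb >= 50:
        return "TLC"     # 50+ MB/s/TB
    if bw_tb >= 10:
        return "QLC"     # 10-20 MB/s/TB band: the inserted middle tier
    return "HDD"         # ~5-10 MB/s/TB

# Illustrative devices: a 24 TB HDD at ~180 MB/s sits in the HDD band;
# a dense 128 TB QLC device at ~2 GB/s lands squarely in the middle band.
hdd_band = tier_for(mb_per_s_per_tb(180, 24))      # 7.5 MB/s/TB
qlc_band = tier_for(mb_per_s_per_tb(2000, 128))    # ~15.6 MB/s/TB
```

The pattern's discipline follows directly: when a workload's required BW/TB falls between the HDD band and the TLC band, neither incumbent tier is priced for it, which is the opening for QLC.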
Format-aware compression patterns (2025-10-06 OpenZL post)¶
- patterns/offline-train-online-resolve-compression — canonical wiki pattern for format-aware compression: offline trainer consumes shape description + sample data → emits a Plan (possibly a Pareto set) → encoder resolves Plan to concrete Resolved Graph per frame → universal decoder reads Resolved Graph and executes. Training cost amortised over many frames; Plans are first-class config objects.
- patterns/embedded-decode-recipe-in-frame — ship the Resolved Graph inside the frame so each frame is self-contained and any decoder instance can decode it without out-of-band config. The substrate for the universal decoder property.
- patterns/fallback-to-general-purpose-compressor — safety net for inputs with no exploitable structure (pure text / unknown formats). OpenZL's worst case is zstd-equivalent, not worse, because the trainer can always select a Plan that reduces to "just run zstd." Canonical wiki reference. Sibling at higher abstraction level: concepts/llm-cascade (cheap-specialised-first, generalist-as-fallback).
- patterns/graceful-upgrade-via-monoversion-decoder — keep the decoder binary version-stable across Plan evolution. Old Plan frames decode on any decoder; new Plan frames decode on any decoder; a decoder update (SIMD / bounds / scheduling) improves every frame including historic ones. Rolls out new Plans without waiting on consumer fleets.
- concepts/format-aware-compression — the architectural category OpenZL defines. Canonical wiki concept.
- concepts/universal-decoder — the decoder-side invariant the entire OpenZL architecture is organised around. Four enumerated deployment benefits: single audit surface, fleet-wide improvements, operational clarity, continuous training. Canonical wiki concept.
- concepts/compression-plan — the config object that flows through the train→resolve→decode pipeline; replaces the scalar "compression level" knob with a serialised compressor graph.
- concepts/reversible-transform-sequence — the pre-entropy-coding pipeline of reversible transforms applied per the Plan.
- concepts/structure-of-arrays-decomposition — the canonical first transform (AoS → SoA) that makes per-field strategies possible.
- concepts/delta-encoding — transform picked for mostly-sorted numeric streams. Canonical wiki concept.
- concepts/tokenize-transform — transform picked for low-cardinality streams; emits dictionary + index list, each routed to its own subgraph. Canonical wiki concept.
- concepts/runtime-control-point-compression — per-frame branch points inside a Plan; read lightweight statistics, pick a subgraph, record the choice in the frame. Adapts within-Plan without re-training; zero decoder complexity.
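Two of the transforms above are small enough to show as runnable round-trips. Illustrative Python, not the OpenZL API: delta for mostly-sorted numerics, tokenize for low-cardinality streams; either output would then flow to the generic entropy stage, with plain zstd as the no-structure fallback.

```python
def delta_encode(xs):
    """Mostly-sorted numeric stream -> first value, then small deltas."""
    return xs[:1] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    out = []
    for d in ds:
        out.append(d if not out else out[-1] + d)
    return out

def tokenize(values):
    """Low-cardinality stream -> dictionary + index list; each output is a
    separate stream the Plan can route to its own subgraph."""
    alphabet = sorted(set(values))
    index = {v: i for i, v in enumerate(alphabet)}
    return alphabet, [index[v] for v in values]

timestamps = [1000, 1001, 1003, 1006, 1010]      # near-sorted: deltas shrink
deltas = delta_encode(timestamps)                # [1000, 1, 2, 3, 4]
assert delta_decode(deltas) == timestamps        # reversible, by construction

methods = ["GET", "PUT", "GET", "GET"]           # low cardinality
alphabet, idx = tokenize(methods)                # (["GET","PUT"], [0,1,0,0])
```

Reversibility is the invariant that makes the whole pipeline safe: every transform in a Plan must round-trip exactly, since the decoder replays the Resolved Graph in reverse.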
Open AI hardware + OCP patterns (2024-10-15 OCP AI-hardware-vision post)¶
- patterns/open-hardware-for-ai-scaling — Meta's thesis: AI scale requires the hardware layer to move at the pace of the software layer, which requires open-source contribution rather than vendor-locked designs. Canonical wiki instance.
- patterns/modular-rack-for-multi-accelerator — Grand Teton extended to NVIDIA H100 + AMD MI300X; Catalina extending the same principle to GB200 rack-scale. Preserved "single monolithic system design with fully integrated power, control, compute, and fabric interfaces" across accelerator vendors.
- patterns/co-design-with-ocp-partners — Meta × Microsoft OCP lineage: SAI (2018) → OAM → Mount Diablo (2024). Bilateral collaboration as the operating model for cross-industry hardware standardization.
- concepts/network-fabric-disaggregation — DSF as canonical wiki instance of the architectural stance of splitting a vertically-integrated fabric into open vendor-replaceable layers (ASIC / NOS / control-API / endpoint protocol / NICs).
- concepts/liquid-cooled-ai-rack — Catalina at 140 kW breaking Meta from the air-cooled constraint that defined Grand-Teton-H100 in 2024-06.
- concepts/injection-bandwidth-ai-cluster — Meta's forward projection: ~1 TB/s per accelerator injection bandwidth (> 10× today's networks).
- concepts/bisection-bandwidth — projected to scale in lockstep with injection bandwidth ("equal normalized").
- concepts/400-vdc-rack-power — Mount Diablo canonical instance of high-voltage DC power delivery at the rack level.
- concepts/rack-level-power-density — extended via Catalina's 140 kW envelope, at the hyperscaler-AI end of the wiki's power-density spectrum (opposite end from Dropbox's ~16 kW air-cooled storage rack).
Privacy + information flow control patterns (2024-08-31 PAI post)¶
- patterns/runtime-information-flow-enforcement — Meta's canonical statement of runtime IFC as the enforcement primitive for privacy constraints at hyperscale. Replaces point-checking controls + data lineage audit-and-ACL combo with runtime label-propagation + flow-rule evaluation integrated directly into HHVM / Presto / Spark. Canonical wiki reference.
- patterns/logging-mode-to-enforcement-mode-rollout — the rollout pattern Policy Zones codifies: detect and record violations in production without blocking, remediate, then flip to enforcement. A correctness-constraint rollout pattern distinct from traffic-axis staged rollout; sibling of patterns/data-driven-allowlist-monitoring-mode on the security axis.
- patterns/end-to-end-use-case-first — Meta's first "lesson learned": ship one concrete end-to-end use case through all integration targets before generalising the platform. Function-based-system design was too abstract for its first real large-scale customer; "refining the APIs and building missing operational support" was what made it work end-to-end. Reusable platform-bring-up discipline.
- patterns/separate-annotation-from-requirement — Meta's fourth "lesson learned": keep data annotations simple (labels only — `BANANA_DATA`) and express per-requirement flow rules separately. A monolithic annotation API broke under multi-requirement composition with "data annotation conflicts that were difficult to resolve." Reusable schema-design pattern for IFC / policy systems.
- concepts/information-flow-control — the classical primitive (Denning 1976) Meta adopts as its hyperscale privacy-enforcement primitive. First canonical industrial-IFC instance on the wiki.
- concepts/purpose-limitation — the privacy principle PAI currently enforces. "Data is only processed for explicitly stated purposes."
- concepts/point-checking-controls — the approach Meta outgrew: if statements + ACLs at the point of data access, requiring human audits and physical data separation.
- concepts/data-annotation — the metadata-label primitive that makes runtime IFC operational on real codebases and data systems; granularity from table/column/row/cell (batch) to parameter/variable/return-value (function-based).
- concepts/data-flow-violation — the runtime event Policy Zones detects; three remediation cases (safe / unsafe / reclassified) exposed via PZM.
- concepts/logging-vs-enforcement-mode — two-phase enforcement-severity primitive underneath the rollout pattern.
- concepts/policy-lattice — Denning's 1976 lattice model of security labels. Meta's representation + evaluation simplification is one lever for the "10× improvements in computational efficiency."
- concepts/shift-left-privacy — the named engineering stance: privacy enforced at code execution + developer workflow, not at audit.
- concepts/data-lineage — extended framing: discovery primitive (retained inside PZM Step 2) but explicitly rejected as a sufficient enforcement primitive at Meta scale.
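The bullets above can be made concrete with a minimal sketch of runtime IFC: labels propagate through computation, and per-requirement flow rules — kept separate from the annotation schema, per the separate-annotation-from-requirement pattern — are evaluated at sink boundaries. All names (`Labeled`, `FLOW_RULES`, the sink names) are hypothetical illustrations, not Meta's Policy Zones API.

```python
class Labeled:
    """A value carrying a set of data-annotation labels."""
    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def combine(self, other, op):
        # Label propagation: any derived value carries the union
        # of its inputs' labels.
        return Labeled(op(self.value, other.value), self.labels | other.labels)

# Flow rules live apart from the annotations: one rule per requirement,
# so adding a requirement never touches the annotation schema.
FLOW_RULES = {
    "ads_pipeline": lambda labels: "BANANA_DATA" not in labels,
    "internal_analytics": lambda labels: True,
}

def write_to_sink(sink, data):
    """Runtime flow-rule evaluation at the sink boundary."""
    if not FLOW_RULES[sink](data.labels):
        raise PermissionError(f"data-flow violation: {sorted(data.labels)} -> {sink}")
    return data.value

a = Labeled(10, {"BANANA_DATA"})
b = Labeled(5)
total = a.combine(b, lambda x, y: x + y)     # labels propagate: {"BANANA_DATA"}
write_to_sink("internal_analytics", total)   # admitted
# write_to_sink("ads_pipeline", total)       # would raise: purpose-limitation violation
```

The point of the sketch is the separation: adding a new requirement means adding one entry to `FLOW_RULES`, never re-annotating the data.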
Incident-response + AI-assisted RCA patterns (2024-08-23 RCA post)¶
- patterns/retrieve-then-rank-llm — canonical wiki pattern: cheap heuristic retriever narrows the population, LLM ranker produces the top-K. Meta's RCA variant: directory-ownership + runtime-code-graph retrieval → Llama-2 ranking-via-election → top-5. Canonical wiki reference.
- concepts/heuristic-retrieval — stage-1 retrieval via domain-rule-encoded narrowing (ownership metadata + runtime code-graph traversal). Cheap, interpretable, reproducible; depends on the monorepo substrate's structured ownership.
- concepts/llm-based-ranker — the architectural role a fine-tuned LLM plays in the two-stage cascade. Output modes: natural-language top-K (via ranking-via-election) + logprob-ranked list (via a dedicated SFT format).
- concepts/ranking-via-election — the tournament-style prompt-structure primitive for candidate sets larger than context window. Meta's B=20, K=5 configuration collapses ~few-hundred → 5 in O(log N) rounds. Canonical wiki reference.
- concepts/supervised-fine-tuning — the task-teaching stage of Meta's training pipeline. Two-round SFT: mixed-SFT (original SFT data + internal context + RCA SFT dataset) + a second SFT round producing logprob-ranked ordered-list output. First canonical wiki SFT page.
- patterns/closed-feedback-loop-ai-features — Meta's explicit safety discipline for employee-facing AI: reproducibility + explainability + feedback channel. Canonical wiki reference. "Responders can independently reproduce the results generated by our systems to validate their results."
- patterns/confidence-thresholded-ai-output — the precision-over-reach primitive that pairs with the closed-feedback-loop discipline. "Confidence measurement methodologies to detect low confidence answers and avoid recommending them to the users — sacrificing reach in favor of precision." Canonical wiki reference.
- concepts/automated-root-cause-analysis — the capability class. The 2023 Presto-analyzer framing (multi-source aggregation + rule-encoded heuristics + auto-remediation) and this 2024 LLM-ranker system are sibling realisations — same closed-feedback-loop posture + multi-source retrieval, different stage-2 reasoning substrate.
- concepts/continued-pretraining — extended with Meta's small-base-model / proprietary-corpus RCA variant. Complements eBay's 2025 e-Llama continued-pretraining at 1T tokens on Llama 3.1 70B.
- concepts/monorepo — the substrate whose structural affordance (structured ownership + code graph) makes heuristic retrieval tractable at scale; directly enables the RCA system's stage-1.
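The ranking-via-election primitive above is mechanical enough to sketch: batch the candidates, keep each batch's top-K, and repeat until at most K survive. The scorer here is a stand-in lambda for Meta's fine-tuned LLM ranker; `B=20, K=5` matches the configuration the post names.

```python
def rank_via_election(candidates, score, B=20, K=5):
    """Tournament-style ranking for candidate sets larger than one
    context window: split into batches of B, keep each batch's top-K,
    repeat until at most K survive (O(log N) rounds)."""
    pool = list(candidates)
    rounds = 0
    while len(pool) > K:
        survivors = []
        for i in range(0, len(pool), B):
            batch = pool[i:i + B]
            # Stand-in for the LLM ranker: any top-K selector over one batch.
            survivors.extend(sorted(batch, key=score, reverse=True)[:K])
        pool = survivors
        rounds += 1
    return pool, rounds

cands = list(range(300))   # ~few-hundred candidate code changes
top5, rounds = rank_via_election(cands, score=lambda c: c)
# 300 -> 75 -> 20 -> 5: three election rounds
```

With a perfect per-batch ranker the true top-K always survives each round, which is why the cascade collapses a few hundred candidates to five in logarithmically many rounds.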
Fleet maintenance + production engineering patterns (2024-06-16 post)¶
- patterns/maintenance-train — Meta's canonical fleet-maintenance primitive: cyclic small-batch drains of a maintenance domain, contract "all capacity minus one maintenance domain" up 24/7, bounded full-visit cycle time. Used for all Meta capacity — compute and storage, not only AI. Canonical wiki reference.
- patterns/gradual-rollout-layered-by-stack-depth — two-layer rollout discipline: pin the job-facing layer (CUDA library + container) consistent across the cluster; slide the lower-level layer (firmware/drivers/kernel/OS) gradually through sliding-window rollouts. Reboot-required lower-layer upgrades take hours and cannot realistically be lock-stepped at Meta scale; container-layer restarts are cheap.
- concepts/maintenance-domain — the sized unit of capacity drained per maintenance action. Sizing trade-off: smaller domain = smaller buffer cost vs larger domain = fewer interruptions. Meta tunes AI-capacity domains toward lower interruption rate because synchronised training jobs pay whole-job cost for any interruption.
- concepts/overlapping-rollouts — Meta's explicit acceptance that at hyperscale, rollouts cannot be serialised into single-version-state windows. Architectural response: enforce pairwise component compatibility rather than cluster-wide coherence. Structural flip from small-scale practice.
- concepts/host-consistency-sliding-upgrade — the two-layer discipline that makes overlapping rollouts safe for synchronised AI jobs. Upper layer (CUDA + job container) pinned cluster-wide; lower layer allowed to drift during rollout; compatibility matrix engineered per
(pinned-upper × in-flight-lower)pair; pre-return verification gate enforces consistency at return-to-service. - concepts/fleet-patching — the capability class. Meta's variant is internal-compute-fleet with capacity-guarantee contract — distinct from MongoDB Atlas's managed-database variant, which has customer-facing windows but no capacity floor.
- concepts/maintenance-window — the service-to-service contract. Meta's "contract with services that allows them to avoid maintenance-train interruptions, if possible" is the internal-service-to-service variant of the MongoDB-Atlas customer-to-vendor variant.
- concepts/bad-host-detection — "bad hosts are very bad" is one of the five demanding GPU-training properties Meta names. Synchronised-job cost model: one bad host damages the whole job, not just its own share (superlinear vs proportional cost in the stateless-serving variant).
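The maintenance-train contract above reduces to a simple invariant that a sketch makes visible: exactly one maintenance domain is drained at a time, so in-service capacity never drops below "all minus one domain." Domain names and the `upgrade` callback are illustrative, not Meta's tooling.

```python
def maintenance_train(domains, upgrade):
    """One full visit of the maintenance train: drain exactly one
    maintenance domain at a time, upgrade its hosts, and return it to
    service before draining the next. The capacity contract — all
    capacity minus one maintenance domain, up 24/7 — holds throughout."""
    log = []
    for name, hosts in domains.items():
        # Capacity floor while this domain is drained: everything else.
        in_service = sum(len(h) for d, h in domains.items() if d != name)
        log.append((name, in_service))
        for host in hosts:        # drained: safe to reboot, reimage, etc.
            upgrade(host)
    return log

fleet = {"domain-a": ["h1", "h2"], "domain-b": ["h3", "h4"], "domain-c": ["h5", "h6"]}
upgraded = []
visit_log = maintenance_train(fleet, upgraded.append)
```

The sizing trade-off the concepts bullet names falls straight out: a smaller domain lowers the `in_service` buffer cost per drain but means more drain events per full visit.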
Hyperscale benchmarking + hardware co-design patterns (2024-08-05 DCPerf post)¶
- patterns/workload-representative-benchmark-from-production — Meta's load-bearing design rule for DCPerf: "Each benchmark within DCPerf is designed by referencing a large application within Meta's production server fleet." Canonical wiki reference. Validated publicly via IPC + core-frequency comparison graphs. Hardware-evaluation sibling of the application-layer custom harness at Figma.
- patterns/pre-silicon-validation-partnership — Meta's two-year collaboration with "leading CPU vendors" running DCPerf on pre-silicon / early-silicon setups, yielding "optimizations in areas such as CPU core microarchitecture settings and SOC power management optimizations." Open-sourcing DCPerf expands the pattern beyond Meta↔vendor to industry↔vendor.
- concepts/hyperscale-compute-workload — canonical statement that hyperscale is a distinct workload market segment from HPC / traditional enterprise, with distinct microarchitectural behaviour. First wiki canonicalisation.
- concepts/benchmark-representativeness — measurable-property concept DCPerf optimises for: match production IPC + frequency distributions. Inverse of concepts/benchmark-methodology-bias — Meta's evidence that SPEC CPU exhibits workload-shape bias for hyperscale is the representativeness-comparison graph.
- concepts/benchmark-methodology-bias — extended by this post into the hyperscale-hardware-procurement axis (complementing the Cloudflare Workers iteration-level-noise axis). SPEC CPU is biased for hyperscale not because of a sampling error but because of its workload-origin population (HPC / enterprise).
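Benchmark representativeness as a measurable property can be sketched as a distance between a production metric histogram and the same histogram under the benchmark. The histograms and the total-variation metric here are illustrative assumptions; the post's actual evidence is IPC + core-frequency comparison graphs, not this specific statistic.

```python
def representativeness_gap(prod_hist, bench_hist):
    """Total-variation distance between a production metric histogram
    (e.g. IPC buckets) and the same histogram measured under a benchmark:
    0.0 = perfectly representative, 1.0 = completely disjoint."""
    keys = set(prod_hist) | set(bench_hist)
    p_total = sum(prod_hist.values())
    b_total = sum(bench_hist.values())
    return 0.5 * sum(abs(prod_hist.get(k, 0) / p_total - bench_hist.get(k, 0) / b_total)
                     for k in keys)

# Hypothetical IPC-bucket histograms: a production-derived suite tracks the
# fleet; an HPC-origin suite skews toward high IPC relative to hyperscale.
production  = {"0.5-1.0": 40, "1.0-1.5": 45, "1.5-2.0": 15}
dcperf_like = {"0.5-1.0": 38, "1.0-1.5": 47, "1.5-2.0": 15}
spec_like   = {"0.5-1.0": 5,  "1.0-1.5": 25, "1.5-2.0": 70}

gap_dcperf = representativeness_gap(production, dcperf_like)
gap_spec = representativeness_gap(production, spec_like)
```

The bias claim is then quantitative: the gap is a property of the benchmark's workload-origin population, not of measurement noise.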
GenAI training patterns (2024-06-12 post)¶
- patterns/build-both-fabric-alternatives — Meta's decision to simultaneously build a RoCE 24K-GPU cluster and an InfiniBand 24K-GPU cluster to learn operationally rather than forecast the fabric tradeoff. Canonical wiki reference.
- patterns/dedicated-backend-training-fabric — physical FE/BE rack wiring + dedicated BE Clos so the training fabric can evolve on its own schedule. Canonical wiki reference from 2024-08-05.
- patterns/collective-library-transport-codesign — NCCL + RoCE transport + switch QoS as a single designed system. Meta's DCQCN-off posture only works because CTS-based admission + CTS-prioritised switch queues + deep-buffer CTSW + PFC all work together.
- patterns/minimum-cut-training-job-placement — topology-aware scheduler computes min-cut partition across AI Zones + recommends rank assignments so cross-Zone traffic (oversubscribed ATSW layer) is minimised.
- patterns/data-center-density-optimization — evict non-compute services (readers) from the GPU data hall; pack GPU racks densely within a single network cluster; accept air-cooling constraints when cooling infrastructure can't be changed in time.
- concepts/hardware-reliability-at-scale — failure rate scales with GPU count; Meta's operational response is monitoring + automation + spare capacity.
- concepts/gpu-training-failure-modes — GPU-falls-off-PCIe, DRAM/SRAM uncorrectable errors, network-cable failures; early-life biased distribution.
- concepts/training-checkpoint — efficient preservation of training state as a named first-class scaling property at 24K-GPU scale.
- concepts/collective-communication-topology-awareness — replacing default ring-allreduce with recursive-doubling/halving algorithms for latency-sensitive collectives at 24K-GPU scale.
- concepts/fat-flow-load-balancing — training produces long-lived, high-bandwidth flows that defeat ECMP hashing; explicit routing + load balancing investment required, especially for the RoCE cluster. The 2024-08-05 SIGCOMM paper adds the full evolution timeline: baseline ECMP → concepts/path-pinning (failed under partial rack allocation; 30%+ degradation) → E-ECMP + QP scaling (+40% AllReduce).
- concepts/ecmp-equal-cost-multipath / concepts/rdma-queue-pair / concepts/enhanced-ecmp-qp-scaling / concepts/path-pinning — the concepts that form Meta's RoCE routing stack.
- concepts/dcqcn / concepts/priority-flow-control / concepts/receiver-driven-traffic-admission — Meta's unconventional congestion-control posture: DCQCN off, PFC-only + NCCL CTS handshake as library-level admission.
- concepts/backend-frontend-network-separation — the dual-network rack wiring Meta uses to let the BE training fabric evolve independently.
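The topology-awareness bullet above hinges on a concrete algorithm swap: recursive doubling finishes an allreduce in log2(N) exchange steps instead of a ring's N-1, which is what matters for small, latency-sensitive collectives. A minimal single-process simulation of the communication pattern:

```python
def allreduce_recursive_doubling(values):
    """Simulated recursive-doubling allreduce over a power-of-two rank
    count: in round k, rank r exchanges its partial sum with rank
    r XOR 2^k, so every rank holds the full sum after log2(N) rounds."""
    n = len(values)
    assert n & (n - 1) == 0, "power-of-two rank count assumed"
    partial = list(values)
    step = 1
    rounds = 0
    while step < n:
        # Each rank pairs with its XOR partner and sums the two partials.
        partial = [partial[r] + partial[r ^ step] for r in range(n)]
        step *= 2
        rounds += 1
    return partial, rounds

sums, rounds = allreduce_recursive_doubling([1, 2, 3, 4, 5, 6, 7, 8])
# 8 ranks converge on the full sum (36) in 3 rounds, vs 7 for a ring
```

At 24K-GPU scale the same latency argument holds per communicator, which is why Meta replaces the default ring for the latency-sensitive collectives rather than everywhere.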
Query/warehouse patterns¶
- Canary + shadow cluster rollout — dual-stage deployment pipeline for Presto engine releases.
- Bad-host auto-drain — attribute query failures per-host, alert on anomalous rates, auto-drain from the fleet.
- Automated cluster standup/decommission — full hardware → serving cluster pipeline with no manual steps.
- Gateway throttling by dimension — per-user, per-source, per-IP, and global query admission control at the Presto Gateway.
- Gateway autoscaling — horizontal elasticity for the query-gateway tier under adversarial traffic.
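Gateway throttling by dimension can be sketched as admission that must pass the limit on every dimension a query matches before any counter is consumed. Fixed-window counters keep the sketch short; this is an illustrative shape, not the Presto Gateway's actual implementation, and all limits are made up.

```python
import time
from collections import defaultdict

class GatewayAdmission:
    """Multi-dimension admission control: a query is admitted only if it
    is under the limit on every matching dimension (per-user, per-source,
    per-IP, and global)."""
    def __init__(self, limits, window=1.0, clock=time.monotonic):
        self.limits = limits                  # {dimension: max queries per window}
        self.window, self.clock = window, clock
        self.counts = defaultdict(int)
        self.window_start = clock()

    def admit(self, user, source, ip):
        now = self.clock()
        if now - self.window_start >= self.window:
            self.counts.clear()               # new fixed window
            self.window_start = now
        keys = [("user", user), ("source", source), ("ip", ip), ("global", None)]
        if any(self.counts[k] >= self.limits[k[0]] for k in keys):
            return False                      # throttled on at least one dimension
        for k in keys:                        # admit: consume on all dimensions
            self.counts[k] += 1
        return True

gw = GatewayAdmission({"user": 2, "source": 5, "ip": 3, "global": 100})
admitted = [gw.admit("alice", "dashboard", "10.0.0.1") for _ in range(3)]
# third query rejected: alice's per-user limit of 2 trips first
```

Checking all dimensions before consuming any counter is the design choice worth noting: a rejected query should not burn budget on the dimensions it did pass.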
RTC audio-codec patterns (2024-06-13 MLow post)¶
- patterns/classic-dsp-over-ml-for-compute-constrained — canonical Meta example: built MLow (classical CELP + refinements) rather than ship Encodec (ML) because >20% of Meta calls are on ARMv7 devices and 10s of millions of daily WhatsApp calls are on 10+-year-old phones. ML codec quality doesn't matter if the target devices can't run it.
- patterns/aggressive-fec-at-low-bitrate — MLow's lower bitrate floor creates FEC headroom; Meta spends that headroom on redundancy rather than fidelity. Future work explicitly targets "pumping out more redundant audio, which MLow allows us to do efficiently."
- patterns/bandwidth-adaptive-codec-mode — RTC codec operating point is driven by a bandwidth-estimation module; lower operating floor is a first-class codec feature.
- concepts/low-end-device-inclusion — the Meta-specific framing of the constraint: codec/ML/rendering choices must serve the low-end device population, not just flagship handsets.
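The interaction between the bandwidth-adaptive-mode and aggressive-FEC bullets is just a budget split, which a sketch makes explicit: a lower media floor means that, under the same bandwidth estimate, more of the budget can go to redundant audio. All figures are illustrative, not MLow's real bitrates.

```python
def allocate_bitrate(estimated_kbps, floor_kbps=6.0, target_kbps=14.0):
    """Split the bandwidth-estimation module's budget between media and
    FEC redundancy: media gets up to its target; the surplus above the
    target — headroom a lower codec floor helps create — goes to FEC.
    floor_kbps / target_kbps are hypothetical operating points."""
    if estimated_kbps <= floor_kbps:
        return floor_kbps, 0.0               # at the floor: no FEC headroom
    media = min(target_kbps, estimated_kbps)
    fec = estimated_kbps - media             # headroom spent on redundancy
    return media, fec

media, fec = allocate_bitrate(estimated_kbps=20.0)
# with a 6 kbps floor and a 14 kbps target, a 20 kbps estimate leaves
# 6 kbps for redundant audio; a higher-floor codec would leave less
```

The lower operating floor is thus a first-class feature twice over: it keeps calls working on starved links, and it funds redundancy on healthy ones.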
Source-control + monorepo-VCS patterns (2024-09-10 Sapling post)¶
- patterns/usability-first-vcs-cli — Sapling's canonical instance: design the VCS CLI so every command does one thing, defaults are sensible, the default view (`sl` alone) shows smartlog, and recovery commands (undo/redo/uncommit/unamend/hide/unhide) are first-class. Explicitly sized by Meta as a support-headcount multiplier — "the Sapling development team is small, and in order to support our tens of thousands of internal developers, we needed to make it as easy as possible to solve your own issues and get unblocked."
- patterns/vcs-undo-tooling — treat undo as a subsystem, not a recovery procedure. Named commands + an interactive `sl undo -i` scroller through old smartlog views. Post-quote: "never again should you have to delete your repository and clone again to get things working." Substrate: mutation history (Mercurial-Evolve-inspired).
- patterns/first-class-commit-stack-workflow — Meta's unit of code review is the stack of small commits, not the pull request. Sapling makes this ergonomic via `sl goto` + `sl amend` + `sl restack` + `sl absorb` + `sl amend --to COMMIT` + `sl fold` + `sl split`, all safe under mutation-history tracking. Canonical wiki instance. Pairs with ReviewStack on the review-UI side.
- patterns/lazy-history-on-demand — scale-side pattern: clone downloads almost nothing; history data is fetched on demand; queries stay fast via segment-graph bisection on a megabyte-scale Segmented Changelog index. Pattern-family cousin of patterns/blobless-clone-lazy-hydrate (Cloudflare Artifacts / ArtifactFS).
- patterns/organization-owned-sparse-profile — the architectural move that makes sparse checkout operationally viable at scale: check the sparse-checkout profile into the repo; product teams own the profile; engineers opt in by name; dependency changes update the profile; every engineer picks up the new state on the next `checkout`/`rebase`. Canonical wiki instance — "thousands of engineers to work on a constantly shifting subset of the repository without ever having to think about it."
- concepts/vcs-usability — the canonical wiki concept: usability in a VCS is a first-class, independent design axis orthogonal to scale. Sapling's thesis is that a VCS can invest in both simultaneously.
- concepts/commit-stack — the reviewable-unit primitive.
- concepts/mutation-history-commit — the substrate for stack-editing and undo.
- concepts/lazy-history-download / concepts/segmented-changelog / concepts/commit-graph-bisection — the history-scale primitive trio.
- concepts/virtual-filesystem-for-monorepo / concepts/sparse-checkout / concepts/sparse-profile — the working-copy-scale primitives.
- concepts/monorepo — extended with the Meta-scale upper-bound framing as the regime where even a tuned Git on a managed SaaS stops being viable.
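The undo-as-subsystem idea rests on a simple substrate the mutation-history bullet names: mutating commands never destroy commits, they produce a new visible-commit view, so undo is just restoring an earlier view. A deliberately tiny sketch of that shape (not Sapling's actual data model, which stores per-commit predecessor/successor mutation records):

```python
class UndoLog:
    """Undo as a subsystem: every mutating command appends a snapshot of
    the visible commit set, so undo restores the previous snapshot and
    redo the next — nothing is deleted, old commits are merely hidden."""
    def __init__(self, initial):
        self.snapshots = [frozenset(initial)]
        self.pos = 0

    @property
    def visible(self):
        return self.snapshots[self.pos]

    def mutate(self, new_visible):
        # A new mutation invalidates any not-yet-redone future states.
        self.snapshots = self.snapshots[: self.pos + 1]
        self.snapshots.append(frozenset(new_visible))
        self.pos += 1

    def undo(self):
        self.pos = max(0, self.pos - 1)
        return self.visible

    def redo(self):
        self.pos = min(len(self.snapshots) - 1, self.pos + 1)
        return self.visible

log = UndoLog({"c1"})
log.mutate({"c1", "c2"})    # commit c2
log.mutate({"c1", "c2b"})   # amend: c2 hidden, successor c2b visible
```

Because hidden commits still exist, the interactive scroller through old smartlog views is just a read over `snapshots`: no repository state ever needs re-cloning.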
Developer tools + code-indexing patterns (2024-12-19 Glean post)¶
- patterns/centralized-ahead-of-time-indexing — Meta's load-bearing decision to run code indexing on a shared fleet ahead of time, replicate the databases across the query service, and expose the index to clients over the network instead of downloading it. Canonical wiki instance = Glean.
- patterns/language-neutral-schema-abstraction — keep the detailed language-specific schemas underneath, define language-neutral views over them in the schema language itself (Glean's derived predicates in Angle; analogous to SQL views). Canonical wiki instance = Glean + Angle. Lets Glass expose one uniform navigation API without forcing the underlying data to be lowest-common-denominator.
- patterns/diff-based-static-analysis — index each diff to produce a machine-readable diff sketch, then fan out to static analysis, semantic lint, rich notifications, commit-level semantic search (production stack-trace → recent-touching-commits), and review-time go-to-definition inside Phabricator. Canonical wiki instance = Glean + Phabricator across 8 languages.
- concepts/incremental-indexing — process just the changes; target O(changes), realistic floor O(fanout). Glean's position: the index is perpetually out of date at monorepo scale unless you commit to incrementality. Fanout closure itself is an Angle query.
- concepts/stacked-immutable-databases — representation substrate for incremental indexing: non-destructive layered adds/hides on top of a base database, queried as a single logical view per revision, delta-sized storage. Mechanism disclosed but details deferred.
- concepts/symbol-id — per-language stable string handle for symbols; URLs to docs + references survive code-motion. Glass owns the ID assignment.
- concepts/code-indexing — the general primitive; Glean is the canonical wiki instance of the IDE-local-to-centralized shift.
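The stacked-immutable-databases bullet describes a mechanism concrete enough to sketch: each revision is a delta layer of added facts plus hides of facts below, and a query resolves the stack top-down into one logical view. Fact shapes and method names here are hypothetical; the post discloses the mechanism but defers the details.

```python
class StackedDB:
    """Layered, non-destructive fact storage: each layer adds facts and
    hides facts from layers below; querying resolves the whole stack into
    a single logical view, while per-revision storage stays delta-sized."""
    def __init__(self):
        self.layers = []                       # (added, hidden) per revision

    def push(self, added=(), hidden=()):
        self.layers.append((set(added), set(hidden)))

    def view(self):
        facts = set()
        for added, hidden in self.layers:      # later layers win
            facts -= hidden
            facts |= added
        return facts

db = StackedDB()
db.push(added={("f", "defined_in", "a.py")})          # base index of the repo
db.push(added={("f", "defined_in", "b.py")},          # incremental layer for one
        hidden={("f", "defined_in", "a.py")})         # diff: f moved a.py -> b.py
```

This is exactly the representation incremental indexing needs: re-indexing a diff produces one small layer sized by the change (plus fanout), never a rebuild of the base.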
Anti-abuse rule-engine patterns (2015-06-26 Haskell post)¶
- patterns/rule-engine-with-continuous-policy-deploy — Sigma's "source code in the repository is the code running in Sigma" operational posture. Minutes from commit to fleet. Requires: a purely functional strongly typed policy language (type-correct-or-rejected at repo ingress), hot-code swapping runtime, policy-language performance within the request-path budget (no perf-critical logic trapped in the slower-deploying C++ layer), and interactive testing against production data. Canonical wiki instance = Meta Sigma.
- patterns/embedded-functional-runtime-in-cpp-service — Meta's Haskell-between-two-layers-of-C++ integration pattern: mature C++ Thrift server on top, existing C++ service-client libraries below wrapped as Haxl data sources via FFI (compile-time C++ name-demangler removes intermediate C shim for most calls). Get the managed-runtime properties (purity, type safety, implicit-concurrent fetching, hot-swap) exactly where they pay off (the request-evaluation layer) without rewriting transport or client libraries. Canonical wiki instance = Meta Sigma.
- concepts/hot-code-swapping — canonical wiki primitive for live policy reload. Sigma's three enabling conditions: short-lived requests (no in-flight migration), persistent-state code is never hot-swapped, GHC's reachability-based garbage collector detects when old compiled code is no longer referenced and triggers safe unload.
- concepts/implicit-concurrent-data-fetching — Haxl + Applicative do-notation together. The programmer writes pure functional code that looks sequential; the framework + compiler-level do-block rearrangement together batch same-source fetches and overlap fetches on different sources, without explicit concurrency constructs. Canonical industrial instance. Compiler-language co-design in service of a production concurrency property — a library alone cannot rearrange statements without changing the language.
- concepts/allocation-limit — per-thread memory cap enforced by the runtime, with safe termination via asynchronous exceptions. Meta added this to GHC upstream specifically for Sigma's multi-tenant request isolation. The thread is the blast-radius unit — a pathological request is terminated without affecting peer requests. Cooperative-scheduler sibling of OS-level RLIMITs; thread-scoped variant of concepts/blast-radius containment.
- concepts/purely-functional-policy-language — language-level property set Meta explicitly ties to operational safety: policies cannot interact, cannot crash the engine, are isolable for unit testing; strong types eliminate many bugs pre-production. Canonical wiki statement of language-as-production-safety-primitive for rule engines; pairs with type-correct-at-ingress as the repo-level safety gate.
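The implicit-concurrent-fetching concept above has a well-known round-based shape that can be sketched outside Haskell: tasks written as if sequential yield their data requests; per round, all pending requests are grouped per source and fetched in one batch. This Python generator version is a structural analogy to Haxl, not its API — in GHC the batching additionally relies on Applicative do-notation rearranging the code.

```python
from collections import defaultdict

def run_with_batching(tasks, batch_fetch):
    """Haxl-style round-based evaluation: each task is a generator
    yielding (source, key) requests; each round batches all pending
    requests per source, so logically sequential code gets batched and
    overlapped fetches without explicit concurrency constructs."""
    pending = [(t, None) for t in tasks]
    results, rounds = [], 0
    while pending:
        requests = []
        for task, value in pending:
            try:
                requests.append((task, task.send(value)))   # run to next fetch
            except StopIteration as done:
                results.append(done.value)                  # task finished
        if not requests:
            break
        rounds += 1
        by_source = defaultdict(list)
        for _, (source, key) in requests:
            by_source[source].append(key)
        # One batched fetch per data source per round.
        answers = {s: batch_fetch(s, ks) for s, ks in by_source.items()}
        pending = [(task, answers[src][key]) for task, (src, key) in requests]
    return results, rounds

def score(user):                       # reads as sequential code
    profile = yield ("profiles", user)
    return profile["score"] * 2

calls = []
def fake_fetch(source, keys):          # stand-in batched data source
    calls.append((source, sorted(keys)))
    return {k: {"score": len(k)} for k in keys}

out, rounds = run_with_batching([score("alice"), score("bob")], fake_fetch)
# both profile reads coalesce into one batched fetch in a single round
```

The property Sigma gets for free from purity is visible here by contrast: in Python nothing stops a task's body from doing hidden I/O outside the yield protocol, whereas in Haskell the type system confines all fetching to the framework.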
Recommendation + social-discovery patterns (2026-03-18 Friend Bubbles post)¶
- patterns/survey-trained-closeness-model — new canonical pattern. Train an ML closeness model (or any latent-relationship-quantity model) against refreshed direct-survey labels rather than platform-activity proxies. Use platform activity as features, use survey answers as labels — the separation that breaks optimise-to-proxy failure modes. Meta's survey is "lightweight binary" (close vs not close) with proxy questions (e.g. "how often do you communicate") as additional signal. Weekly inference over trillions of pairs. Extends the wiki's ground-truth-labelling family alongside patterns/human-calibrated-llm-labeling and patterns/human-in-the-loop-quality-sampling.
- patterns/conditional-probability-ranking-objective — new canonical pattern. Add a new ranking signal to an existing multi-objective ranker as a conditional-probability term `P(outcome | new condition)` with a tunable weight, not a new formula. Compatible with MTML as a new head. Meta's instance: `P(video engagement | bubble impression)` balances social-interaction with video-engagement objectives without abandoning existing tuning.
- patterns/conditional-animation-for-scroll-performance — new canonical pattern. Gate UI animations on two conditions — interaction state (off during active scroll) + device class (off entirely on low-end devices) — rather than animating unconditionally. Treat animation as a budget item. Extends Meta's low-end-device inclusion posture (previously MLow-audio-codec-only) to client-UI-rendering; complements prefetch-window metadata co-attending as the fetch-side sibling discipline.
- patterns/closed-feedback-loop-ai-features — extended with the Friend Bubbles recommendation-system instance. Bubble-impression + engagement data flows back into training so MTML models keep learning friend-content resonance. Now canonical across RCA + Kotlinator + Friend Bubbles — three distinct Meta product domains.
- patterns/retrieve-then-rank-llm — extended: Friend Bubbles is the recommendation-system + MTML-ranker sibling instance to the RCA-system + LLM-ranker canonical instance. Both use heuristic stage-1 retrieval + a heavier stage-2 ranker; both demonstrate the retriever-recall-is-the-ceiling principle Meta states explicitly in Friend Bubbles ("high-quality friend content may never enter the ranking pipeline in the first place").
- concepts/retrieval-ranking-funnel — new concept. The canonical two-stage recommendation-system architecture generalised from the LLM-specific pattern. Explicit top-of-funnel expansion when a new candidate class is missing.
- concepts/viewer-friend-closeness — new concept. The ML-estimated social-relationship strength used as retrieval threshold + ranker feature. Weekly inference over trillions of pairs; canonical precomputed-feature pairing with concepts/feature-store.
- concepts/multi-task-multi-label-ranking — new concept. The ranker-architecture class at both early- and late-stage Reels ranking, the natural host for new conditional-probability tasks.
- concepts/prefetch-window-metadata-coattending — new concept. The client-side primitive piggybacking new per-video metadata on the existing video-prefetch path so scroll + playback don't regress.
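The conditional-probability ranking objective reduces to one additive term with a tunable weight, which a sketch shows directly. The weight, scores, and item names are illustrative assumptions; Meta's real ranker is an MTML model where the conditional probability is a new head, not a post-hoc dictionary.

```python
def combined_score(base_scores, p_engagement_given_bubble, w_bubble=0.3):
    """Fold a new objective into an existing multi-objective ranker as a
    single weighted conditional-probability term rather than a new
    formula: score = existing_value + w * P(outcome | new condition)."""
    return {item: base + w_bubble * p_engagement_given_bubble.get(item, 0.0)
            for item, base in base_scores.items()}

base = {"reel_a": 0.50, "reel_b": 0.48}      # existing multi-objective values
p_cond = {"reel_b": 0.9}                     # P(video engagement | bubble impression)
scores = combined_score(base, p_cond)
ranked = sorted(scores, key=scores.get, reverse=True)
# the friend-bubble candidate overtakes reel_a without retuning the base formula
```

Items outside the new candidate class default to a zero conditional term, which is what keeps the existing tuning intact: turning the weight to zero recovers the old ranker exactly.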
Tier classification¶
Tier 1 — canonical hyperscale engineering source; systems and patterns surfaced from Meta posts cross-reference heavily with AWS, Google, Netflix, Cloudflare, LinkedIn.