
META 2026-03-31


Meta — Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads

Summary

Meta Engineering's 2026-03-31 ML Applications post describes Meta Adaptive Ranking Model, the serving stack Meta built to scale its Ads ranking models to LLM-scale complexity (O(10 GFLOPs) per token) while maintaining sub-second latency for billions of requests per second. The architectural thesis is that the "inference trilemma" (complexity vs. latency vs. cost) cannot be solved by throwing more hardware at per-ad-candidate inference. Instead, Meta shifts to a request-centric computation model that amortises heavy user-context work once per request rather than once per ad candidate, adds aggressive model-system co-design (selective FP8, hardware-aware graph/kernel specialisation, Grouped GEMM, horizontal fusion) to drive MFU to 35% across heterogeneous hardware, and shards embeddings across multiple GPUs to reach O(1T) parameter scale. The system launched on Instagram in Q4 2025 with reported +3% ad conversions and +5% CTR for targeted users.

Key takeaways

  1. The inference trilemma is the design frame. Meta names three uncompromising constraints that conflict at LLM scale: "Latency impacts user experience: Ads must be chosen and returned with sub-second latency. Scaling ads computation to LLM-scale level and beyond has traditionally been impossible without latency regressions"; "Cost efficiency is crucial: Brute force scaling by simply adding hardware is economically unsustainable"; and the implicit third — model complexity itself, which drives both. (Source text; see concepts/inference-trilemma-recsys)
  2. LLM-scale means O(10 GFLOPs) per token, but one order of magnitude faster than LLM chat. "Adaptive Ranking Model achieves a model complexity equivalent to the O(10 GFLOPs) per token used by top-tier LLMs. However, it operates an order of magnitude faster than standard LLM inference, maintaining O(100 ms) bounded latency." Chatbots budget seconds; ads budget hundreds of milliseconds — structurally different. (Source text)
  3. Request-oriented computation sharing transforms scaling from linear to sub-linear. "Traditional models process each user-ad pair independently, creating massive computational redundancy. Adaptive Ranking Model eliminates this through Request-Oriented Optimization, which computes high-density user signals once per request rather than once per ad candidate." The heavy user tower runs once; the lightweight ad-scoring tower fans out across candidates. (Source text; see concepts/request-oriented-computation-sharing and patterns/request-centric-inference-architecture)
  4. In-kernel broadcast shares user embeddings across ad candidates without extra memory traffic. "Request-Oriented Computation Sharing and In-Kernel Broadcast optimization, which shares request-level embeddings across ad candidates directly within the GPU kernel, transforms scaling costs from linear to sub-linear while significantly reducing memory bandwidth pressure." The pattern is identical in shape to batched-matmul broadcasting, applied at recsys request granularity. (Source text)
  5. Long user-behaviour sequences are unlocked by centralised KV storage joined to training data on the fly. "Request-Oriented Sequence Scaling unlocks the use of long-form user behavior sequences that were previously limited by compute and storage costs. To minimize compute overhead, Adaptive Ranking Model processes heavy sequences once per request and shares the results across all ad candidates. To optimize storage, it replaces redundant data replication with a centralized, high-efficiency key-value store of user logs that are joined with training data on the fly." (Source text; see concepts/request-oriented-sequence-scaling)
  6. Wukong Turbo is the runtime evolution of Meta's prior Wukong recsys architecture. "While Request-Oriented Optimization optimizes the computation flow, Wukong Turbo is the optimized runtime evolution of the Meta Ads internal architecture. Building on the Wukong architecture that uses stackable factorization machines, sequence learning and cross-layer attention, Wukong Turbo introduces specific refinements to handle the numeric instability and network overhead that typically arise when scaling deep models." No-bias approach removes unstable terms; small-parameter delegation offloads parameters from FSDP to DDP; sparsity-based simplification reduces redundant linear-layer components. (Source text; see systems/wukong-turbo)
  7. Feature preprocessing moved from CPU client to remote GPU hosts with GPU-native kernels. "Adaptive Ranking Model offloads preprocessing from the client CPU to remote GPU hosts, utilizing compact tuple-based formats and GPU-native kernels that reduce Top-K complexity from O(N log N) to O(N)." Thread-pool contention in the client was the symptom; GPU-side processing is the resolution. (Source text)
  8. Selective FP8 quantization — deploy FP8 only on layers with high precision-loss tolerance. "Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking. Adaptive Ranking Model overcomes this through a post-training quantization strategy that applies FP8 selectively. Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance." (Source text; see concepts/selective-fp8-quantization and patterns/selective-mixed-precision-quantization)
  9. Graph and kernel specialisation drive MFU to 35% across heterogeneous hardware. "We fuse operators that share inputs to minimize data movement between high-bandwidth memory and on-chip SRAM. Additionally, thousands of small operations are consolidated into compute-dense kernels using techniques like Grouped General Matrix Multiply and horizontal fusion." Named outcome: "we've boosted model FLOPs utilization (MFU) to 35% across multiple hardware types." (Source text; see concepts/model-flops-utilization, concepts/hardware-aware-model-architecture, patterns/model-system-codesign-ranking)
  10. Embeddings cross the terabyte boundary, forcing multi-card sharding. "Mapping these IDs to high-dimensional embedding tables creates a critical trade-off where oversized tables lead to overfitting, while undersized tables suffer from hash collisions that degrade model quality." The resolution: "As LLM-scale model embeddings approached the terabyte level, they exceeded the memory capacity of any single GPU. To mitigate this, a multi-card sharding mechanism splits embedding tables into segments distributed across an optimized hardware cluster." (Source text; see concepts/hash-collision-embedding-tradeoff, concepts/multi-card-embedding-sharding, patterns/multi-card-sharded-embedding-serving)
  11. Unified embeddings reduce memory footprint via table sharing across features. "This is further optimized by unified embeddings, which allow multiple features to share a single embedding table to significantly reduce the memory footprint without sacrificing the ability to learn complex feature interactions." (Source text; see concepts/unified-embeddings)
  12. Accelerated model loading keeps trillion-parameter deployments below 10 minutes. "We developed accelerated model loading that utilizes multi-stream downloading and remote caching to load models in under 10 minutes, minimizing downtime during deployments. Auto-scaling rules based on streaming multiprocessor utilization allows the system to handle fluctuating traffic dynamically." (Source text)
  13. Launched on Instagram Q4 2025 with measurable business lift. "Since launching on Instagram in Q4 2025, Adaptive Ranking Model has delivered a +3% increase in ad conversions and +5% increase in ad click through rate for targeted users." These are the only quantitative outcome numbers in the post. (Source text; see systems/meta-instagram)
  14. Forward roadmap names the next axes: ultra-low precision, agentic kernel optimisation, near-instantaneous model freshness. "We are pioneering a new era of inference execution efficiency, leveraging advanced model compression and ultra-low precision quantization methods" · "we are exploring agentic optimization frameworks to further accelerate kernel performance optimizations" · "we're reimagining the speed of learning through near-instantaneous model freshness, utilizing incremental, in-place weight updates to achieve constant, real-time adaptation." (Source text)
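The request-oriented sharing in takeaways 3–5 can be sketched as a two-tower split: a heavy "user tower" that runs once per request, and a light "ad tower" whose output is combined with the single user embedding across all candidates. This is an illustrative numpy sketch, not Meta's implementation; all dimensions, the tanh non-linearity, and the dot-product scoring head are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not disclosed in the post.
D_USER, D_AD, D_HID, N_ADS = 256, 64, 128, 1000

# Heavy "user tower": runs ONCE per request on request-level user signals.
W_user = rng.standard_normal((D_USER, D_HID))
def user_tower(user_feats):
    # Cost is O(D_USER * D_HID), independent of the candidate count.
    return np.tanh(user_feats @ W_user)

# Light "ad tower": fans out across all candidates in one batched op.
W_ad = rng.standard_normal((D_AD, D_HID))
def score_candidates(user_emb, ad_feats):
    ad_emb = np.tanh(ad_feats @ W_ad)   # (N_ADS, D_HID)
    # Broadcasting one request-level embedding across every candidate row is
    # the numpy-level analogue of the in-kernel broadcast the post describes.
    return ad_emb @ user_emb            # (N_ADS,) scores

user_feats = rng.standard_normal(D_USER)
ad_feats = rng.standard_normal((N_ADS, D_AD))

user_emb = user_tower(user_feats)              # once per request
scores = score_candidates(user_emb, ad_feats)  # shared across N_ADS candidates
print(scores.shape)
```

Only the cheap per-candidate work scales with N_ADS, which is the sense in which per-request cost becomes sub-linear in the candidate count.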
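Takeaway 8's "micro-benchmark guided selection" can be pictured as a per-layer calibration loop: quantize one layer at a time, measure the output error on calibration data, and keep FP8 only where the error stays under a tolerance. The sketch below is a crude simulation (mantissa truncation stands in for hardware FP8, and the tolerance value and error metric are invented); the post does not disclose the real benchmark metric or cutoff.

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_fp8(x, mantissa_bits=3):
    """Crude stand-in for FP8 casting: truncate the float32 mantissa.
    Real deployments use hardware FP8 formats (e4m3/e5m2); illustrative only."""
    m, e = np.frexp(x)
    scale = 2.0 ** mantissa_bits
    return np.ldexp(np.round(m * scale) / scale, e)

# Hypothetical per-layer weights and a small calibration batch.
layers = {name: rng.standard_normal((64, 64)) for name in ("dense_a", "dense_b", "attn_out")}
calib = rng.standard_normal((32, 64))

TOLERANCE = 0.05  # assumed cutoff; not from the post

fp8_layers = []
for name, W in layers.items():
    ref = calib @ W                      # full-precision reference output
    quant = calib @ fake_fp8(W)          # output with this ONE layer quantized
    rel_err = np.abs(ref - quant).mean() / np.abs(ref).mean()
    if rel_err < TOLERANCE:              # deploy FP8 only where tolerance is high
        fp8_layers.append(name)

print(sorted(fp8_layers))
```

The essential point is the selectivity: layers failing the benchmark stay at higher precision rather than being blanket-cast.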

Systems extracted

  • systems/meta-adaptive-ranking-model — the headline system; the LLM-scale ads ranking serving stack. Three-pillar architecture: inference-efficient model scaling (request-oriented optimisation + Wukong Turbo), model-system co-design (selective FP8 + graph/kernel specialisation), reimagined serving infra (O(1T) params + multi-card embeddings + runtime resilience). 35% MFU, sub-second latency, O(100 ms) bounded; deployed on Instagram Q4 2025.
  • systems/wukong-turbo — the optimised runtime evolution of Meta Ads's internal Wukong architecture used inside Adaptive Ranking Model. Adds no-bias for numerical stability, small-parameter delegation from FSDP to DDP to reduce network overhead, and sparsity-based linear-layer simplification — without increasing FLOPs or parameter counts.
  • systems/wukong-meta — stub for the foundational 2024 Wukong architecture paper (arXiv:2403.02545) that Wukong Turbo builds on: stackable factorisation machines, sequence learning, cross-layer attention. Predecessor generation.
  • systems/meta-instagram — existing system page; updated with the Adaptive Ranking Model deployment as the Q4 2025 ads-ranking surface.

Concepts extracted

New:

  • concepts/inference-trilemma-recsys — the three-way tension at LLM-scale recsys serving: model complexity vs. sub-second latency vs. cost efficiency, where scaling any one axis naively degrades the other two. Meta's explicit design frame for Adaptive Ranking Model.
  • concepts/request-oriented-computation-sharing — the architectural shift from per-user-ad-pair independent processing to per-request computation. Heavy user signals computed once, shared across ad candidates via in-kernel broadcast. Transforms scaling costs from linear to sub-linear in candidate count.
  • concepts/request-oriented-sequence-scaling — unlocks long-form user behaviour sequences by processing them once per request and sharing results across candidates, and by replacing redundant data replication with a centralised KV store of user logs joined with training data on the fly.
  • concepts/selective-fp8-quantization — post-training quantisation strategy that applies FP8 only to layers the model tolerates without quality loss, identified via micro-benchmark-guided selection. The alternative to naive full-FP8 casts that degrade ranking nuance.
  • concepts/multi-card-embedding-sharding — architectural primitive for serving embedding tables that exceed single-GPU memory. Tables are split into segments across an optimised hardware cluster with hardware-specific communication optimisations; achieves parity with single-card setups while decoupling model complexity from single-GPU memory ceilings.
  • concepts/unified-embeddings — memory-optimisation primitive letting multiple features share a single embedding table, reducing memory footprint without sacrificing learning capacity for complex feature interactions.
  • concepts/hash-collision-embedding-tradeoff — the core tension in embedding-table sizing for categorical features: oversize → overfitting; undersize → hash collisions degrade model quality. Motivates the sparsity-aware allocation + pruning approach Adaptive Ranking Model uses.
  • concepts/hardware-aware-model-architecture — model design discipline of aligning model structure with underlying hardware capabilities and limitations (dtype support, memory hierarchy, kernel-launch overhead) so that model complexity translates directly into utilisation. Canonical statement tied to 35% MFU outcome.
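The multi-card sharding concept above reduces, at its simplest, to row-wise partitioning of one logical table across devices, with lookups routed to the owning shard. A minimal single-process numpy sketch, where Python lists stand in for per-GPU memory (shard count, table size, and the contiguous row-range sharding scheme are all assumptions; the post does not describe the actual partitioning or communication pattern):

```python
import numpy as np

rng = np.random.default_rng(2)

# One logical embedding table, split row-wise across 4 "cards".
NUM_SHARDS, ROWS_PER_SHARD, DIM = 4, 1_000, 16
shards = [rng.standard_normal((ROWS_PER_SHARD, DIM)).astype(np.float32)
          for _ in range(NUM_SHARDS)]   # each entry stands in for one GPU's memory

def lookup(ids):
    ids = np.asarray(ids)
    shard_idx = ids // ROWS_PER_SHARD    # which card owns each row
    local_idx = ids % ROWS_PER_SHARD     # offset within that card's segment
    out = np.empty((len(ids), DIM), dtype=np.float32)
    for s in range(NUM_SHARDS):          # in a real system: one gather per device,
        mask = shard_idx == s            # then cross-device communication to collect
        out[mask] = shards[s][local_idx[mask]]
    return out

embs = lookup([3, 1_500, 3_999])
print(embs.shape)
```

The aggregate capacity is NUM_SHARDS times one card's memory, which is the decoupling of model scale from single-GPU memory the concept names.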

Existing (reinforced):

  • concepts/model-flops-utilization — Meta's 35% MFU across heterogeneous hardware is a new data point extending this wiki concept from the existing MongoDB Voyage AI inference instance into the recsys-serving domain.
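To make the 35% MFU figure concrete, MFU is achieved model FLOPs/s divided by the hardware's peak FLOPs/s. The arithmetic below is illustrative only: the peak-FLOPs figure is assumed (the post names no hardware), and only the O(10 GFLOPs)-per-token number comes from the source.

```python
# MFU = achieved model FLOPs/s / peak hardware FLOPs/s.
peak_flops = 1.0e15        # ~1 PFLOP/s dense peak for a modern accelerator (assumed)
per_token_flops = 10e9     # the post's O(10 GFLOPs) per token
mfu = 0.35                 # the post's reported utilisation

tokens_per_sec_at_35_mfu = mfu * peak_flops / per_token_flops
print(round(tokens_per_sec_at_35_mfu))  # 35000
```

That is, under these assumed peak numbers, one device at 35% MFU sustains on the order of tens of thousands of 10-GFLOP tokens per second.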

Patterns extracted

New:

  • patterns/request-centric-inference-architecture — the overall architectural pattern. Shift the unit of inference from (user, ad-candidate) pairs to (request) events; compute heavy user context once per request, share across candidates via in-kernel broadcast, store long user sequences centrally and join to training data on the fly. Transforms compute + storage + memory-bandwidth costs from linear to sub-linear in candidate count.
  • patterns/model-system-codesign-ranking — the concrete set of co-design techniques Meta applies to drive MFU: selective FP8 quantisation guided by micro-benchmarks, fused operators to minimise HBM↔SRAM traffic, Grouped General Matrix Multiply and horizontal fusion to consolidate small ops into compute-dense kernels, alignment of graph structure to modern GPU architectures.
  • patterns/multi-card-sharded-embedding-serving — serving-layer pattern for embedding tables that exceed single-GPU memory: horizontal sharding across a hardware-aware cluster with communication optimisations, achieving performance parity with single-card setups while decoupling model scale from single-GPU memory capacity.
  • patterns/selective-mixed-precision-quantization — post-training quantisation applied per-layer based on a tolerance benchmark, not blanket-cast. The operational alternative to FP8-everywhere for models where some layers are precision-sensitive and others are not.
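The horizontal-fusion step in patterns/model-system-codesign-ranking has a simple linear-algebra core: many small layers that consume the same input can be concatenated along the output dimension and executed as one compute-dense matmul instead of many tiny kernel launches. A numpy sketch (layer count and shapes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Eight small linear layers that all consume the SAME input -- the situation
# horizontal fusion targets (many tiny GPU kernels -> one dense kernel).
x = rng.standard_normal((32, 64)).astype(np.float32)
small_weights = [rng.standard_normal((64, 16)).astype(np.float32) for _ in range(8)]

# Unfused: eight separate matmuls (eight kernel launches on a GPU).
unfused = [x @ W for W in small_weights]

# Horizontally fused: concatenate weights along the output dim, one matmul.
W_fused = np.concatenate(small_weights, axis=1)   # (64, 128)
fused = x @ W_fused                               # one launch, same math
fused_split = np.split(fused, 8, axis=1)

print(all(np.allclose(a, b, atol=1e-5) for a, b in zip(unfused, fused_split)))  # True
```

Grouped GEMM generalises the same idea to small matmuls with *different* inputs, batching them into one kernel rather than concatenating weights.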

Operational numbers

  • Per-token compute complexity: O(10 GFLOPs) per token (LLM-scale equivalent).
  • Latency budget: O(100 ms) bounded, sub-second overall.
  • Model FLOPs utilisation (MFU): 35% across multiple hardware types.
  • Parameter scale: O(1T) — trillion-parameter regime, enabled by multi-card embedding sharding.
  • Model loading: under 10 minutes using multi-stream downloading + remote caching.
  • Business outcome (Instagram Q4 2025): +3% ad conversions, +5% CTR for targeted users.
  • Preprocessing complexity: Top-K reduced from O(N log N) to O(N) via GPU-native kernels.
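The O(N log N) → O(N) Top-K number above corresponds to replacing a full sort with a selection algorithm. The post gives no kernel detail, but the complexity argument can be demonstrated on the CPU with numpy's introselect-based partition (array size and K are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
scores = rng.standard_normal(100_000).astype(np.float32)
K = 10

# O(N log N): full sort, then slice off the K largest.
top_sorted = np.sort(scores)[-K:]

# O(N) expected: partition places the K largest values in the tail without
# ordering the rest -- the selection-style analogue of the single-pass
# GPU-native Top-K kernel the post alludes to.
part = np.partition(scores, len(scores) - K)[-K:]

print(np.array_equal(np.sort(part), top_sorted))  # True
```

Both paths return the same K values; only the asymptotic work differs, which matters when N is the full per-request candidate set.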

Caveats

  1. Architecture-overview voice — no absolute QPS / fleet size / GPU count / inference p50 / p99 / per-request watts / $ per request disclosed. "Sub-second" and "O(100 ms)" are the only latency bounds; no p-tail specified.
  2. Hardware vendors unnamed — post says "across multiple hardware types" and "heterogeneous hardware" but does not name NVIDIA H100 / B200, AMD MI300X, or Meta's own MTIA silicon. Given Meta's 2024 ads-ranking hardware disclosures (Grand Teton + MI300X positioning), the mix likely includes at least H100 and MI300X; post does not confirm.
  3. "+3% conversions / +5% CTR" is framed as "for targeted users" not overall fleet — which users are targeted, how the targeted population compares to the control, and how lift is computed are not disclosed.
  4. FP8 selection mechanism is called "micro-benchmark guided" but the benchmark metric (loss delta threshold? downstream metric regression?) + cutoff criterion are not specified.
  5. Wukong Turbo refinements (no-bias, small-parameter delegation, sparsity simplification) are described qualitatively with no throughput / parameter-count / FLOP-delta numbers.
  6. Unified embeddings — no collision-avoidance detail (which features can share? what's the collision-detection mechanism? what's the quality impact?); post asserts "without sacrificing learning capacity" but provides no evidence.
  7. In-kernel broadcast — no description of the GPU kernel-level mechanism (CUDA primitive used? warp-scope? block-scope? explicit shared-memory staging?); the "kernel" framing is at the ML-systems level of abstraction.
  8. Multi-card sharding — no numbers on card count, sharding granularity, communication pattern (all-to-all? hierarchical?), or embedding-lookup latency relative to the sub-second budget.
  9. Auto-scaling — "based on streaming multiprocessor utilization" is the signal; no disclosure of target utilisation, scale-up / scale-down hysteresis, or how SM utilisation reconciles with request-QPS as the SLO driver.
  10. Comparison to prior ads-ranking system — no disclosure of the prior system's scale / MFU / parameter count / FLOPs budget, so the relative improvement from the architectural shift is not quantifiable from this post alone.
  11. Cross-reference to Wukong paper — the linked arXiv:2403.02545 carries the ranking-architecture detail (stackable factorisation machines, sequence learning, cross-layer attention) that this post summarises in one sentence; that paper is not ingested into the wiki and is a candidate for a future deep-dive if further Wukong material surfaces.
  12. Agentic kernel optimisation is mentioned as future work, with no current deployment — not yet a wiki-distillable primitive.
