Meta Adaptive Ranking Model¶
Definition¶
Meta Adaptive Ranking Model is the LLM-scale ads-ranking serving stack Meta Ads deployed in 2025 to serve model complexity equivalent to O(10 GFLOPs) per token — the range used by top-tier LLMs — under sub-second latency at Meta's request volume. Rather than brute-forcing hardware for per-ad-candidate inference, the system aligns model complexity with each request's context via intelligent request routing, so heavy model capacity is spent where it matters most (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
Three architectural pillars¶
Meta names the system's foundations in the post:
- Inference-efficient model scaling — a request-centric computation model that eliminates per-(user, ad-candidate) redundancy via request-oriented computation sharing and sequence scaling, plus the Wukong Turbo runtime evolution of Meta Ads's internal Wukong architecture.
- Model-system co-design — hardware-aware architectures tuned to accelerator capabilities, with selective FP8 quantisation applied only where layers tolerate it, and graph- + kernel-level specialisation (operator fusion for shared inputs, Grouped GEMM, horizontal fusion). Meta reports 35% MFU across multiple hardware types (see concepts/model-flops-utilization).
- Reimagined serving infrastructure — multi-card GPU serving that breaks single-GPU memory limits via multi-card embedding sharding and unified embeddings, reaching O(1T) parameter scale.
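The "selective FP8" idea — quantise only layers that tolerate it — can be sketched as a calibration pass. This is a hypothetical illustration, not Meta's method: the post names neither the error metric nor the cutoff, and a 255-level uniform grid stands in for real (non-uniform) FP8.

```python
import numpy as np

def fake_quantise_8bit(w: np.ndarray) -> np.ndarray:
    """Crude stand-in for FP8: snap weights onto a 255-level uniform
    grid spanning the tensor's range (real FP8 e4m3 is non-uniform)."""
    scale = np.abs(w).max() / 127.0
    if scale == 0.0:
        return w.copy()
    return np.round(w / scale) * scale

def select_layers_for_low_precision(layers: dict, tol: float = 0.01) -> list:
    """Hypothetical selection rule: quantise a layer only if the
    relative error it introduces stays under `tol`; the rest stay
    in higher precision. Metric and threshold are assumptions."""
    chosen = []
    for name, w in layers.items():
        err = np.linalg.norm(w - fake_quantise_8bit(w)) / np.linalg.norm(w)
        if err < tol:
            chosen.append(name)
    return chosen
```

A layer with well-spread weights passes the tolerance check, while one dominated by an outlier (which crushes the grid resolution for the small weights) does not — which is the kind of per-layer variation that makes blanket quantisation risky.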
The inference trilemma (the design frame)¶
Meta explicitly frames the problem as a three-way conflict:
- Latency — "Ads must be chosen and returned with sub-second latency. Scaling ads computation to LLM-scale level and beyond has traditionally been impossible without latency regressions that compromise user experience."
- Cost — "Brute force scaling by simply adding hardware is economically unsustainable."
- Complexity — the driver of both; Meta wants "a deeper understanding of people's interests and intent."
See concepts/inference-trilemma-recsys.
Transforming scaling from linear to sub-linear¶
Traditional ranking processes each user-ad pair independently, producing massive redundancy. Adaptive Ranking Model computes high-density user signals once per request, not per ad candidate. Shared embeddings are broadcast across candidates directly inside the GPU kernel (in-kernel broadcast) — eliminating repeated HBM traffic and transforming the cost curve from linear-in-candidates to sub-linear.
Long user-behaviour sequences use the same principle, processed once per request and shared across candidates. Storage redundancy is eliminated by keeping a centralised, high-efficiency KV store of user logs joined with training data on the fly — not replicated into each training shard.
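The compute-once-per-request pattern can be sketched as a two-tower scorer. This is an illustrative toy, not Meta's architecture: the single-matmul "user tower" stands in for the heavy per-request user/sequence computation, and the in-kernel broadcast itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # illustrative embedding width

def encode_user(user_feats: np.ndarray, W_user: np.ndarray) -> np.ndarray:
    """Heavy user tower: in the real system this covers long behaviour
    sequences; here it is one matmul for illustration."""
    return np.tanh(user_feats @ W_user)

def score_request(user_feats, ad_feats, W_user, W_ad):
    """Compute the user representation ONCE per request, then score
    every candidate against it — instead of redoing the heavy user
    work for each (user, ad-candidate) pair."""
    u = encode_user(user_feats, W_user)   # 1 heavy pass per request
    ads = np.tanh(ad_feats @ W_ad)        # 1 light pass per candidate
    return ads @ u                        # dot-product scores

W_user = rng.standard_normal((D, D))
W_ad = rng.standard_normal((D, D))
user = rng.standard_normal(D)
candidates = rng.standard_normal((1000, D))   # 1,000 ad candidates
scores = score_request(user, candidates, W_user, W_ad)
```

The heavy user-side cost is now constant in the candidate count; only the cheap candidate-side pass scales with it, which is the linear-to-sub-linear shift the post describes.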
Neutralising feature-preprocessing overhead¶
Preprocessing was historically the bottleneck causing GPU starvation — "client memory pressure and data starvation where the GPU's compute power remains underutilized while waiting for processed features." The resolution:
- Offload preprocessing from client CPU to remote GPU hosts.
- Use compact tuple-based feature formats.
- GPU-native kernels reducing Top-K from O(N log N) to O(N).
- Data compression + client-flow restructuring to eliminate thread-pool contention.
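The Top-K complexity reduction can be illustrated with a selection pass instead of a full sort. NumPy's introselect-based `np.partition` serves as the O(N) stand-in here; Meta's version is a GPU-native kernel, which this sketch does not reproduce.

```python
import numpy as np

def topk_sorted(scores: np.ndarray, k: int) -> np.ndarray:
    """Baseline: sort everything, O(N log N)."""
    return np.sort(scores)[-k:][::-1]

def topk_partition(scores: np.ndarray, k: int) -> np.ndarray:
    """Selection: partition so the top K land at the end in O(N)
    expected time, then sort only those K (O(K log K), K << N)."""
    top = np.partition(scores, len(scores) - k)[-k:]
    return np.sort(top)[::-1]
```

Both return the same K scores in descending order; only the work done on the N - K discarded candidates differs.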
Trillion-parameter embedding scale¶
Recsys is driven by sparse categorical features mapped to high-dimensional embedding tables. Sizing those tables is a tradeoff: oversized tables overfit and waste memory, while undersized tables suffer hash collisions, which forces principled sizing. Meta's approach:
- Allocate embedding hash sizes based on feature sparsity.
- Prune unused embeddings to maximise learning capacity within memory budgets.
- Unified embeddings — multiple features share a single embedding table.
When combined embedding tables cross the terabyte boundary, single-GPU memory is exceeded and multi-card sharding takes over — achieving performance parity with single-card setups via hardware-specific communication optimisations.
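The unified-embedding idea — multiple features sharing one table — can be sketched as salted hashing into a shared row space. The mixing constant, per-feature salt, and row counts here are illustrative assumptions; the post does not describe Meta's hashing scheme.

```python
import zlib
import numpy as np

class UnifiedEmbedding:
    """One table serving several sparse features: each feature's ids
    are salted into the shared row space, so identical raw ids from
    different features map to different rows (barring hash collisions)."""

    def __init__(self, num_rows: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((num_rows, dim)).astype(np.float32)
        self.num_rows = num_rows

    def lookup(self, feature_name: str, ids: np.ndarray) -> np.ndarray:
        # Deterministic per-feature salt; the Knuth multiplicative
        # constant spreads ids before the modulo.
        salt = np.uint64(zlib.crc32(feature_name.encode()))
        mixed = ids.astype(np.uint64) * np.uint64(2654435761) + salt
        return self.table[mixed % np.uint64(self.num_rows)]
```

When `num_rows * dim * 4` bytes exceeds a single GPU's HBM, `self.table` is what gets row-sharded across cards in the multi-card setup the post describes.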
Runtime resilience¶
- Accelerated model loading — multi-stream downloading + remote caching load trillion-parameter models in under 10 minutes.
- Auto-scaling on streaming-multiprocessor utilisation — handles spiky ads traffic without over-provisioning.
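Utilisation-driven auto-scaling typically follows a target-tracking rule; the post gives no policy details, so the target band and clamping below are assumptions, not Meta's configuration.

```python
import math

def desired_replicas(current: int, sm_util: float, target: float = 0.6,
                     lo: int = 1, hi: int = 1000) -> int:
    """Target-tracking sketch (assumed policy): size the fleet so that
    observed SM utilisation moves toward `target`, clamped to [lo, hi]
    so spiky ads traffic cannot thrash the replica count."""
    want = math.ceil(current * sm_util / target)
    return max(lo, min(hi, want))
```

Running hot (utilisation above target) scales out; running cold scales in, which is how the fleet avoids permanent over-provisioning for peak traffic.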
Deployment + outcomes¶
Launched on Instagram in Q4 2025. Reported impact:
- +3% ad conversions for targeted users.
- +5% click-through rate for targeted users.
See systems/meta-instagram for Meta's Instagram coverage.
Roadmap¶
Named future axes (not yet shipped):
- Ultra-low-precision quantisation extending beyond selective FP8.
- Agentic optimisation frameworks automatically adapting kernel performance to new hardware and model architectures.
- Near-instantaneous model freshness via incremental in-place weight updates for real-time adaptation.
Caveats¶
- Architecture-overview voice: no absolute QPS, fleet size, GPU count, inference p50 / p99, or per-request cost disclosed.
- "Multiple hardware types" / "heterogeneous hardware" not named; vendor mix (H100 / B200 / MI300X / MTIA) left implicit.
- +3% / +5% business lift is framed as "for targeted users", not overall fleet; targeting criterion and control composition undisclosed.
- FP8-selection benchmark metric + cutoff not specified.
- Pre-existing ads-ranking baseline not quantified in the post, so relative improvement from the shift is qualitative.
Seen in¶
- 2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source (first-party launch post) (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).