Meta Adaptive Ranking Model

Definition

Meta Adaptive Ranking Model is the LLM-scale ads-ranking serving stack Meta Ads deployed in 2025 to serve model complexity equivalent to O(10 GFLOPs) per token — the range used by top-tier LLMs — under sub-second latency at Meta's request volume. Rather than brute-forcing hardware for per-ad-candidate inference, the system aligns model complexity with each request's context via intelligent request routing, so heavy model capacity is spent where it matters most (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

Three architectural pillars

Meta names the system's foundations in the post:

  1. Inference-efficient model scaling — a request-centric computation model that eliminates per-(user, ad-candidate) redundancy via request-oriented computation sharing and sequence scaling, plus Wukong Turbo, the runtime evolution of Meta Ads' internal Wukong architecture.
  2. Model-system co-design — hardware-aware architectures tuned to accelerator capabilities, with selective FP8 quantisation applied only where layers tolerate it, plus graph- and kernel-level specialisation (operator fusion for shared inputs, Grouped GEMM, horizontal fusion); achieves 35% MFU across multiple hardware types (see concepts/model-flops-utilization). (A toy FP8-selection sketch follows this list.)
  3. Reimagined serving infrastructure — multi-card GPU serving that breaks single-GPU memory limits via multi-card embedding sharding and unified embeddings, reaching O(1T) parameter scale.
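
The post does not disclose the benchmark metric or cutoff behind "only where layers tolerate it" (see Caveats below). Under that caveat, a toy PyTorch sketch of the selection idea, with an assumed weight-error probe and an assumed tolerance, might look like:

```python
import torch  # requires torch >= 2.1 for float8 dtypes

def quantise_selectively(model: torch.nn.Module, tol: float = 1e-2) -> None:
    """Keep FP8 only for layers whose weights survive a round-trip probe.
    The probe metric and tolerance are assumptions, not Meta's criterion."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            ref = module.weight.detach()
            # Cast to FP8 and back to measure worst-case relative error.
            probe = ref.to(torch.float8_e4m3fn).to(ref.dtype)
            err = (probe - ref).abs().max() / ref.abs().max().clamp_min(1e-12)
            if err < tol:
                # Writing dequantised values back simulates FP8's effect;
                # a real deployment keeps FP8 storage and FP8 matmul kernels.
                module.weight.data = probe
```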

The inference trilemma (the design frame)

Meta explicitly frames the problem as a three-way conflict:

  • Latency — "Ads must be chosen and returned with sub-second latency. Scaling ads computation to LLM-scale level and beyond has traditionally been impossible without latency regressions that compromise user experience."
  • Cost — "Brute force scaling by simply adding hardware is economically unsustainable."
  • Complexity — the driver of both; Meta wants "a deeper understanding of people's interests and intent."

See concepts/inference-trilemma-recsys.

Transforming scaling from linear to sub-linear

Traditional ranking processes each user-ad pair independently, which creates massive redundancy. The Adaptive Ranking Model instead computes high-density user signals once per request, not per ad candidate. Shared embeddings are broadcast across candidates directly inside the GPU kernel (in-kernel broadcast), eliminating repeated HBM traffic and transforming the cost curve from linear-in-candidates to sub-linear.
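
A minimal PyTorch sketch of the sharing pattern (the towers, shapes, and dot-product interaction are illustrative assumptions, not Meta's Wukong Turbo code):

```python
import torch

user_tower = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU())
cand_tower = torch.nn.Linear(128, 256)

def rank_request(user_feats: torch.Tensor, cand_feats: torch.Tensor) -> torch.Tensor:
    """user_feats: [512] for one request; cand_feats: [num_candidates, 128]."""
    # Heavy user-side compute runs once per REQUEST, not once per candidate.
    u = user_tower(user_feats)          # [256]
    c = cand_tower(cand_feats)          # [num_candidates, 256]
    # Broadcasting shares `u` across all candidates without materializing
    # num_candidates copies of it (the in-kernel-broadcast idea).
    return (c * u).sum(dim=-1)          # [num_candidates] scores

scores = rank_request(torch.randn(512), torch.randn(1000, 128))
```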

Long user-behaviour sequences use the same principle, processed once per request and shared across candidates. Storage redundancy is eliminated by keeping a centralised, high-efficiency KV store of user logs joined with training data on the fly — not replicated into each training shard.
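
A toy sketch of the join-on-the-fly idea (the dict stand-in, names, and schema are assumptions, not Meta's storage layer):

```python
user_log_store: dict[str, list[dict]] = {}  # stand-in for the central KV store

def log_event(user_id: str, event: dict) -> None:
    user_log_store.setdefault(user_id, []).append(event)

def training_example(row: dict) -> dict:
    # Each stored training row carries only a user_id; the long behaviour
    # sequence is attached at read time, so it exists once in the KV store
    # rather than being copied into every training shard.
    return {**row, "behaviour_seq": user_log_store.get(row["user_id"], [])}
```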

Neutralising feature-preprocessing overhead

Preprocessing was historically the bottleneck causing GPU starvation — "client memory pressure and data starvation where the GPU's compute power remains underutilized while waiting for processed features." The fixes:

  • Offload preprocessing from client CPU to remote GPU hosts.
  • Use compact tuple-based feature formats.
  • GPU-native kernels reducing Top-K selection from O(N log N) to O(N) (sketched after this list).
  • Data compression + client-flow restructuring to eliminate thread-pool contention.
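
The post gives no kernel details; as a CPU-side illustration of why selection beats sorting, compare a full sort with NumPy's partition-based selection (the GPU-native version would implement the same idea as a kernel):

```python
import numpy as np

def topk_sorted(scores: np.ndarray, k: int) -> np.ndarray:
    return np.argsort(scores)[-k:][::-1]        # O(N log N): sorts everything

def topk_partitioned(scores: np.ndarray, k: int) -> np.ndarray:
    idx = np.argpartition(scores, -k)[-k:]      # O(N): partial selection only
    return idx[np.argsort(scores[idx])[::-1]]   # O(k log k) to order the winners

scores = np.random.rand(1_000_000)
assert set(topk_sorted(scores, 10)) == set(topk_partitioned(scores, 10))
```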

Trillion-parameter embedding scale

Recsys is driven by sparse categorical features mapped to high-dimensional embedding tables. Table sizing is a genuine tradeoff: oversized tables overfit, while undersized ones force hash collisions. Meta's approach:

  • Allocate embedding hash sizes based on feature sparsity.
  • Prune unused embeddings to maximise learning capacity within memory budgets.
  • Unified embeddings — multiple features share a single embedding table (a toy sketch follows this list).
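
A toy sketch of unified embeddings (table size, dimension, and the salted-hash scheme are assumptions; a real system would use a stable hash rather than Python's process-seeded hash()):

```python
import torch

NUM_ROWS, DIM = 1_000_000, 64
shared_table = torch.nn.Embedding(NUM_ROWS, DIM)  # one table, many features

def lookup(feature_name: str, raw_id: int) -> torch.Tensor:
    # Salt the hash with the feature name so "user_id" 42 and "ad_id" 42
    # land on different rows (up to residual collisions).
    row = hash((feature_name, raw_id)) % NUM_ROWS
    return shared_table(torch.tensor(row))

v_user = lookup("user_id", 42)
v_ad = lookup("ad_id", 42)  # a different row with high probability
```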

When combined embedding tables cross the terabyte boundary, single-GPU memory is exceeded and multi-card sharding takes over — achieving performance parity with single-card setups via hardware-specific communication optimisations.
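
A hedged sketch of row-wise sharding once the table outgrows one GPU (device count, shard sizes, and the modulo placement policy are illustrative; the post's hardware-specific communication optimisations are not modelled here):

```python
import torch

NUM_DEVICES, ROWS_PER_SHARD, DIM = 4, 250_000, 64
shards = [
    torch.nn.Embedding(ROWS_PER_SHARD, DIM)  # .to(f"cuda:{d}") on real hardware
    for d in range(NUM_DEVICES)
]

def sharded_lookup(row: int) -> torch.Tensor:
    # Placement policy: global row r lives on device r % NUM_DEVICES.
    # A real system batches lookups per shard and regroups results per
    # request with collective communication (e.g. all-to-all).
    device, local_row = row % NUM_DEVICES, row // NUM_DEVICES
    return shards[device](torch.tensor(local_row))
```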

Runtime resilience

  • Accelerated model loading — multi-stream downloading + remote caching load trillion-parameter models in under 10 minutes (a toy multi-stream sketch follows this list).
  • Auto-scaling on streaming-multiprocessor utilisation — handles spiky ads traffic without over-provisioning.
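
A minimal sketch of the multi-stream download idea (fetch_range is a hypothetical byte-range fetcher; chunk size and stream count are assumptions, and remote caching is not modelled):

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 256 * 1024 * 1024  # 256 MiB per fetch; an assumed figure

def load_checkpoint(fetch_range, total_bytes: int, streams: int = 16) -> bytes:
    """Fetch a checkpoint as parallel byte-range chunks, not one stream."""
    offsets = range(0, total_bytes, CHUNK)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        chunks = pool.map(
            lambda off: fetch_range(off, min(CHUNK, total_bytes - off)), offsets
        )
    return b"".join(chunks)  # map() preserves chunk order
```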

Deployment + outcomes

Launched on Instagram in Q4 2025. Reported impact:

  • +3% ad conversions for targeted users.
  • +5% click-through rate for targeted users.

See systems/meta-instagram for Meta's Instagram coverage.

Roadmap

Named future axes (not yet shipped):

  • Ultra-low-precision quantisation extending beyond selective FP8.
  • Agentic optimisation frameworks automatically adapting kernel performance to new hardware and model architectures.
  • Near-instantaneous model freshness via incremental in-place weight updates for real-time adaptation.

Caveats

  • Architecture-overview voice: no absolute QPS, fleet size, GPU count, inference p50 / p99, or per-request cost disclosed.
  • "Multiple hardware types" / "heterogeneous hardware" not named; vendor mix (H100 / B200 / MI300X / MTIA) left implicit.
  • +3% / +5% business lift is framed as "for targeted users", not overall fleet; targeting criterion and control composition undisclosed.
  • FP8-selection benchmark metric + cutoff not specified.
  • Pre-existing ads-ranking baseline not quantified in the post, so relative improvement from the shift is qualitative.
