Meta Adaptive Ranking Model

Definition

Meta Adaptive Ranking Model is the LLM-scale ads-ranking serving stack Meta Ads deployed in 2025 to serve model complexity equivalent to O(10 GFLOPs) per token — the range used by top-tier LLMs — under sub-second latency at Meta's request volume. Rather than brute-forcing hardware for per-ad-candidate inference, the system aligns model complexity with each request's context via intelligent request routing, so heavy model capacity is spent where it matters most (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

Three architectural pillars

Meta names the system's foundations in the post:

  1. Inference-efficient model scaling — a request-centric computation model that eliminates per-(user, ad-candidate) redundancy via request-oriented computation sharing and sequence scaling, plus Wukong Turbo, the runtime evolution of Meta Ads' internal Wukong architecture.
  2. Model-system co-design — hardware-aware architectures tuned to accelerator capabilities, with selective FP8 quantisation applied only where layers tolerate it, plus graph- and kernel-level specialisation (operator fusion for shared inputs, Grouped GEMM, horizontal fusion); achieves 35% MFU across multiple hardware types (see concepts/model-flops-utilization). (A toy FP8-selection sketch follows this list.)
  3. Reimagined serving infrastructure — multi-card GPU serving that breaks single-GPU memory limits via multi-card embedding sharding and unified embeddings, reaching O(1T) parameter scale.
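
The post does not disclose the benchmark metric or cutoff behind "only where layers tolerate it" (see Caveats below). Under that caveat, a toy PyTorch sketch of the selection idea, with an assumed weight-error probe and an assumed tolerance, might look like:

```python
import torch  # requires torch >= 2.1 for float8 dtypes

def quantise_selectively(model: torch.nn.Module, tol: float = 1e-2) -> None:
    """Keep FP8 only for layers whose weights survive a round-trip probe.
    The probe metric and tolerance are assumptions, not Meta's criterion."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            ref = module.weight.detach()
            # Cast to FP8 and back to measure worst-case relative error.
            probe = ref.to(torch.float8_e4m3fn).to(ref.dtype)
            err = (probe - ref).abs().max() / ref.abs().max().clamp_min(1e-12)
            if err < tol:
                # Writing dequantised values back simulates FP8's effect;
                # a real deployment keeps FP8 storage and FP8 matmul kernels.
                module.weight.data = probe
```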

The inference trilemma (the design frame)

Meta explicitly frames the problem as a three-way conflict:

  • Latency — "Ads must be chosen and returned with sub-second latency. Scaling ads computation to LLM-scale level and beyond has traditionally been impossible without latency regressions that compromise user experience."
  • Cost — "Brute force scaling by simply adding hardware is economically unsustainable."
  • Complexity — the driver of both; Meta wants "a deeper understanding of people's interests and intent."

See concepts/inference-trilemma-recsys.

Transforming scaling from linear to sub-linear

Traditional ranking processes each user-ad pair independently, which creates massive redundancy. The Adaptive Ranking Model instead computes high-density user signals once per request, not per ad candidate. Shared embeddings are broadcast across candidates directly inside the GPU kernel (in-kernel broadcast), eliminating repeated HBM traffic and transforming the cost curve from linear-in-candidates to sub-linear.
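
A minimal PyTorch sketch of the sharing pattern (the towers, shapes, and dot-product interaction are illustrative assumptions, not Meta's Wukong Turbo code):

```python
import torch

user_tower = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU())
cand_tower = torch.nn.Linear(128, 256)

def rank_request(user_feats: torch.Tensor, cand_feats: torch.Tensor) -> torch.Tensor:
    """user_feats: [512] for one request; cand_feats: [num_candidates, 128]."""
    # Heavy user-side compute runs once per REQUEST, not once per candidate.
    u = user_tower(user_feats)          # [256]
    c = cand_tower(cand_feats)          # [num_candidates, 256]
    # Broadcasting shares `u` across all candidates without materializing
    # num_candidates copies of it (the in-kernel-broadcast idea).
    return (c * u).sum(dim=-1)          # [num_candidates] scores

scores = rank_request(torch.randn(512), torch.randn(1000, 128))
```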

Long user-behaviour sequences use the same principle, processed once per request and shared across candidates. Storage redundancy is eliminated by keeping a centralised, high-efficiency KV store of user logs joined with training data on the fly — not replicated into each training shard.
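
A toy sketch of the join-on-the-fly idea (the dict stand-in, names, and schema are assumptions, not Meta's storage layer):

```python
user_log_store: dict[str, list[dict]] = {}  # stand-in for the central KV store

def log_event(user_id: str, event: dict) -> None:
    user_log_store.setdefault(user_id, []).append(event)

def training_example(row: dict) -> dict:
    # Each stored training row carries only a user_id; the long behaviour
    # sequence is attached at read time, so it exists once in the KV store
    # rather than being copied into every training shard.
    return {**row, "behaviour_seq": user_log_store.get(row["user_id"], [])}
```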

Neutralising feature-preprocessing overhead

Preprocessing was historically the bottleneck causing GPU starvation — "client memory pressure and data starvation where the GPU's compute power remains underutilized while waiting for processed features." The fixes:

  • Offload preprocessing from client CPU to remote GPU hosts.
  • Use compact tuple-based feature formats.
  • GPU-native kernels reducing Top-K selection from O(N log N) to O(N) (sketched after this list).
  • Data compression + client-flow restructuring to eliminate thread-pool contention.
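
The post gives no kernel details; as a CPU-side illustration of why selection beats sorting, compare a full sort with NumPy's partition-based selection (the GPU-native version would implement the same idea as a kernel):

```python
import numpy as np

def topk_sorted(scores: np.ndarray, k: int) -> np.ndarray:
    return np.argsort(scores)[-k:][::-1]        # O(N log N): sorts everything

def topk_partitioned(scores: np.ndarray, k: int) -> np.ndarray:
    idx = np.argpartition(scores, -k)[-k:]      # O(N): partial selection only
    return idx[np.argsort(scores[idx])[::-1]]   # O(k log k) to order the winners

scores = np.random.rand(1_000_000)
assert set(topk_sorted(scores, 10)) == set(topk_partitioned(scores, 10))
```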

Trillion-parameter embedding scale

Recsys is driven by sparse categorical features mapped to high-dimensional embedding tables. Table sizing is a genuine tradeoff: oversized tables overfit, while undersized ones force hash collisions. Meta's approach:

  • Allocate embedding hash sizes based on feature sparsity.
  • Prune unused embeddings to maximise learning capacity within memory budgets.
  • Unified embeddings — multiple features share a single embedding table (a toy sketch follows this list).
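
A toy sketch of unified embeddings (table size, dimension, and the salted-hash scheme are assumptions; a real system would use a stable hash rather than Python's process-seeded hash()):

```python
import torch

NUM_ROWS, DIM = 1_000_000, 64
shared_table = torch.nn.Embedding(NUM_ROWS, DIM)  # one table, many features

def lookup(feature_name: str, raw_id: int) -> torch.Tensor:
    # Salt the hash with the feature name so "user_id" 42 and "ad_id" 42
    # land on different rows (up to residual collisions).
    row = hash((feature_name, raw_id)) % NUM_ROWS
    return shared_table(torch.tensor(row))

v_user = lookup("user_id", 42)
v_ad = lookup("ad_id", 42)  # a different row with high probability
```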

When combined embedding tables cross the terabyte boundary, single-GPU memory is exceeded and multi-card sharding takes over — achieving performance parity with single-card setups via hardware-specific communication optimisations.
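
A hedged sketch of row-wise sharding once the table outgrows one GPU (device count, shard sizes, and the modulo placement policy are illustrative; the post's hardware-specific communication optimisations are not modelled here):

```python
import torch

NUM_DEVICES, ROWS_PER_SHARD, DIM = 4, 250_000, 64
shards = [
    torch.nn.Embedding(ROWS_PER_SHARD, DIM)  # .to(f"cuda:{d}") on real hardware
    for d in range(NUM_DEVICES)
]

def sharded_lookup(row: int) -> torch.Tensor:
    # Placement policy: global row r lives on device r % NUM_DEVICES.
    # A real system batches lookups per shard and regroups results per
    # request with collective communication (e.g. all-to-all).
    device, local_row = row % NUM_DEVICES, row // NUM_DEVICES
    return shards[device](torch.tensor(local_row))
```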

Runtime resilience

  • Accelerated model loading — multi-stream downloading + remote caching load trillion-parameter models in under 10 minutes (a toy multi-stream sketch follows this list).
  • Auto-scaling on streaming-multiprocessor utilisation — handles spiky ads traffic without over-provisioning.
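
A minimal sketch of the multi-stream download idea (fetch_range is a hypothetical byte-range fetcher; chunk size and stream count are assumptions, and remote caching is not modelled):

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 256 * 1024 * 1024  # 256 MiB per fetch; an assumed figure

def load_checkpoint(fetch_range, total_bytes: int, streams: int = 16) -> bytes:
    """Fetch a checkpoint as parallel byte-range chunks, not one stream."""
    offsets = range(0, total_bytes, CHUNK)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        chunks = pool.map(
            lambda off: fetch_range(off, min(CHUNK, total_bytes - off)), offsets
        )
    return b"".join(chunks)  # map() preserves chunk order
```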

Deployment + outcomes

Launched on Instagram in Q4 2025. Reported impact:

  • +3% ad conversions for targeted users.
  • +5% click-through rate for targeted users.

See systems/meta-instagram for Meta's Instagram coverage.

Roadmap

Named future axes (not yet shipped):

  • Ultra-low-precision quantisation extending beyond selective FP8.
  • Agentic optimisation frameworks automatically adapting kernel performance to new hardware and model architectures.
  • Near-instantaneous model freshness via incremental in-place weight updates for real-time adaptation.

Caveats

  • Architecture-overview voice: no absolute QPS, fleet size, GPU count, inference p50 / p99, or per-request cost disclosed.
  • "Multiple hardware types" / "heterogeneous hardware" not named; vendor mix (H100 / B200 / MI300X / MTIA) left implicit.
  • +3% / +5% business lift is framed as "for targeted users", not overall fleet; targeting criterion and control composition undisclosed.
  • FP8-selection benchmark metric + cutoff not specified.
  • Pre-existing ads-ranking baseline not quantified in the post, so relative improvement from the shift is qualitative.
