Request-centric inference architecture¶
Pattern¶
When serving a ranking model over many candidates per request (recommendations, search, ads), shift the unit of inference from (user, candidate) pairs to (request) events:
- Heavy user-context computation runs once per request, not per candidate.
- Shared user embeddings are broadcast across candidates in-kernel, eliminating redundant HBM traffic.
- Long user-behaviour sequences are processed once per request and stored centrally (in a KV store), joined to training data on the fly rather than replicated per training row.
The pattern transforms scaling from linear in candidate count to sub-linear, unblocking LLM-scale model complexity within sub-second latency budgets (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
Problem¶
Traditional recsys inference processes each (user, candidate) pair independently, re-running the full model, including the user-encoding work, for every pair.
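A minimal sketch of this per-pair baseline (heavy_user_tower, light_ad_tower, and rank are hypothetical stand-ins, not from the source; toy arithmetic in place of real models):

```python
def heavy_user_tower(user):
    return sum(user) * 0.5        # stand-in for an expensive user encoder

def light_ad_tower(candidate):
    return candidate * 2.0        # stand-in for a cheap candidate encoder

def rank(user_ctx, ad_ctx):
    return user_ctx + ad_ctx

def score_request_pairwise(user, candidates):
    scores = []
    for candidate in candidates:
        # Recomputed N times per request, although the user never changes:
        user_ctx = heavy_user_tower(user)
        ad_ctx = light_ad_tower(candidate)
        scores.append(rank(user_ctx, ad_ctx))
    return scores

scores = score_request_pairwise([1.0, 2.0], [0.5, 1.5, 2.5])
```

The heavy tower runs once per candidate here; with thousands of candidates per request, that redundant line dominates the cost.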
At LLM-scale complexity, the user-encoding work is the dominant cost. Repeating it N times per request is pure waste — the user is the same across candidates in one request. At sub-second SLO with candidate counts in the thousands, the waste makes LLM-scale ranking intractable.
Meanwhile, long-form user behaviour sequences (the input that most improves ranking quality) are both compute-expensive to process and storage-expensive to replicate into training data.
Solution¶
Restructure the inference pipeline around request as the unit of work:
1. Compute heavy user context once per request¶
# Once per request
user_ctx = heavy_user_tower(user)        # O(1) heavy work per request
# Fan out across candidates
for candidate in candidates:             # N candidates
    ad_ctx = light_ad_tower(candidate)   # O(N) light work
    score = rank(user_ctx, ad_ctx)       # O(N) light work
Heavy work is O(1) per request, not O(N). See concepts/request-oriented-computation-sharing.
2. Broadcast shared embeddings in-kernel¶
The shared user_ctx tensor lives in GPU shared memory / registers for the duration of the candidate fan-out, reused across candidates without repeated HBM loads. This is analogous to the broadcast semantics of batched matmul, applied at request granularity.
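The tensor-level effect can be illustrated with NumPy broadcasting (shapes are arbitrary, purely for illustration): the user context is materialised once and reused across the candidate axis, rather than being tiled N times.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                              # candidates per request, context width

user_ctx = rng.standard_normal(d)        # computed once per request
ad_ctx = rng.standard_normal((N, d))     # one row per candidate

# Broadcast: user_ctx (d,) is reused across all N rows without copying,
# analogous to keeping the shared tensor resident in shared memory/registers
# while the kernel fans out over candidates.
scores = (ad_ctx * user_ctx).sum(axis=1)  # shape (N,)

# Numerically equivalent to explicitly replicating the user context N times,
# which is what per-(user, candidate) inference effectively pays for in HBM:
tiled = np.tile(user_ctx, (N, 1))
scores_tiled = (ad_ctx * tiled).sum(axis=1)
```

The in-kernel version of this reuse is a GPU-systems optimisation; the NumPy sketch only shows the algebraic equivalence.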
3. Store user sequences centrally, join on the fly¶
# Centralised KV store (serving + training)
kv_store[user_id] = user_logs # O(users) storage
# Training time
training_row = join(request, kv_store[user_id])
See concepts/request-oriented-sequence-scaling. Eliminates replication across training rows and serving data stores.
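A back-of-the-envelope comparison of the two storage schemes (all numbers are illustrative assumptions, not figures from the source):

```python
# Illustrative only: storage cost of replicating user sequences into every
# training row versus storing one copy per user in a central KV store.
users = 1_000_000
requests_per_user = 100          # training rows derived per user
seq_len_bytes = 10_000           # serialised behaviour sequence per user

# Replicated: every training row carries its own copy of the sequence,
# i.e. O(requests x len) growth.
replicated = users * requests_per_user * seq_len_bytes

# Centralised: one copy per user, joined at training time by user_id,
# i.e. O(users x len) growth.
centralised = users * seq_len_bytes

savings = replicated / centralised   # equals requests_per_user
```

Under these assumptions the centralised store is 100x smaller; the ratio scales directly with how many training rows each user generates.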
Forces¶
- Candidate count × heavy user tower = unbearable cost at LLM scale. A small-user-tower compromise gives up quality; a per-candidate full-tower approach blows latency and cost.
- Sub-second latency is non-negotiable (UX + ads auction timing).
- Cost efficiency is non-negotiable at Meta-scale QPS.
- Long user sequences materially improve quality and cannot simply be dropped.
Consequences¶
Positive:
- Heavy-tower work drops from O(N) to O(1) per request; only the light candidate tower remains O(N).
- Memory bandwidth pressure drops (shared loads, not per-candidate reloads).
- Storage footprint for user sequences drops from O(requests × candidates × len) to O(users × len).
- Unlocks LLM-scale model complexity (O(10 GFLOPs) per token) within O(100 ms) bounded latency.
Negative / tradeoffs:
- Requires restructuring the ranking pipeline — two-tower separation must be clean so that user-context outputs can be broadcast. Legacy architectures that entangle user + candidate features deep in the model cannot use this pattern without rewriting.
- Centralised KV store is a new operational dependency — must be highly-available, low-latency, consistent across training and serving.
- The kernel-level broadcast requires careful GPU-systems engineering; this is not a pure ML optimisation.
Canonical industrial instance¶
- Meta Adaptive Ranking Model (2025, launched on Instagram Q4 2025) — the post that defines the pattern and demonstrates the "linear to sub-linear" scaling curve bend. Reports +3% conversions, +5% CTR for targeted Instagram users, with model complexity equivalent to top-tier LLMs under O(100 ms) bounded latency. Combined with the Wukong Turbo runtime refinements + model-system co-design + [[patterns/multi-card-sharded-embedding-serving|multi-card embedding sharding]].
Related patterns¶
- patterns/model-system-codesign-ranking — the kernel-level co-design techniques (selective FP8, Grouped GEMM, horizontal fusion) that Meta pairs with request-centric architecture to drive MFU to 35%.
- patterns/multi-card-sharded-embedding-serving — the memory-side sibling; multi-card embedding sharding unblocks the parameter-count corner of the trilemma.
Seen in¶
- 2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).