
Request-centric inference architecture

Pattern

When serving a ranking model over many candidates per request (recommendations, search, ads), shift the unit of inference from (user, candidate) pairs to (request) events:

  • Heavy user-context computation runs once per request, not per candidate.
  • Shared user embeddings are broadcast across candidates in-kernel, eliminating redundant HBM traffic.
  • Long user-behaviour sequences are processed once per request and stored centrally (in a KV store), joined to training data on the fly rather than replicated per training row.

The pattern transforms scaling from linear in candidate count to sub-linear, unblocking LLM-scale model complexity within sub-second latency budgets (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

Problem

Traditional recsys inference processes each (user, candidate) pair independently:

for candidate in candidates:
    score = rank(user, candidate)   # re-encodes the same user every time

At LLM-scale complexity, the user-encoding work is the dominant cost. Repeating it N times per request is pure waste — the user is the same across candidates in one request. At sub-second SLO with candidate counts in the thousands, the waste makes LLM-scale ranking intractable.
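A back-of-the-envelope sketch makes the waste concrete. The FLOP figure echoes the O(10 GFLOPs)-per-token complexity cited later; the candidate count is an illustrative assumption, not a number from the source:

```python
# Hypothetical figures for illustration only.
user_tower_flops = 10e9   # ~10 GFLOPs for one heavy user encoding
n_candidates = 2_000      # candidate counts "in the thousands"

# Pair-wise inference re-runs the user encoding per candidate...
pairwise_flops = user_tower_flops * n_candidates
# ...while request-centric inference pays for it once per request.
request_flops = user_tower_flops

redundancy = pairwise_flops / request_flops   # equals n_candidates
```

The redundancy factor is exactly the candidate count: at thousands of candidates, the heavy encoding is repeated thousands of times for identical input.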

Meanwhile, long-form user behaviour sequences (the input that most improves ranking quality) are both compute-expensive to process and storage-expensive to replicate into training data.

Solution

Restructure the inference pipeline around request as the unit of work:

1. Compute heavy user context once per request

# Once per request
user_ctx = heavy_user_tower(user)  # O(1) per request

# Fan out across candidates
for candidate in candidates:
    ad_ctx = light_ad_tower(candidate)       # O(N)
    score  = rank(user_ctx, ad_ctx)          # O(N)

Heavy work is O(1) per request, not O(N). See concepts/request-oriented-computation-sharing.
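A runnable (if toy) version of the sketch above, with stand-in towers whose names mirror the pseudocode; the tower internals are placeholders, not the real model:

```python
import math

def heavy_user_tower(user_feats):
    # Stand-in for the expensive user encoder; runs once per request.
    return [math.tanh(sum(user_feats) * w) for w in (0.1, 0.2, 0.3)]

def light_ad_tower(cand_feats):
    # Cheap per-candidate encoder; runs once per candidate.
    return [f * 0.5 for f in cand_feats]

def rank(user_ctx, ad_ctx):
    # Late-interaction scorer: a dot product of the two contexts.
    return sum(u * a for u, a in zip(user_ctx, ad_ctx))

def score_request(user_feats, candidates):
    user_ctx = heavy_user_tower(user_feats)      # heavy work: O(1) per request
    return [rank(user_ctx, light_ad_tower(c))    # light work: O(N)
            for c in candidates]

scores = score_request([1.0, 2.0], [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
```

The key property is structural: `heavy_user_tower` appears outside the candidate loop, so its cost is independent of how many candidates fan out under it.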

2. Broadcast shared embeddings in-kernel

The shared user_ctx tensor lives in GPU shared memory / registers for the duration of the candidate fan-out, reused across candidates without repeated HBM loads. This is analogous to the broadcast semantics of batched matmul — applied at request granularity.
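The broadcast semantics can be mirrored at the software level with NumPy; this is an analogy for the reuse pattern, not the in-kernel implementation itself:

```python
import numpy as np

# d = embedding dim, N = candidates per request (illustrative sizes).
d, N = 8, 4
user_ctx = np.random.default_rng(0).standard_normal(d)       # computed once per request
cand_ctx = np.random.default_rng(1).standard_normal((N, d))  # one row per candidate

# Broadcast-style reuse: the single (d,)-shaped user_ctx scores all N
# candidates without materialising N copies of it -- the software
# analogue of keeping it resident in shared memory/registers.
scores = cand_ctx @ user_ctx                                 # shape (N,)

# The wasteful alternative tiles the user context N times in memory.
tiled = np.tile(user_ctx, (N, 1))                            # shape (N, d)
assert np.allclose(scores, (tiled * cand_ctx).sum(axis=1))
```

In the GPU kernel the same idea applies one level down: the tensor is loaded from HBM once and reused from fast on-chip memory across the candidate fan-out.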

3. Store user sequences centrally, join on the fly

# Centralised KV store (serving + training)
kv_store[user_id] = user_logs   # O(users) storage

# Training time
training_row = join(request, kv_store[user_id])

See concepts/request-oriented-sequence-scaling. Eliminates replication across training rows and serving data stores.
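A minimal sketch of the centralised store, using a plain dict as a stand-in for the real KV service; all names here are illustrative, not the production API:

```python
# A plain dict stands in for the centralised KV service.
kv_store = {}

def log_user_event(user_id, event):
    # Each user's behaviour sequence is stored once: O(users) storage.
    kv_store.setdefault(user_id, []).append(event)

def build_training_row(request):
    # The sequence is joined on the fly at training time, rather than
    # being copied into every (request, candidate) training row.
    return {**request, "user_seq": kv_store.get(request["user_id"], [])}

log_user_event("u1", "click:a")
log_user_event("u1", "view:b")
row = build_training_row({"user_id": "u1", "candidate": "ad42"})
```

Training rows stay thin (IDs plus labels); the long sequence lives in one place and is resolved at join time for both training and serving.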

Forces

  • Candidate count × heavy user tower = unbearable cost at LLM scale. A small-user-tower compromise gives up quality; a per-candidate full-tower approach blows latency and cost.
  • Sub-second latency is non-negotiable (UX + ads auction timing).
  • Cost efficiency is non-negotiable at Meta-scale QPS.
  • Long user sequences materially improve quality and cannot simply be dropped.

Consequences

Positive:

  • Per-request cost in candidate count drops from O(N · heavy) to O(heavy) + O(N · light): the heavy user tower runs once, and only the light per-candidate work remains linear.
  • Memory bandwidth pressure drops (shared loads, not per-candidate reloads).
  • Storage footprint for user sequences drops from O(requests × candidates × len) to O(users × len).
  • Unlocks LLM-scale model complexity (O(10 GFLOPs) per token) within O(100 ms) bounded latency.
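The storage saving can be made concrete with illustrative numbers (assumptions for the sketch, not figures from the source):

```python
# Hypothetical scale parameters.
requests = 1_000_000       # ranking requests in some window
candidates = 1_000         # candidates scored per request
users = 100_000            # distinct users in that window
seq_bytes = 100_000        # bytes for one long behaviour sequence

# Replicating the sequence into every (request, candidate) training row:
replicated = requests * candidates * seq_bytes   # O(requests x candidates x len)
# Storing it once per user in the central KV store:
central = users * seq_bytes                      # O(users x len)

savings = replicated / central
```

Under these assumptions the central store is four orders of magnitude smaller; the ratio grows with both request volume and candidate count, since neither appears in the O(users × len) term.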

Negative / tradeoffs:

  • Requires restructuring the ranking pipeline — two-tower separation must be clean so that user-context outputs can be broadcast. Legacy architectures that entangle user + candidate features deep in the model cannot use this pattern without rewriting.
  • Centralised KV store is a new operational dependency — must be highly-available, low-latency, consistent across training and serving.
  • The kernel-level broadcast requires careful GPU-systems engineering; this is not a pure ML optimisation.

Canonical industrial instance

  • Meta Adaptive Ranking Model (2025, launched on Instagram Q4 2025) — the post that defines the pattern and demonstrates the "linear to sub-linear" scaling curve bend. Reports +3% conversions, +5% CTR for targeted Instagram users, with model complexity equivalent to top-tier LLMs under O(100 ms) bounded latency. Combined with the Wukong Turbo runtime refinements + model-system co-design + [[patterns/multi-card-sharded-embedding-serving|multi-card embedding sharding]].
