Request-oriented computation sharing

Definition

Request-oriented computation sharing is the architectural shift in recsys serving from per-(user, ad-candidate) independent inference to per-request shared computation. Heavy user-context computation runs once per request and its outputs are broadcast to every ad candidate, rather than being repeated independently for each (user, candidate) pair. The shift transforms scaling costs from linear in candidate count to sub-linear (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

The redundancy it eliminates

In a traditional recsys forward pass:

for candidate_ad in candidates:               # N candidates per request
    user_features = encode_user(user)         # heavy, redundant
    ad_features   = encode_ad(candidate_ad)   # lightweight
    score         = rank(user_features, ad_features)

The user-encoding tower runs N times — pure redundancy, since the user is the same for all candidates in one request.

Request-oriented computation sharing flips the structure:

# Once per request
user_features = encode_user(user)          # heavy, O(1) per request

# Fanned out across candidates
for candidate_ad in candidates:
    ad_features = encode_ad(candidate_ad)  # still O(N), but lightweight
    score       = rank(user_features, ad_features)

The heavy tower's cost is amortised across candidates; candidate-dependent work stays O(N). Total cost collapses from O(N) × heavy_user_work to O(1) × heavy_user_work + O(N) × light_candidate_work, which is sub-linear in the candidate count.
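A back-of-the-envelope sketch of the two cost models. The per-call unit costs below are hypothetical, chosen only to make the heavy/light asymmetry visible:

```python
# Illustrative cost model; unit costs are hypothetical, not from the post.
HEAVY_USER_WORK = 10_000  # cost units for one user-tower forward pass
LIGHT_AD_WORK = 10        # cost units per candidate (ad encoding + ranking)

def per_candidate_cost(n_candidates):
    """Traditional serving: the user tower runs once per candidate."""
    return n_candidates * (HEAVY_USER_WORK + LIGHT_AD_WORK)

def shared_cost(n_candidates):
    """Request-oriented sharing: the user tower runs once per request."""
    return HEAVY_USER_WORK + n_candidates * LIGHT_AD_WORK

for n in (100, 1_000):
    print(n, per_candidate_cost(n), shared_cost(n))
```

Doubling the candidate count doubles the traditional cost but adds only the light per-candidate term under sharing, which is the sub-linear scaling the post describes.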

In-kernel broadcast

The Adaptive Ranking Model post names In-Kernel Broadcast as the GPU-kernel-level realisation of request-oriented sharing: "Request-Oriented Computation Sharing and In-Kernel Broadcast optimization, which shares request-level embeddings across ad candidates directly within the GPU kernel, transforms scaling costs from linear to sub-linear while significantly reducing memory bandwidth pressure."

The pattern is structurally the same as batched matmul broadcasting: shared tensors loaded once into GPU shared memory / registers, reused across the ranking operations for every candidate in the batch — avoiding repeated HBM reads that would blow out memory bandwidth.
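The in-kernel details are not public; as a host-side analogy only, NumPy broadcasting exhibits the same load-once, reuse-N-times structure (dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # embedding dimension (illustrative)
n = 512  # candidates in one request

user_emb = rng.standard_normal((1, d))  # computed once per request
ad_embs = rng.standard_normal((n, d))   # one row per candidate

# The (1, d) user embedding is broadcast across all n candidate rows —
# conceptually, the shared tensor is materialised once and reused per
# candidate, rather than re-read for each of the n rows.
scores = (user_emb * ad_embs).sum(axis=1)
assert scores.shape == (n,)
```

In a real kernel the payoff comes from keeping the shared tensor in shared memory/registers instead of re-reading HBM; the broadcasting semantics are the same.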

Why it particularly pays off at LLM scale

At LLM-scale complexity (O(10 GFLOPs) per token), the user tower is the dominant cost. Running it per-candidate is what drove the traditional approach to either:

  • Drop the complexity — use a small user tower (giving up quality).
  • Drop the candidate count — use coarser retrieval (reducing recall).
  • Blow the latency budget — serve slowly (bad UX).

Request-oriented sharing is the path that keeps LLM-scale user modelling affordable under the sub-second budget. It is the architectural response that makes the inference trilemma tractable.

Implications for memory bandwidth

A secondary but significant benefit: memory-bandwidth pressure drops because shared embeddings are loaded into SRAM / registers once per request, not once per candidate. This is the same principle that makes memory-bound inference fixes pay off — the arithmetic intensity per byte loaded goes up.
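A toy accounting of that effect, counting only reads of the shared embedding (all numbers are hypothetical):

```python
# Illustrative arithmetic-intensity accounting; numbers are hypothetical.
BYTES_PER_EMB = 64 * 4        # one 64-dim fp32 shared embedding
FLOPS_PER_CANDIDATE = 2 * 64  # multiply-add dot product per candidate

def arithmetic_intensity(n_candidates, loads):
    """FLOPs performed per byte of shared embedding read from HBM."""
    flops = n_candidates * FLOPS_PER_CANDIDATE
    bytes_read = loads * BYTES_PER_EMB
    return flops / bytes_read

# Reloaded per candidate vs. loaded once per request:
per_candidate = arithmetic_intensity(512, loads=512)
shared = arithmetic_intensity(512, loads=1)
```

Loading the shared tensor once instead of N times multiplies the arithmetic intensity by N, which is exactly what moves a memory-bound kernel toward compute-bound.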

Contrast with standard two-tower recsys

Classical two-tower retrieval (dual-encoder) precomputes ad embeddings offline and runs the user tower online, leaving only a dot product at query time. Request-oriented computation sharing extends this principle up the ranking stack: not just the retrieval dot-product, but the full ranking forward pass shares user-context computation across candidates within a single request, not across requests over time.
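For contrast, a minimal sketch of the classical two-tower serving path, with a hypothetical catalog and embedding size:

```python
import numpy as np

rng = np.random.default_rng(1)
d, catalog = 32, 10_000  # illustrative sizes

# Offline: ad embeddings precomputed once and stored in a table.
ad_table = rng.standard_normal((catalog, d)).astype(np.float32)

def retrieve_top_k(user_emb, k=100):
    """Online: one user-tower pass, then a dot product against the table."""
    scores = ad_table @ user_emb
    return np.argpartition(-scores, k)[:k]  # indices of the k highest scores

top = retrieve_top_k(rng.standard_normal(d).astype(np.float32))
```

Here only the final dot product is shared-per-request; request-oriented computation sharing applies the same amortisation to the full ranking forward pass, not just this retrieval step.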

Relationship to request-oriented sequence scaling

Request-oriented sequence scaling is the storage-layer sibling: long-form user behaviour sequences are processed once per request and stored in a centralised KV store rather than replicated into training data. Together the two concepts define Meta Adaptive Ranking Model's "Request-Oriented Optimization" pillar.
