Request-oriented computation sharing¶
Definition¶
Request-oriented computation sharing is the architectural shift in recsys serving from independent per-(user, ad-candidate) inference to shared per-request computation. Heavy user-context computation runs once per request and its outputs are broadcast to every ad candidate, rather than being repeated for each (user, candidate) pair independently. The shift transforms scaling costs from linear in candidate count to sub-linear (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
The redundancy it eliminates¶
In a traditional recsys forward pass:
for candidate_ad in candidates:              # N candidates per request
    user_features = encode_user(user)        # heavy, redundant
    ad_features = encode_ad(candidate_ad)    # lightweight
    score = rank(user_features, ad_features)
The user-encoding tower runs N times — pure redundancy, since the user is the same for all candidates in one request.
Request-oriented computation sharing flips the structure:
# Once per request
user_features = encode_user(user)            # heavy, O(1) per request

# Fanned out across candidates
for candidate_ad in candidates:
    ad_features = encode_ad(candidate_ad)    # still O(N)
    score = rank(user_features, ad_features)
The heavy tower's cost is amortised across candidates; candidate-dependent work stays O(N). Total cost collapses from O(N) × heavy_user_work to O(1) × heavy_user_work + O(N) × light_candidate_work — sub-linear in the candidate count.
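The restructuring above can be sketched end-to-end in plain Python. The towers here are stand-ins (hypothetical shapes and names, not Meta's actual model); the point is the invocation count of the heavy tower under each structure.

```python
import numpy as np

D_USER, D_AD = 64, 16
rng = np.random.default_rng(0)
heavy_calls = 0  # counts invocations of the heavy user tower

def encode_user(user):
    """Stand-in for the heavy user tower (hypothetical)."""
    global heavy_calls
    heavy_calls += 1
    return rng.standard_normal(D_USER)

def encode_ad(ad):
    """Stand-in for the lightweight per-candidate encoder."""
    return rng.standard_normal(D_AD)

def rank(user_feat, ad_feat):
    """Toy ranking head: project and dot-product."""
    return float(user_feat[:D_AD] @ ad_feat)

user = "u1"
candidates = list(range(1000))

# Naive structure: heavy tower inside the candidate loop -> N heavy calls
heavy_calls = 0
naive_scores = [rank(encode_user(user), encode_ad(ad)) for ad in candidates]
assert heavy_calls == len(candidates)

# Request-oriented sharing: heavy tower once, output reused per candidate
heavy_calls = 0
user_feat = encode_user(user)
shared_scores = [rank(user_feat, encode_ad(ad)) for ad in candidates]
assert heavy_calls == 1
```

The candidate loop is unchanged; only the placement of the heavy call moves, which is why candidate-dependent cost stays O(N) while the heavy term drops to O(1).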
In-kernel broadcast¶
The Adaptive Ranking Model post names In-Kernel Broadcast as the GPU-kernel-level realisation of request-oriented sharing: "Request-Oriented Computation Sharing and In-Kernel Broadcast optimization, which shares request-level embeddings across ad candidates directly within the GPU kernel, transforms scaling costs from linear to sub-linear while significantly reducing memory bandwidth pressure."
The pattern is structurally the same as batched matmul broadcasting: shared tensors loaded once into GPU shared memory / registers, reused across the ranking operations for every candidate in the batch — avoiding repeated HBM reads that would blow out memory bandwidth.
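As a CPU-side analogy of that broadcast pattern (NumPy here, not the actual GPU kernel): scoring N candidates against one shared request-level embedding is a single matvec in which the user vector is loaded once and reused, rather than N separate dot-products that each conceptually re-fetch it. Sizes below are illustrative.

```python
import numpy as np

N, D = 4096, 256
rng = np.random.default_rng(1)
user_emb = rng.standard_normal(D)        # shared request-level embedding
ad_embs = rng.standard_normal((N, D))    # one row per ad candidate

# Per-candidate loop: user_emb is conceptually re-read for every candidate
loop_scores = np.array([ad_embs[i] @ user_emb for i in range(N)])

# Broadcast form: one matvec; user_emb is fetched once and reused across rows
broadcast_scores = ad_embs @ user_emb

assert np.allclose(loop_scores, broadcast_scores)
```

On a GPU the same reuse happens inside the kernel: the shared tensor sits in shared memory or registers while threads iterate over candidates, instead of being re-read from HBM per candidate.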
Why it particularly pays off at LLM scale¶
At LLM-scale complexity (O(10) GFLOPs per token), the user tower is the dominant cost. Running it per candidate forced the traditional approach into one of three compromises:
- Drop the complexity — use a small user tower (giving up quality).
- Drop the candidate count — use coarser retrieval (reducing recall).
- Blow the latency budget — serve slowly (bad UX).
Request-oriented sharing is the path that keeps LLM-scale user modelling affordable under the sub-second budget. It is the architectural response that makes the inference trilemma tractable.
Implications for memory bandwidth¶
A secondary but significant benefit: memory-bandwidth pressure drops because shared embeddings are loaded into SRAM / registers once per request, not once per candidate. This is the same principle behind most memory-bound inference fixes — arithmetic intensity (useful FLOPs per byte loaded) goes up.
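A back-of-envelope sketch of that bandwidth effect, with entirely hypothetical sizes (the source gives no concrete numbers): counting bytes moved per request under each loading scheme.

```python
# Illustrative only — hypothetical sizes, not Meta's numbers
N = 1000                  # candidates per request
user_bytes = 64 * 1024    # shared request-level embeddings / user state
ad_bytes = 256            # per-candidate features

# Reload the user state for every candidate
per_candidate_loads = N * (user_bytes + ad_bytes)

# Load the user state once, then only per-candidate features
shared_loads = user_bytes + N * ad_bytes

reduction = per_candidate_loads / shared_loads
print(f"bandwidth reduction: ~{reduction:.0f}x")
```

With these made-up sizes the heavy term dominates, so sharing it cuts bytes moved by roughly two orders of magnitude; the exact factor depends entirely on the ratio of shared to per-candidate state.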
Contrast with standard two-tower recsys¶
Classical two-tower retrieval (dual-encoder) has always precomputed ad embeddings offline and computed the user tower online — a dot-product at query time. Request-oriented computation sharing extends this principle up the ranking stack: not just the retrieval dot-product, but the full ranking forward pass shares user-context computation across candidates within a single request, not across requests over time.
Relationship to request-oriented sequence scaling¶
Request-oriented sequence scaling is the storage-layer sibling: long-form user behaviour sequences are processed once per request and stored in a centralised KV store rather than replicated into training data. Together the two concepts define Meta Adaptive Ranking Model's "Request-Oriented Optimization" pillar.
Seen in¶
- 2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source; names In-Kernel Broadcast as the GPU-kernel-level realisation; attributes the "linear to sub-linear" scaling-curve bend to the combined effect of request-oriented sharing + in-kernel broadcast (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).