CONCEPT Cited by 1 source
Request-level embedding broadcast
Definition
Request-level embedding broadcast is a serving-side optimisation for ranking / CTR models where a heavy per-entity embedding (typically the user embedding) is fetched once per unique entity per batch and then broadcast to the original per-request layout before model inference.
Structurally:
Before:
batch = [ (user_A, ad_1),
          (user_A, ad_2),
          (user_A, ad_3),
          (user_B, ad_4),
          (user_B, ad_5) ]
5 user embedding lookups

After:
unique_users = { A, B }
embeddings = { A: emb_A, B: emb_B } ← 2 lookups
broadcast back to batch layout:
[ emb_A, emb_A, emb_A, emb_B, emb_B ]
Model inputs / outputs unchanged
The entity to broadcast is typically the user, because (a) in candidate scoring, the same user embedding is needed for every candidate in the batch, and (b) user embeddings are often the largest single lookup in the feature set. The ratio of total lookups to unique users (candidates per user per batch) is the amplification factor: the number of lookups that each single fetch replaces.
The broadcast mechanic
- Deduplicate: identify the unique entity keys in the batch and record, for each row, which unique key it maps to.
- Fetch unique: look up the embedding table once per unique key.
- Broadcast: expand the unique-entity embeddings back to the original per-request layout by indexing them with the per-row mapping (a gather operation).
- Keep model inputs and outputs unchanged: the downstream model sees the same per-request tensor shape it expected before the optimisation.
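The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Pinterest's implementation: `EMBEDDING_TABLE`, `fetch_embedding`, and `broadcast_user_embeddings` are hypothetical names, and a production system would presumably run the equivalent gather on tensors rather than lists.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a heavy embedding table (assumption for illustration).
EMBEDDING_TABLE: Dict[str, List[float]] = {
    "user_A": [0.1, 0.2],
    "user_B": [0.3, 0.4],
}

lookup_count = 0  # counts table reads so the saving is visible


def fetch_embedding(user_id: str) -> List[float]:
    """Simulate one expensive embedding lookup."""
    global lookup_count
    lookup_count += 1
    return EMBEDDING_TABLE[user_id]


def broadcast_user_embeddings(batch: List[Tuple[str, str]]) -> List[List[float]]:
    """Return one user embedding per row, fetching each unique user once."""
    # 1. Deduplicate: give each unique user a slot and remember each row's slot.
    slot_of: Dict[str, int] = {}
    row_to_slot: List[int] = []
    for user_id, _ad_id in batch:
        row_to_slot.append(slot_of.setdefault(user_id, len(slot_of)))

    # 2. Fetch unique: one lookup per unique user (dicts keep insertion order,
    #    so position i in this list corresponds to slot i).
    unique_embeddings = [fetch_embedding(user_id) for user_id in slot_of]

    # 3. Broadcast: gather by slot index back to the per-row layout.
    return [unique_embeddings[slot] for slot in row_to_slot]


batch = [("user_A", "ad_1"), ("user_A", "ad_2"), ("user_A", "ad_3"),
         ("user_B", "ad_4"), ("user_B", "ad_5")]
per_row = broadcast_user_embeddings(batch)
```

The output has one embedding per input row, in the original row order, so whatever consumes it does not need to know that deduplication happened.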
Why it works
- Candidate-scoring workloads naturally have high entity reuse. Scoring N candidate ads for one user costs N lookups of the same user embedding without the optimisation and 1 lookup with it. For a batch spanning M users with N candidates each, M × N lookups collapse to M, so the reuse ratio is N.
- Embedding lookups are memory-bound on GPU. Fewer lookups → less HBM pressure → better overall throughput, not just fewer ops.
- Transparent to the model. The optimisation is a serving-layer rewrite: no model-architecture change and no retraining. The model receives the same per-request tensors it would have received from per-request lookups, so its outputs are unchanged.
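A back-of-the-envelope calculation makes the reuse argument concrete. The sketch below counts embedding-table reads only; the function name `lookup_stats`, the embedding dimension of 256, and the 4-byte values are illustrative assumptions, not figures from the source. The broadcast step still materialises one embedding per row, so the saving applies to table reads, not to the size of the model input.

```python
def lookup_stats(num_users: int, candidates_per_user: int,
                 dim: int = 256, bytes_per_value: int = 4) -> dict:
    """Compare embedding-table reads with and without deduplication.

    dim and bytes_per_value are assumed values chosen for illustration.
    """
    rows = num_users * candidates_per_user          # M * N rows in the batch
    bytes_per_embedding = dim * bytes_per_value
    return {
        "lookups_before": rows,                     # one lookup per row
        "lookups_after": num_users,                 # one lookup per unique user
        "reuse_ratio": rows / num_users,            # equals N
        "bytes_read_before": rows * bytes_per_embedding,
        "bytes_read_after": num_users * bytes_per_embedding,
    }


stats = lookup_stats(num_users=8, candidates_per_user=50)
```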
The failure mode
Request-level broadcasting assumes a bounded number of unique entities per batch. If a batch has more unique users than the implementation's capacity, the request fails (or the optimisation has to fall back to per-request lookup, losing the win).
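The two behaviours described here, failing the request or falling back to per-request lookup, can be sketched as a guard in front of the deduplicated path. Everything named below (`MAX_UNIQUE_USERS`, `UniqueUserCapExceeded`, `gather_with_cap`, the `fallback` flag) is hypothetical; the source does not describe how the cap is enforced.

```python
from typing import Callable, List

MAX_UNIQUE_USERS = 4  # hypothetical cap; the real value is not disclosed


class UniqueUserCapExceeded(Exception):
    """Raised when a batch has more unique users than the cap allows."""


def gather_with_cap(user_ids: List[str],
                    fetch: Callable[[str], List[float]],
                    max_unique: int = MAX_UNIQUE_USERS,
                    fallback: bool = False) -> List[List[float]]:
    """Return one embedding per row, enforcing a unique-user cap."""
    unique = list(dict.fromkeys(user_ids))  # order-preserving deduplication
    if len(unique) > max_unique:
        if not fallback:
            # Fail the request, as in the trade-off described above.
            raise UniqueUserCapExceeded(
                f"{len(unique)} unique users exceeds cap of {max_unique}")
        # Degraded mode: per-row lookup, which gives up the saving.
        return [fetch(user_id) for user_id in user_ids]
    # Normal path: one fetch per unique user, then gather per row.
    cache = {user_id: fetch(user_id) for user_id in unique}
    return [cache[user_id] for user_id in user_ids]
```

Whether to reject or fall back is a design choice: rejecting keeps latency predictable, while falling back keeps the request alive at per-row lookup cost.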
Pinterest's operational safety mechanism (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):
"The main trade-off is an upper bound on the number of unique users per batch; if exceeded, the request can fail, so we used the tested unique user number to keep the system reliable."
The tested-unique-user cap is tuned empirically: pick a number that covers near-100% of production batches in steady state, and reject batches that exceed it.
Generalisations
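One plausible way to pick such a number is to log the unique-user count of each production batch and take the smallest cap that covers a target fraction of them. The source does not say how Pinterest chose its value, so the function `cap_for_coverage`, the coverage target, and the sample counts below are all assumptions for illustration.

```python
import math
from typing import List


def cap_for_coverage(unique_counts: List[int], coverage: float = 0.999) -> int:
    """Smallest cap such that at least `coverage` of observed batches fit."""
    if not unique_counts:
        raise ValueError("need at least one observed batch")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(unique_counts)
    # Number of observed batches that must fit under the cap.
    must_fit = math.ceil(coverage * len(ordered))
    return ordered[must_fit - 1]


# Hypothetical unique-user counts observed over ten batches.
observed = [3, 5, 2, 8, 4, 4, 6, 40, 5, 3]
cap = cap_for_coverage(observed, coverage=0.9)
```

A cap chosen this way still has to be load-tested, since the quoted trade-off refers to a tested number rather than an observed one.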
The same pattern applies beyond user embeddings:
- Per-item embedding broadcast — in two-tower retrieval, item embeddings are often reused across users in a batch.
- Per-query embedding broadcast — in search ranking, the same query embedding is needed for every candidate result.
- Per-context embedding broadcast — location, device, session embeddings are shared across candidates in a request.
The key criterion: entity is shared across many rows in the batch, and the per-lookup cost is high (heavy embedding, HBM-bound).
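The first half of that criterion, how widely an entity is shared across rows, can be measured directly on a sample batch. The sketch below computes rows per unique value for each field; the field names and sample rows are made up for illustration, and lookup cost (the second half of the criterion) is not modelled.

```python
from typing import Dict, List


def reuse_ratio_by_field(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Rows per unique value for each field; higher means more lookups saved."""
    if not rows:
        return {}
    return {
        field: len(rows) / len({row[field] for row in rows})
        for field in rows[0]
    }


# Hypothetical batch: every row carries a user, query, device and item key.
rows = [
    {"user": "A", "query": "shoes", "device": "ios", "item": "ad_1"},
    {"user": "A", "query": "shoes", "device": "ios", "item": "ad_2"},
    {"user": "A", "query": "shoes", "device": "ios", "item": "ad_3"},
    {"user": "B", "query": "hats", "device": "ios", "item": "ad_4"},
    {"user": "B", "query": "hats", "device": "ios", "item": "ad_5"},
]
ratios = reuse_ratio_by_field(rows)
```

A field whose ratio is 1.0 has no reuse in the batch, so broadcasting it would add overhead without removing any lookups.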
Caveats
- Pinterest doesn't disclose the tested unique-user number, the batch-layout details, or whether the broadcast is GPU-native or host-side scatter.
- Latency win magnitude not disclosed — only the qualitative "reduced serving latency" claim.
- Cap-based failure is a real operational tax: batches exceeding the cap must be rejected or fall back to per-request lookup, and either outcome is a degraded-mode request.
- Inapplicable without reuse: for single-candidate requests (a reuse ratio of 1), the optimisation saves no lookups and adds a small deduplication overhead.
Seen in
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical wiki instance: fetch each user embedding once per batch, broadcast to original per-request layout, tested-unique-user-cap as safety mechanism.
Related
- patterns/request-level-user-embedding-broadcast — the pattern version of this concept.
- systems/pinterest-ads-engagement-model
- companies/pinterest