PATTERN
Request-level user-embedding broadcast¶
Pattern¶
On the serving side of a candidate-scoring ML model, deduplicate per-user embedding lookups at the batch level: fetch each unique user's embedding once, then broadcast the embedding back to the original per-candidate request layout. Model inputs and outputs remain unchanged; only the embedding-table lookup count drops.
Problem¶
Ads / recsys / search ranking workloads often have the following batch shape:
batch = [ (user_A, candidate_1),
          (user_A, candidate_2),
          (user_A, candidate_3),
          (user_B, candidate_4),
          (user_B, candidate_5) ]
Naive serving performs one user-embedding lookup per row — here 5 lookups for 2 unique users. At Pinterest-scale serving with:
- Large user-embedding tables (potentially billions of users × wide embeddings),
- Heavy per-lookup cost (HBM-bound, expensive index gather),
- High per-batch reuse (many candidates per user per request),
the redundant lookups dominate serving latency and waste HBM bandwidth. The user embedding is a natural deduplication candidate: it's the same for all candidates scored for one user in one request.
Solution¶
Four-phase transformation of the batch:
Phase 1: Deduplicate.
unique_users = unique_keys(batch) → { A, B }
Phase 2: Fetch unique.
emb_A = embedding_table[A] ← one lookup
emb_B = embedding_table[B] ← one lookup
Phase 3: Broadcast.
per_row_embeddings = scatter/index(
    [emb_A, emb_B],
    row_to_user_index = [0, 0, 0, 1, 1]
)
# → [emb_A, emb_A, emb_A, emb_B, emb_B]
Phase 4: Pass to model unchanged.
model_output = model(per_row_embeddings, candidate_features)
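The four phases above can be sketched in NumPy. This is a minimal illustration, not Pinterest's implementation: the dict-backed `embedding_table` and the function name are assumptions, and `np.unique(..., return_inverse=True)` does Phase 1 and builds the Phase 3 index map in one call.

```python
import numpy as np

def broadcast_user_embeddings(user_ids, embedding_table):
    """Fetch each unique user's embedding once, then broadcast it
    back to the original per-candidate row layout."""
    # Phase 1: deduplicate; `row_to_user_index` maps each row to its
    # position in the unique-user set (np.unique sorts the ids, which
    # changes unique order but not the per-row result).
    unique_users, row_to_user_index = np.unique(user_ids, return_inverse=True)
    # Phase 2: one embedding-table lookup per unique user.
    unique_embeddings = np.stack([embedding_table[u] for u in unique_users])
    # Phase 3: broadcast via an index gather back to per-row layout.
    return unique_embeddings[row_to_user_index]

# Toy table (hypothetical): user id -> 4-dim embedding.
table = {"A": np.zeros(4), "B": np.ones(4)}
batch_users = ["A", "A", "A", "B", "B"]
per_row = broadcast_user_embeddings(batch_users, table)
# 2 table lookups instead of 5; one embedding row per candidate row.
```

The model then consumes `per_row` exactly as it would the naive per-row lookup result, which is what keeps the rewrite invisible to the model.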
The model sees exactly the same input/output shape as before the optimisation — the broadcast is a serving-layer rewrite that's invisible to the model weights, training pipeline, and downstream consumers.
Canonical wiki reference¶
Pinterest's unified ads engagement model applies this pattern (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):
"We reduced redundant embedding table look up work with request-level broadcasting. Instead of repeating heavy user embedding lookups for every candidate/request in a batch, we fetch embeddings once per unique user and then broadcast them back to the original request layout, keeping model inputs and outputs unchanged. The main trade-off is an upper bound on the number of unique users per batch; if exceeded, the request can fail, so we used the tested unique user number to keep the system reliable."
Operational safety mechanism¶
The per-batch unique-user cap is the critical operational constraint:
- Implementations need pre-allocated buffer capacity for the deduplicated embeddings (lookup result buffers, scatter-index tensors).
- Batches that exceed the pre-allocated capacity must either fail (Pinterest's choice) or fall back to per-request lookup (slower).
- Pinterest empirically tuned the cap: choose a value that covers near-100% of production batches in steady state and reject the rest.
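The cap check can be sketched as a guard in front of the deduplicated path. The cap value, function name, and dict-backed table below are assumptions; note that this sketch takes the fall-back-to-per-row-lookup option, whereas Pinterest fails the request instead.

```python
import numpy as np

# Hypothetical cap; Pinterest does not disclose the real number.
MAX_UNIQUE_USERS = 64

def lookup_with_cap(user_ids, embedding_table, cap=MAX_UNIQUE_USERS):
    """Deduplicated lookup guarded by a unique-user cap, with a
    per-row-lookup fallback when the batch exceeds capacity."""
    unique_users, row_to_user_index = np.unique(user_ids, return_inverse=True)
    if len(unique_users) > cap:
        # Over the pre-allocated capacity: fall back to one lookup per
        # row (slower but safe). Pinterest rejects the batch here instead.
        return np.stack([embedding_table[u] for u in user_ids])
    unique_emb = np.stack([embedding_table[u] for u in unique_users])
    return unique_emb[row_to_user_index]

# Toy table for illustration; both paths must produce identical output.
table = {"A": np.zeros(4), "B": np.ones(4)}
deduped = lookup_with_cap(["A", "A", "B"], table, cap=8)
fallback = lookup_with_cap(["A", "A", "B"], table, cap=1)
```

Whichever overflow policy is chosen, the two paths must be bit-identical so that cap overflow never changes model output, only latency or availability.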
When to apply¶
- Candidate-scoring workloads where the same entity (user, query, context) is scored against many candidates in a batch.
- Heavy per-lookup cost (large embedding tables, HBM-bound) where amortisation matters.
- Predictable batch structure with a bounded max-unique-entities per batch.
When NOT to apply¶
- Single-candidate scoring (fan-in = 1) — no reuse to amortise; optimisation is pure overhead.
- Highly variable batch structure — unpredictable unique-entity counts can trigger frequent failures.
- Small embedding tables where the lookup cost is negligible — optimisation not worth the engineering.
- Workloads where entities are not shared across batch rows (e.g. one user per query, one query per batch).
Generalisations¶
- Per-item embedding broadcast in two-tower retrieval — item embeddings reused across users.
- Per-query embedding broadcast in search ranking — query embedding reused across candidate results.
- Per-context embedding broadcast — location, device, session contexts reused across candidates.
Any time a shared entity is reused across many rows of a batch and per-lookup cost is high, the pattern applies.
Caveats¶
- Pinterest doesn't disclose the tested-unique-user cap number, the implementation language / framework, or the failure-mode semantics (retry vs hard-fail).
- Win magnitude not disclosed. "Reduced serving latency" — qualitative only.
- Cap-rejected batches represent degraded availability in tail conditions; tail latency of this pattern is worse than steady-state latency.
- Distinct from per-batch attention caching (KV cache in LLM serving) — the broadcast is purely about deduplication of static embedding lookups, not dynamic attention state.
- Potential correctness pitfall. The broadcast assumes the model consumes the user embedding as a read-only feature — any model that modifies the user embedding per-candidate (rare but possible in attention-based user-candidate interaction) would break.
Related concepts / patterns¶
- concepts/request-level-embedding-broadcast — the concept version.
- systems/pinterest-ads-engagement-model — canonical wiki instance.
- companies/pinterest
Seen in¶
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: fetch each user embedding once per batch, broadcast to original per-request layout, tested-unique-user-cap as safety mechanism.