
CONCEPT

Request-level embedding broadcast

Definition

Request-level embedding broadcast is a serving-side optimisation for ranking / CTR models where a heavy per-entity embedding (typically the user embedding) is fetched once per unique entity per batch and then broadcast to the original per-request layout before model inference.

Structurally:

  Before:                              After:
  ───────                              ──────
  batch = [ (user_A, ad_1),            unique_users = { A, B }
            (user_A, ad_2),            embeddings = { A: emb_A, B: emb_B }    ← 2 lookups
            (user_A, ad_3),
            (user_B, ad_4),            broadcast back to batch layout:
            (user_B, ad_5) ]            [ emb_A, emb_A, emb_A, emb_B, emb_B ]

  5 user embedding lookups             Model inputs / outputs unchanged

The entity to broadcast is typically the user, because: (a) in candidate scoring, the same user embedding is needed for every candidate in the batch; (b) user embeddings are often the largest single lookup in the feature set; (c) the ratio of total lookups to unique users is the amplification factor (candidates per user per batch).

The broadcast mechanic

  • Deduplicate — identify unique entity keys in the batch.
  • Fetch unique — look up the embedding table once per unique key.
  • Broadcast — expand the unique-entity embeddings back to the original per-request layout using indexing / scatter operations.
  • Keep model inputs and outputs unchanged — the downstream model sees the same per-request tensor shape it expected before the optimisation.
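The four steps above map directly onto a gather-style indexing operation. A minimal numpy sketch, with illustrative user ids and a toy in-process dict standing in for the real embedding table or service:

```python
import numpy as np

# Hypothetical batch: one row per (user, candidate) pair, as in the
# diagram above. User keys repeat across candidates.
user_ids = np.array([101, 101, 101, 202, 202])   # A, A, A, B, B

# Toy embedding table: user id -> 4-dim embedding (stand-in for the
# real lookup, which is the expensive step being deduplicated).
table = {101: np.full(4, 0.1), 202: np.full(4, 0.2)}

# 1. Deduplicate: unique keys, plus the inverse index that maps each
#    batch row back to its position in the unique-key list.
unique_ids, inverse = np.unique(user_ids, return_inverse=True)

# 2. Fetch once per unique key: 2 lookups instead of 5.
unique_embs = np.stack([table[u] for u in unique_ids])

# 3. Broadcast back to the per-request layout via indexing (a gather).
batch_embs = unique_embs[inverse]

# 4. Downstream model input shape is unchanged: (batch_size, emb_dim).
assert batch_embs.shape == (len(user_ids), 4)
```

`return_inverse=True` gives the scatter indices for free, so the broadcast is a single vectorised gather rather than a per-row copy loop.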

Why it works

  • Candidate-scoring workloads naturally have high entity reuse. Scoring N candidate ads for one user means N lookups of the same user embedding without the optimisation, 1 lookup with it. For batched requests across M users with N candidates each, total user-embedding lookups drop from M·N to M — a reuse ratio of N.
  • Embedding lookups are memory-bound on GPU. Fewer lookups → less HBM pressure → better overall throughput, not just fewer ops.
  • Transparent to the model. The optimisation is a serving-layer rewrite — no model-architecture change, no retraining, no correctness risk from the compute perspective.
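The reuse arithmetic is worth making concrete. A back-of-envelope calculation with illustrative numbers (M and N are not disclosed figures):

```python
# M unique users with N candidates each in one batch (illustrative).
M, N = 4, 250

naive_lookups = M * N        # one lookup per (user, candidate) row
deduped_lookups = M          # one lookup per unique user
reuse_ratio = naive_lookups / deduped_lookups

print(reuse_ratio)           # → 250.0, i.e. N: the candidates-per-user fan-out
```

Note the ratio depends only on N, not on M — batching more users does not dilute the win, it just adds more unique lookups alongside it.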

The failure mode

Request-level broadcasting assumes a bounded number of unique entities per batch. If a batch has more unique users than the implementation's capacity, the request fails (or the optimisation has to fall back to per-request lookup, losing the win).

Pinterest's operational safety mechanism (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"The main trade-off is an upper bound on the number of unique users per batch; if exceeded, the request can fail, so we used the tested unique user number to keep the system reliable."

The tested-unique-user cap is tuned empirically — pick a number that covers near-100% of production batches in steady state, reject batches that exceed it.
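The guard described above can be sketched as a pre-check on unique-key cardinality. All names here are hypothetical — `MAX_UNIQUE_USERS` stands in for the undisclosed "tested unique user" cap, and the callbacks stand in for the real fetch, scoring, and fallback paths:

```python
# Illustrative value only; the real cap is tuned empirically against
# production batch statistics and is not disclosed.
MAX_UNIQUE_USERS = 64

def score_batch(user_ids, fetch_fn, broadcast_score_fn, fallback_fn):
    """Apply the broadcast optimisation only when under the tested cap."""
    unique = set(user_ids)
    if len(unique) > MAX_UNIQUE_USERS:
        # Over the cap: reject the request or fall back to the
        # unoptimised per-row lookup path (losing the win).
        return fallback_fn(user_ids)
    # Under the cap: one fetch per unique user, then broadcast to rows.
    embeddings = {u: fetch_fn(u) for u in unique}
    return broadcast_score_fn(user_ids, embeddings)
```

Whether the over-cap branch rejects or falls back is a policy choice; the source only says the request "can fail" when the bound is exceeded.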

Generalisations

The same pattern applies beyond user embeddings:

  • Per-item embedding broadcast — in two-tower retrieval, item embeddings are often reused across users in a batch.
  • Per-query embedding broadcast — in search ranking, the same query embedding is needed for every candidate result.
  • Per-context embedding broadcast — location, device, session embeddings are shared across candidates in a request.

The key criterion: entity is shared across many rows in the batch, and the per-lookup cost is high (heavy embedding, HBM-bound).

Caveats

  • Pinterest doesn't disclose the tested unique-user number, the batch-layout details, or whether the broadcast is GPU-native or host-side scatter.
  • Latency win magnitude not disclosed — only the qualitative "reduced serving latency" claim.
  • Cap-based failure is a real operational tax — batches exceeding the cap must be rejected or fall back, and every such request runs in degraded mode.
  • Inapplicable without reuse — for single-candidate requests (one candidate per user), the optimisation is a no-op with small overhead.
