
CONCEPT

Request-level embedding broadcast

Definition

Request-level embedding broadcast is a serving-side optimisation for ranking / CTR models where a heavy per-entity embedding (typically the user embedding) is fetched once per unique entity per batch and then broadcast to the original per-request layout before model inference.

Structurally:

  Before:                              After:
  ───────                              ──────
  batch = [ (user_A, ad_1),            unique_users = { A, B }
            (user_A, ad_2),            embeddings = { A: emb_A, B: emb_B }    ← 2 lookups
            (user_A, ad_3),
            (user_B, ad_4),            broadcast back to batch layout:
            (user_B, ad_5) ]            [ emb_A, emb_A, emb_A, emb_B, emb_B ]

  5 user embedding lookups             Model inputs / outputs unchanged

The entity to broadcast is typically the user, because: (a) in candidate scoring, the same user embedding is needed for every candidate in the batch; (b) user embeddings are often the largest single lookup in the feature set; (c) the ratio of total lookups to unique users is the amplification factor (candidates per user per batch).

The broadcast mechanic

  • Deduplicate — identify unique entity keys in the batch.
  • Fetch unique — look up the embedding table once per unique key.
  • Broadcast — expand the unique-entity embeddings back to the original per-request layout using indexing / scatter operations.
  • Keep model inputs and outputs unchanged — the downstream model sees the same per-request tensor shape it expected before the optimisation.
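The four steps above map directly onto a gather-style indexing operation. A minimal numpy sketch, with illustrative user ids and a toy in-process dict standing in for the real embedding table or service:

```python
import numpy as np

# Hypothetical batch: one row per (user, candidate) pair, as in the
# diagram above. User keys repeat across candidates.
user_ids = np.array([101, 101, 101, 202, 202])   # A, A, A, B, B

# Toy embedding table: user id -> 4-dim embedding (stand-in for the
# real lookup, which is the expensive step being deduplicated).
table = {101: np.full(4, 0.1), 202: np.full(4, 0.2)}

# 1. Deduplicate: unique keys, plus the inverse index that maps each
#    batch row back to its position in the unique-key list.
unique_ids, inverse = np.unique(user_ids, return_inverse=True)

# 2. Fetch once per unique key: 2 lookups instead of 5.
unique_embs = np.stack([table[u] for u in unique_ids])

# 3. Broadcast back to the per-request layout via indexing (a gather).
batch_embs = unique_embs[inverse]

# 4. Downstream model input shape is unchanged: (batch_size, emb_dim).
assert batch_embs.shape == (len(user_ids), 4)
```

`return_inverse=True` gives the scatter indices for free, so the broadcast is a single vectorised gather rather than a per-row copy loop.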

Why it works

  • Candidate-scoring workloads naturally have high entity reuse. Scoring N candidate ads for one user means N lookups of the same user embedding without the optimisation, 1 lookup with it. For batched requests across M users with N candidates each, total user-embedding lookups drop from M·N to M — a reuse ratio of N.
  • Embedding lookups are memory-bound on GPU. Fewer lookups → less HBM pressure → better overall throughput, not just fewer ops.
  • Transparent to the model. The optimisation is a serving-layer rewrite — no model-architecture change, no retraining, no correctness risk from the compute perspective.
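The reuse arithmetic is worth making concrete. A back-of-envelope calculation with illustrative numbers (M and N are not disclosed figures):

```python
# M unique users with N candidates each in one batch (illustrative).
M, N = 4, 250

naive_lookups = M * N        # one lookup per (user, candidate) row
deduped_lookups = M          # one lookup per unique user
reuse_ratio = naive_lookups / deduped_lookups

print(reuse_ratio)           # → 250.0, i.e. N: the candidates-per-user fan-out
```

Note the ratio depends only on N, not on M — batching more users does not dilute the win, it just adds more unique lookups alongside it.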

The failure mode

Request-level broadcasting assumes a bounded number of unique entities per batch. If a batch has more unique users than the implementation's capacity, the request fails (or the optimisation has to fall back to per-request lookup, losing the win).

Pinterest's operational safety mechanism (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"The main trade-off is an upper bound on the number of unique users per batch; if exceeded, the request can fail, so we used the tested unique user number to keep the system reliable."

The tested-unique-user cap is tuned empirically — pick a number that covers near-100% of production batches in steady state, reject batches that exceed it.
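The guard described above can be sketched as a pre-check on unique-key cardinality. All names here are hypothetical — `MAX_UNIQUE_USERS` stands in for the undisclosed "tested unique user" cap, and the callbacks stand in for the real fetch, scoring, and fallback paths:

```python
# Illustrative value only; the real cap is tuned empirically against
# production batch statistics and is not disclosed.
MAX_UNIQUE_USERS = 64

def score_batch(user_ids, fetch_fn, broadcast_score_fn, fallback_fn):
    """Apply the broadcast optimisation only when under the tested cap."""
    unique = set(user_ids)
    if len(unique) > MAX_UNIQUE_USERS:
        # Over the cap: reject the request or fall back to the
        # unoptimised per-row lookup path (losing the win).
        return fallback_fn(user_ids)
    # Under the cap: one fetch per unique user, then broadcast to rows.
    embeddings = {u: fetch_fn(u) for u in unique}
    return broadcast_score_fn(user_ids, embeddings)
```

Whether the over-cap branch rejects or falls back is a policy choice; the source only says the request "can fail" when the bound is exceeded.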

Generalisations

The same pattern applies beyond user embeddings:

  • Per-item embedding broadcast — in two-tower retrieval, item embeddings are often reused across users in a batch.
  • Per-query embedding broadcast — in search ranking, the same query embedding is needed for every candidate result.
  • Per-context embedding broadcast — location, device, session embeddings are shared across candidates in a request.

The key criterion: entity is shared across many rows in the batch, and the per-lookup cost is high (heavy embedding, HBM-bound).

Caveats

  • Pinterest doesn't disclose the tested unique-user number, the batch-layout details, or whether the broadcast is GPU-native or host-side scatter.
  • Latency win magnitude not disclosed — only the qualitative "reduced serving latency" claim.
  • Cap-based failure is a real operational tax — batches exceeding the cap must be rejected or fall back, and every such request runs in degraded mode.
  • Inapplicable without reuse — for single-candidate requests (one candidate per user), the optimisation is a no-op with small overhead.
