
Request-level user-embedding broadcast

Pattern

On the serving side of a candidate-scoring ML model, deduplicate per-user embedding lookups at the batch level: fetch each unique user's embedding once, then broadcast the embedding back to the original per-candidate request layout. Model inputs and outputs remain unchanged; only the embedding-table lookup count drops.

Problem

Ads / recsys / search ranking workloads often have the following batch shape:

batch = [ (user_A, candidate_1),
          (user_A, candidate_2),
          (user_A, candidate_3),
          (user_B, candidate_4),
          (user_B, candidate_5) ]

Naive serving performs one user-embedding lookup per row — here 5 lookups for 2 unique users. At Pinterest-scale serving with:

  • Large user-embedding tables (potentially billions of users × wide embeddings),
  • Heavy per-lookup cost (HBM-bound, expensive index gather),
  • High per-batch reuse (many candidates per user per request),

the redundant lookups dominate serving latency and waste HBM bandwidth. The user embedding is a natural deduplication candidate: it's the same for all candidates scored for one user in one request.
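The lookup arithmetic for the example batch above is simple to check directly (the user/candidate names are the placeholders from the batch sketch):

```python
# Toy batch from the example above: (user, candidate) pairs.
batch = [("user_A", "candidate_1"), ("user_A", "candidate_2"),
         ("user_A", "candidate_3"), ("user_B", "candidate_4"),
         ("user_B", "candidate_5")]

naive_lookups = len(batch)                        # one lookup per row  -> 5
dedup_lookups = len({user for user, _ in batch})  # one per unique user -> 2
```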

Solution

Three-phase transformation of the batch, followed by an unchanged model call:

Phase 1: Deduplicate.
  unique_users = unique_keys(batch)  →  { A, B }

Phase 2: Fetch unique.
  emb_A = embedding_table[A]  ← one lookup
  emb_B = embedding_table[B]  ← one lookup

Phase 3: Broadcast.
  per_row_embeddings = scatter/index(
      [emb_A, emb_B],
      row_to_user_index = [0, 0, 0, 1, 1]
  )
  # → [emb_A, emb_A, emb_A, emb_B, emb_B]

Phase 4: Pass to model unchanged.
  model_output = model(per_row_embeddings, candidate_features)

The model sees exactly the same input/output shape as before the optimisation — the broadcast is a serving-layer rewrite that's invisible to the model weights, training pipeline, and downstream consumers.
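The phases above can be sketched in NumPy: `np.unique` returns both the unique keys and the inverse mapping, and that inverse mapping is exactly the `row_to_user_index` scatter index from Phase 3. The user ids, embedding values, and dict-backed `embedding_table` here are toy stand-ins, not a real serving API.

```python
import numpy as np

# One user id per batch row: 5 rows, 2 unique users (toy values).
row_user_ids = np.array([101, 101, 101, 202, 202])

# Stand-in for the real embedding table (user id -> embedding vector).
embedding_table = {101: np.array([0.1, 0.2]),
                   202: np.array([0.9, 0.8])}

# Phase 1: deduplicate. return_inverse gives row -> unique-user index.
unique_ids, row_to_user_index = np.unique(row_user_ids, return_inverse=True)

# Phase 2: fetch once per unique user (2 lookups instead of 5).
unique_embs = np.stack([embedding_table[u] for u in unique_ids])

# Phase 3: broadcast back to the per-row layout via the inverse index.
per_row_embeddings = unique_embs[row_to_user_index]  # shape (5, 2)
```

The model then consumes `per_row_embeddings` exactly as it would the naive per-row lookup result.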

Canonical wiki reference

Pinterest's unified ads engagement model applies this pattern (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"We reduced redundant embedding table look up work with request-level broadcasting. Instead of repeating heavy user embedding lookups for every candidate/request in a batch, we fetch embeddings once per unique user and then broadcast them back to the original request layout, keeping model inputs and outputs unchanged. The main trade-off is an upper bound on the number of unique users per batch; if exceeded, the request can fail, so we used the tested unique user number to keep the system reliable."

Operational safety mechanism

The per-batch unique-user cap is the critical operational constraint:

  • Implementations need pre-allocated buffer capacity for the deduplicated embeddings (lookup result buffers, scatter-index tensors).
  • Batches that exceed the pre-allocated capacity must either fail (Pinterest's choice) or fall back to per-request lookup (slower).
  • Pinterest empirically tuned the cap — pick a number that covers near-100% of production batches in steady state, reject excess.
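A minimal sketch of the capacity guard, assuming a hypothetical `MAX_UNIQUE_USERS` cap and a `fetch_fn` that batch-fetches embeddings for an array of ids. This sketch takes the fallback branch (per-row lookup) on overflow; Pinterest's implementation fails the request instead, and their tuned cap value is undisclosed.

```python
import numpy as np

MAX_UNIQUE_USERS = 128  # hypothetical; the production value is not disclosed


def lookup_with_dedup(row_user_ids, fetch_fn):
    """Deduplicated embedding lookup with a pre-allocated-capacity guard.

    fetch_fn(ids) -> array of shape (len(ids), dim).
    Over the cap, fall back to one lookup per row (slower but safe);
    the alternative, Pinterest's choice, is to reject the request.
    """
    unique_ids, inverse = np.unique(row_user_ids, return_inverse=True)
    if len(unique_ids) > MAX_UNIQUE_USERS:
        return fetch_fn(row_user_ids)      # slow path: no deduplication
    return fetch_fn(unique_ids)[inverse]   # fast path: dedup + broadcast
```

Choosing the cap from observed production batch shapes keeps the slow path (or rejection) confined to the tail.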

When to apply

  • Candidate-scoring workloads where the same entity (user, query, context) is scored against many candidates in a batch.
  • Heavy per-lookup cost (large embedding tables, HBM-bound) where amortisation matters.
  • Predictable batch structure with a bounded max-unique-entities per batch.

When NOT to apply

  • Single-candidate scoring (fan-in = 1) — no reuse to amortise; optimisation is pure overhead.
  • Highly variable batch structure — unpredictable unique-entity counts can trigger frequent failures.
  • Small embedding tables where the lookup cost is negligible — optimisation not worth the engineering.
  • Workloads where entities are not shared across batch rows (e.g. one user per query, one query per batch).

Generalisations

  • Per-item embedding broadcast in two-tower retrieval — item embeddings reused across users.
  • Per-query embedding broadcast in search ranking — query embedding reused across candidate results.
  • Per-context embedding broadcast — location, device, session contexts reused across candidates.

Any time a shared entity is reused across many rows of a batch and per-lookup cost is high, the pattern applies.

Caveats

  • Pinterest doesn't disclose the tested-unique-user cap number, the implementation language / framework, or the failure-mode semantics (retry vs hard-fail).
  • Win magnitude not disclosed. "Reduced serving latency" — qualitative only.
  • Cap-rejected batches represent degraded availability in tail conditions; tail latency of this pattern is worse than steady-state latency.
  • Distinct from per-batch attention caching (KV cache in LLM serving) — the broadcast is purely about deduplication of static embedding lookups, not dynamic attention state.
  • Potential correctness pitfall. The broadcast assumes the model consumes the user embedding as a read-only feature — any model that modifies the user embedding per-candidate (rare but possible in attention-based user-candidate interaction) would break.
