
PATTERN Cited by 1 source

Cached KV cross-attention for deduplication

Problem

Ranking-stage transformers in recommendation systems let each candidate item attend to the user's history sequence — this coupling is what gives ranking models more expressive power than two-tower retrieval. But the item-to-user coupling means standard self-attention reruns the full user-sequence forward pass per candidate. With B user-item pairs in a batch drawn from R unique requests (B/R candidates per request on average), the user-sequence compute runs B times when R would suffice — B/R redundant copies of the same work per request.

At Pinterest's scale — ~16K-token user sequences, hundreds-to-thousands of candidates per request — the redundant user-sequence compute dominates ranking serving cost. Two-tower retrieval naturally dedupes this (the item tower runs on all items, the user tower runs once per user), but ranking attention architectures cannot be factored the same way: standard fused self-attention kernels (e.g. FlashAttention) treat the whole (query, key, value) set as one fused call and expose no hook for factoring out the shared user-sequence work.
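
To make the redundancy concrete, here is a minimal sketch of the coupled baseline (not Pinterest's code; the shapes, the single-token candidate layout, and the absence of learned Q/K/V projections are simplifying assumptions, and sizes are shrunk far below production scale):

  # Coupled baseline: every user-item pair carries its own copy of the user
  # sequence, so the user-sequence attention work is repeated B/R times per
  # request. Real scale is ~16K history tokens and hundreds-to-thousands of
  # candidates per request.
  import torch
  import torch.nn.functional as F

  R, S, B, D, H = 4, 512, 16, 128, 8        # requests, seq len, pairs, dim, heads

  user_seq = torch.randn(R, S, D)           # one history per request
  cands    = torch.randn(B, 1, D)           # one candidate token per pair
  req_idx  = torch.arange(B) % R            # which request each pair belongs to

  x = torch.cat([cands, user_seq[req_idx]], dim=1)      # (B, 1 + S, D): history duplicated
  qkv = x.view(B, 1 + S, H, D // H).transpose(1, 2)     # (B, H, 1 + S, head_dim)
  out = F.scaled_dot_product_attention(qkv, qkv, qkv)   # full self-attention, run B times

The user_seq[req_idx] gather is exactly the duplication the pattern below removes.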

Pattern

Split the transformer into a context pass + a crossing pass:

  1. Context pass — apply the transformer to the user's action sequence once per deduplicated request. For each layer, cache the Keys and Values produced by the user sequence's self-attention.

  2. Crossing pass — each candidate item's representation performs cross-attention against the cached user-history KV. The cache is read-only here; candidates are queries, user KV is the reference.

Per request (runs R times in a batch of B user-item pairs):
  user_seq → Transformer context layers
           → cached KV per layer  (user-history representation)

Per candidate (runs B times):
  candidate → cross-attention against cached KV per layer
            → per-candidate contextualised representation

Gradients for the context pass are accumulated at the deduplicated level; gradients for the crossing pass are accumulated per candidate.
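
A minimal sketch of the factoring in plain PyTorch, assuming single-token candidates, Q/K/V projections shared between the two passes, and attention-only layers (no MLPs or residuals); the tensor and module names (wq, wk, wv, split_heads, req_idx, kv_cache) are illustrative, not Pinterest's, and DCAT itself implements this with custom Triton kernels rather than these library calls:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  D, H, L = 128, 8, 2                        # model dim, heads, layers (illustrative)
  R, S, B = 4, 512, 16                       # requests, user-seq length, user-item pairs
  hd = D // H

  wq = nn.ModuleList([nn.Linear(D, D) for _ in range(L)])
  wk = nn.ModuleList([nn.Linear(D, D) for _ in range(L)])
  wv = nn.ModuleList([nn.Linear(D, D) for _ in range(L)])

  def split_heads(t):                        # (N, T, D) -> (N, H, T, hd)
      return t.view(t.shape[0], t.shape[1], H, hd).transpose(1, 2)

  user_seq = torch.randn(R, S, D)
  cands    = torch.randn(B, 1, D)
  req_idx  = torch.arange(B) % R             # request each candidate belongs to

  # Context pass: runs once per deduplicated request, caching K/V at every layer.
  kv_cache, h = [], user_seq
  for l in range(L):
      k, v = split_heads(wk[l](h)), split_heads(wv[l](h))
      kv_cache.append((k, v))
      q = split_heads(wq[l](h))
      h = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(R, S, D)

  # Crossing pass: each candidate cross-attends to its request's cached K/V.
  # The k[req_idx] / v[req_idx] gather copies K/V per candidate here; a custom
  # kernel would instead index the shared cache in place. Backprop through the
  # gather accumulates per-candidate gradients onto the R-level cache tensors.
  c = cands
  for l in range(L):
      k, v = kv_cache[l]
      q = split_heads(wq[l](c))                                   # (B, H, 1, hd)
      out = F.scaled_dot_product_attention(q, k[req_idx], v[req_idx])
      c = out.transpose(1, 2).reshape(B, 1, D)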

Pinterest's canonical instance — DCAT

Pinterest's DCAT (Deduplicated Cross-Attention Transformer) is the canonical wiki instance of this pattern:

"The key insight is to separate the transformer into two components: 1. Context: Apply the transformer to the user's historical action sequence once per deduplicated request. The keys and values (KV) from each layer are cached. 2. Crossing: Each candidate item performs cross-attention with the cached user history KV, reusing the deduplicated context computation. This optimization, implemented with custom Triton kernels for both training and serving, achieved significant throughput gains over standard self-attention with FlashAttention." (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication)

Custom Triton kernels for training + serving displace FlashAttention — FlashAttention's IO-aware self-attention is the right kernel for the original coupled shape, but it has no API for reusing user-sequence KV across a batch of per-item queries.
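
For intuition about what such a factoring hook looks like at the API level, here is a hedged sketch of one special case: if every request contributed exactly the same number of candidates, the crossing pass could be folded into a stock batched attention call with no per-candidate K/V copy. The general ragged case (variable candidate counts per request) is what the custom kernels handle; none of the tensor names below come from Pinterest.

  import torch
  import torch.nn.functional as F

  R, S, H, hd = 4, 512, 8, 16
  n_per_req = 64                                 # fixed candidates per request (assumed)

  k = torch.randn(R, H, S, hd)                   # cached user-history keys, one set per request
  v = torch.randn(R, H, S, hd)                   # cached user-history values
  q = torch.randn(R * n_per_req, H, 1, hd)       # one query per candidate, grouped by request

  # Queries in cross-attention do not interact, so attending (n_per_req, S)
  # per request is equivalent to n_per_req separate (1, S) calls, and the
  # cached K/V is read once per request instead of once per candidate.
  q_grouped = q.reshape(R, n_per_req, H, hd).permute(0, 2, 1, 3)   # (R, H, n_per_req, hd)
  out = F.scaled_dot_product_attention(q_grouped, k, v)            # (R, H, n_per_req, hd)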

Production impact (Pinterest)

  • Training: 2× speedup from DCAT cross-attention alone; the remaining ~40% of the ranking-training speedup comes from deduplicated data loading, compounding to roughly 2.8× end-to-end (2 × 1.4 ≈ 2.8).
  • Serving: 7× increase in ranking serving throughput — the headline number that "made it possible to deploy a 100× larger model without proportional serving cost increases."

(US, 2025, Pinterest internal data, citation "²".)

Shape analogy — LLM KV cache

The pattern is structurally the same primitive as the autoregressive-decode KV cache in LLM inference:

                   LLM KV cache                  DCAT
  Shared prefix    prompt tokens                 user action sequence
  Reused across    subsequent decode tokens      candidate items
  Populated by     prefill pass                  context pass
  Consumed by      per-token attention           per-candidate cross-attention

The reuse unit differs (autoregressive tokens vs batched candidates), but the invariant holds — compute K/V for the shared input once, and let per-item queries cross-attend to the cache.
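
The decode-side analogue in miniature (a stripped-down sketch: appending the new token's own K/V to the cache, causal masking, and the prefill computation itself are all omitted):

  import torch
  import torch.nn.functional as F

  H, hd, prompt_len = 8, 16, 1024

  k_cache = torch.randn(1, H, prompt_len, hd)    # populated once by the prefill pass
  v_cache = torch.randn(1, H, prompt_len, hd)

  # Each decode step issues a single-token query against the cached prompt K/V,
  # just as each candidate issues a single query against the cached user-history K/V.
  q_next = torch.randn(1, H, 1, hd)
  out = F.scaled_dot_product_attention(q_next, k_cache, v_cache)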

Generalisations

Anywhere a transformer's query is "per candidate" but the key/value substrate is "per request," this factoring applies:

  • Ad ranking — user sequence shared, candidate ads vary.
  • Video recommendation — user-watch-history shared, candidate videos vary.
  • Search ranking — query tokens + user context shared, candidate documents vary.
  • Session-aware recsys — session tokens shared, candidate items vary.
  • Job-matching — user-profile embedding shared, candidate jobs vary.

Caveats

  • Architecture-specific — only relevant for ranking models where candidate items attend to a shared user sequence; item-only or user-only transformers get no benefit.
  • Custom kernels required — standard fused-attention libraries don't expose a KV-cache-across-batch hook; Pinterest wrote Triton kernels for both training + serving.
  • Pinterest doesn't disclose kernel-level details — sequence-length handling, gradient-accumulation arithmetic, cross-attention tile shape, mixed-precision choices all live in the Foundation Model paper, not the 2026-04-13 post.
  • "Significant throughput gains over FlashAttention" is qualitative — no input-shape, context-length, or candidate-count baselines.
  • Training vs serving symmetry — Pinterest uses the same Triton kernels for both, but whether gradient flow through the cached KV complicates training (vs a simpler serving-only implementation) is not discussed.
  • Memory footprint of the cached KV — per-request cache size grows with sequence length, layer count, and head dimension; at scale this is another dimension to budget against candidate count × context length (a back-of-envelope sketch follows this list).
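
A back-of-envelope footprint for the per-request cache, with made-up configuration numbers (Pinterest does not disclose these):

  # Per-request KV cache: K and V (factor of 2) at every layer, over the full
  # user sequence. All numbers below are illustrative, not Pinterest's.
  layers, seq_len, heads, head_dim = 16, 16_384, 8, 64
  bytes_per_el = 2                                       # fp16/bf16
  cache_bytes = 2 * layers * seq_len * heads * head_dim * bytes_per_el
  print(f"{cache_bytes / 2**20:.0f} MiB per request")    # 512 MiB with these numbers

Whatever the real numbers are, this gets multiplied by the number of concurrent requests resident on a serving host and traded against the candidate-count × context-length compute savings.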

Seen in

  • 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — canonical wiki pattern instance: context + crossing split with cached user-history KV; custom Triton kernels; 2× training + 7× serving throughput over FlashAttention baseline; the architecture that absorbed Pinterest Foundation Model's 100× parameter scaleup without proportional cost growth.