
DCAT (Deduplicated Cross-Attention Transformer)

Definition

DCAT (Deduplicated Cross-Attention Transformer) is Pinterest's transformer architecture for ranking models where each candidate item must attend to the user's history sequence. DCAT breaks the item-candidate coupling that standard self-attention imposes, replacing it with a two-phase context + crossing structure that computes the expensive user-history forward pass once per deduplicated request rather than once per candidate (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication).

Described in detail in the Pinterest Foundation Model paper (arXiv 2507.12704, ACM RecSys 2025 oral spotlight).

The problem DCAT solves

Ranking models use long user-history sequences (at Pinterest: ~16K tokens). Standard self-attention over the sequence is:

  • Deduplicable at the user level — the sequence is identical across all candidates in a request.
  • Coupled at the item level — each candidate attends to the user sequence, so naïve self-attention reruns the full forward pass per candidate.

Two-tower retrieval is deduplicable by definition — the user tower has no item dependencies. Ranking transformers have item dependencies — each candidate attends to user history, creating the item-candidate coupling that stops naïve deduplication.
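
For contrast, a minimal sketch (not Pinterest's code) of the naive coupled baseline, with `encoder` as a hypothetical stand-in for any full self-attention stack: the identical history is re-encoded once per candidate.

```python
import torch

# Naive coupling: tile the user history per candidate and rerun the full
# self-attention stack B times, even though the history is identical
# within a request. `encoder` is a hypothetical (B, S+1, d) -> (B, S+1, d)
# transformer stand-in, not Pinterest's actual model.
def naive_rank(user_seq, cand_emb, encoder):
    # user_seq: (S, d) one user's history; cand_emb: (B, d) candidates.
    B = cand_emb.shape[0]
    tiled = user_seq.unsqueeze(0).expand(B, -1, -1)          # (B, S, d)
    seqs = torch.cat([tiled, cand_emb.unsqueeze(1)], dim=1)  # (B, S+1, d)
    return encoder(seqs)[:, -1]  # re-encodes the ~16K-token history per candidate
```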

DCAT's key insight — separate context from crossing

Split the transformer into two components (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication):

1. Context

"Apply the transformer to the user's historical action sequence once per deduplicated request. The keys and values (KV) from each layer are cached."

The context pass is amortised across all candidates for the request. For B user-item pairs scored with R unique requests (so B/R candidates per request on average), the context pass runs R times rather than B — a B/R× reduction in the user-tower compute.
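
A minimal sketch of the context phase, assuming single-head attention, simplified residual blocks, and illustrative per-layer weight names (`wq`, `wk`, `wv`, `w1`, `w2`) rather than Pinterest's actual parameters:

```python
import torch
import torch.nn.functional as F

def context_pass(user_seq, layers):
    # user_seq: (R, S, d) — R deduplicated requests, S history tokens.
    # layers: list of per-layer weight dicts (names are illustrative).
    kv_cache, h = [], user_seq
    for w in layers:
        q, k, v = h @ w["wq"], h @ w["wk"], h @ w["wv"]
        kv_cache.append((k, v))                          # cache this layer's K, V
        h = h + F.scaled_dot_product_attention(q, k, v)  # self-attention over history
        h = h + F.relu(h @ w["w1"]) @ w["w2"]            # feed-forward
    return h, kv_cache                                   # runs R times, not B
```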

2. Crossing

"Each candidate item performs cross-attention with the cached user history KV, reusing the deduplicated context computation."

The crossing pass attends from each candidate to the cached user-history KV. This is the item-specific computation — still runs B times — but it reads the cached KV instead of recomputing the user sequence.
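
A matching sketch of the crossing phase, continuing the `context_pass` example above; `req_idx` (a candidate-to-request index map) and the `wq_x`/`w1_x`/`w2_x` weight names are assumptions for illustration:

```python
def crossing_pass(cand_emb, kv_cache, req_idx, layers):
    # cand_emb: (B, d) candidates; req_idx: (B,) candidate -> request index.
    h = cand_emb.unsqueeze(1)                    # (B, 1, d): one query token per candidate
    for (k, v), w in zip(kv_cache, layers):
        kb, vb = k[req_idx], v[req_idx]          # gather each candidate's cached request KV
        q = h @ w["wq_x"]                        # item-side query projection
        h = h + F.scaled_dot_product_attention(q, kb, vb)  # candidate -> history cross-attn
        h = h + F.relu(h @ w["w1_x"]) @ w["w2_x"]
    return h.squeeze(1)                          # (B, d) per-candidate representation
```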

Shape analogy to LLM inference

DCAT is structurally analogous to the KV cache in autoregressive LLM inference: compute K and V for the shared prefix once, reuse across every subsequent token's attention. In DCAT:

  • The "prefix" is the user's action sequence.
  • The "subsequent tokens" are the candidate items being scored.
  • The "cache" is the per-layer user-history KV.

The reuse unit differs — LLMs reuse across autoregressive decode steps, DCAT reuses across candidates in one batch — but the primitive (populate KV from shared input once, cross-attend from per-item queries) is the same.
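
A toy end-to-end run of the two sketches above, showing the B/R saving directly (the history encoder runs R = 2 times for B = 6 candidates):

```python
R, S, B, d = 2, 16, 6, 32
torch.manual_seed(0)
names = ["wq", "wk", "wv", "w1", "w2", "wq_x", "w1_x", "w2_x"]
layers = [{n: torch.randn(d, d) / d**0.5 for n in names} for _ in range(2)]

user_seq = torch.randn(R, S, d)             # deduplicated user histories
cand_emb = torch.randn(B, d)                # candidates across both requests
req_idx = torch.tensor([0, 0, 0, 1, 1, 1])  # candidates 0-2 -> request 0, etc.

_, kv = context_pass(user_seq, layers)               # context: once per unique request
out = crossing_pass(cand_emb, kv, req_idx, layers)   # crossing: once per candidate
print(out.shape)                                     # torch.Size([6, 32])
```

The gather `k[req_idx]` in the crossing pass is what lets one cached context serve many candidates, the same role the shared-prefix KV plays in LLM decoding.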

Production impact

  • Training: 2× speedup from DCAT cross-attention alone; deduplicated data loading adds a further 1.4× (the other 40%), compounding to ~2.8× end-to-end ranking-training speedup (2.0 × 1.4 ≈ 2.8).
  • Serving: 7× ranking serving throughput — the headline Pinterest number for the full scaleup; DCAT is "what made it possible to deploy a 100× larger model without proportional serving cost increases."

(Pinterest internal data, US, 2025.)

Caveats

  • Kernel-level detail not disclosed in the 2026-04-13 post — sequence-length handling, attention-head shape, numerical-precision choices, batch-size × unique-user distribution all live in the Foundation Model paper.
  • Comparison is qualitative: "significant throughput gains" vs FlashAttention, without specified input shapes, context lengths, candidate counts, or hardware.
  • DCAT is architecture-specific — retrieval doesn't need it (two-tower is already deduplicable); ranking models without item-to-user attention don't need it.
  • Gradient accumulation at the deduplicated level is asserted, but gradient-flow and gradient-scaling details are not disclosed.
  • Compatibility with other attention optimisations (paged attention, sparse attention, sliding-window) not disclosed.
