FlashAttention¶
FlashAttention is a family of IO-aware GPU attention kernels (Dao et al., 2022 onward) that compute exact softmax attention tile by tile in on-chip shared memory, avoiding materialisation of the full N × N attention matrix in HBM. The FlashAttention-family varlen kernels also provide padding-removal / variable-length attention, so concurrent sequences of different lengths can share a batch without padding tokens wasting GPU cycles.
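A minimal pure-PyTorch sketch of the tiling idea, assuming a single head and illustrative block sizes: it computes exact attention with an online (running) softmax, so the full N × N score matrix is never formed. The real kernels fuse these loops into a single CUDA/Triton kernel operating against SRAM; nothing below is the library's actual API.

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Exact softmax attention computed one K/V tile at a time.

    q, k, v: (seq_len, head_dim). Single-head sketch of the online-softmax
    recurrence FlashAttention fuses into one GPU kernel; the full score
    matrix is never materialised.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(n, 1)                  # running softmax denominator

    for start in range(0, n, block):
        kb = k[start:start + block]              # one K/V tile at a time
        vb = v[start:start + block]
        scores = (q @ kb.T) * scale              # (n, block) tile of scores
        tile_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, tile_max)
        # Rescale the previously accumulated output and denominator to the
        # new running max, then fold in this tile's contribution.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum

# Matches the naive reference up to floating-point error.
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5)
```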
This page is the canonical wiki entry under the slug flash-attention that source pages reference as [systems/flash-attention](<./flash-attention.md>). The sibling flashattention slug (systems/flashattention) covers the same system from the Netflix post-training-framework angle.
Use at Pinterest — baseline for DCAT¶
Pinterest's DCAT (Deduplicated Cross-Attention Transformer) achieves "significant throughput gains over standard self-attention with FlashAttention" (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication). FlashAttention is the performance baseline Pinterest compares DCAT against — standard self-attention with FlashAttention was the prior ranking-attention path before DCAT's two-phase context/crossing split replaced it.
Why DCAT beats FlashAttention for this workload: FlashAttention is IO-aware self-attention. It reduces HBM traffic per attention call, but a ranking batch still issues one full self-attention call per candidate. DCAT changes the shape of the computation: it factors the shared user-sequence context pass out of the per-candidate path, so the cost per request is 1 user-sequence pass plus B cross-attention calls, versus B full self-attention calls under FlashAttention. For recsys ranking with a shared user history, factoring out the context pass is a bigger win than any per-call IO optimisation; a rough comparison follows.
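A back-of-the-envelope sketch of the asymmetry. Every size here (H history tokens, B candidates, d head width, one query token per candidate) is an illustrative assumption, not a Pinterest figure:

```python
# Crude attention-FLOP comparison; H, B, d are assumed values, not
# Pinterest's production numbers, and each candidate is treated as a
# single query token for simplicity.
H, B, d = 2048, 500, 128  # history length, candidates per request, head dim

# Baseline: one self-attention call per candidate over (history + candidate);
# score + value FLOPs scale as ~2 * seq_len^2 * d per call.
baseline = B * 2 * (H + 1) ** 2 * d

# DCAT-style split: one self-attention pass over the shared history,
# then B cross-attention calls of 1 query against H keys each.
context_split = 2 * H**2 * d + B * 2 * H * d

print(f"baseline ~{baseline:.2e} FLOPs, "
      f"context-split ~{context_split:.2e} FLOPs, "
      f"ratio ~{baseline / context_split:.0f}x")
```

The exact ratio depends on candidate token counts and model shape; the point is that the quadratic history term is paid once per request instead of once per candidate.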
Use at Netflix¶
First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix, as part of Netflix's internal optimised model definitions, alongside systems/flex-attention, memory-efficient chunked cross-entropy, consistent MFU accounting, and uniform LoRA extensibility.
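Netflix's internal model definitions are not public. As a hedged sketch of how a PyTorch post-training stack commonly routes attention through FlashAttention, the PyTorch 2.3+ backend selector can pin SDPA dispatch to the flash kernel (shapes and dtypes below are illustrative; requires a CUDA device):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes: (batch, heads, seq_len, head_dim), bf16 on GPU.
q, k, v = (torch.randn(4, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Pin dispatch to the FlashAttention backend; if the kernel cannot serve
# these dtypes/shapes, SDPA raises instead of silently falling back.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```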
Seen in¶
- 2026-02-13 Netflix — Scaling LLM Post-Training at Netflix (sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix) — FlashAttention as part of Netflix's optimised post-training stack.
- 2026-04-13 Pinterest — Scaling Recommendation Systems with Request-Level Deduplication (sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication) — FlashAttention as the performance baseline Pinterest's DCAT displaces for recsys ranking attention.
Related¶
- systems/flashattention — sibling wiki entry under the unhyphenated slug.
- systems/flex-attention
- systems/pytorch
- systems/triton-lang
- systems/pinterest-dcat — the architecture that replaces FlashAttention for Pinterest recsys ranking.
- concepts/kv-cache — the general primitive DCAT's context pass populates.