FlashAttention

FlashAttention is a family of IO-aware GPU attention kernels (Dao et al., 2022 onward) that compute softmax attention in tiles held in fast on-chip memory (shared memory/SRAM), avoiding materialisation of the full N × N attention matrix in HBM. The FlashAttention-family varlen kernels additionally provide padding-removal / variable-length attention, so concurrent sequences of different lengths can share a batch without padding tokens wasting GPU cycles.
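The tiling trick above relies on the online-softmax recurrence: you can stream over key/value tiles keeping only a running max, a running normaliser, and a running weighted sum, and still get the exact softmax-attention output. A minimal pure-Python sketch of that recurrence for a single query row (illustrative only, nothing like the real fused CUDA kernel):

```python
# Sketch of the FlashAttention core idea: softmax attention over the keys in
# fixed-size tiles, so the full row of N scores is never materialised at once.
# Function name and tile size are illustrative, not a real library API.
import math

def tiled_attention_row(q, K, V, tile=2):
    """q: list[float]; K, V: lists of rows with len(K) == len(V)."""
    d = len(q)
    m = float("-inf")           # running max of scores seen so far
    l = 0.0                     # running softmax normaliser
    acc = [0.0] * len(V[0])     # running (unnormalised) weighted sum of V rows
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # rescale old stats to the new max
        l = l * scale + sum(math.exp(s - m_new) for s in scores)
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]       # normalise once at the end
```

Because the rescale-and-accumulate step is exact, the result matches a naive full-row softmax up to floating-point reordering; the payoff in the real kernel is that only one tile of scores ever lives in HBM-adjacent memory.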

This page is the canonical wiki entry under the slug flash-attention that source pages reference as [systems/flash-attention](<./flash-attention.md>). The sibling flashattention slug (systems/flashattention) covers the same system from the Netflix post-training-framework angle.

Use at Pinterest — baseline for DCAT

Pinterest's DCAT (Deduplicated Cross-Attention Transformer) achieves "significant throughput gains over standard self-attention with FlashAttention" (Source: sources/2026-04-13-pinterest-scaling-recommendation-systems-with-request-level-deduplication). FlashAttention is the performance baseline Pinterest compares DCAT against — standard self-attention with FlashAttention was the prior ranking-attention path before DCAT's two-phase context/crossing split replaced it.

Why DCAT beats FlashAttention for this workload: FlashAttention is IO-aware self-attention. It reduces HBM traffic per attention call, but still computes one call per candidate in a batch. DCAT changes the computation shape: it factors the user-sequence context pass out of the per-candidate path, so for a batch of B ranked candidates the cost is 1 user-sequence pass per request plus B cross-attention calls, versus FlashAttention's B self-attention calls. For recsys ranking with shared user history, factoring the context out is a bigger win than any per-call IO optimisation.
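A back-of-envelope cost model makes the asymmetry concrete. All names and the quadratic score-cost approximation below are illustrative assumptions, not Pinterest's published numbers: self-attention over a length-L sequence is scored as L², and cross-attention of C candidate tokens against L context tokens as C·L (head and dimension constants are ignored since they cancel in the ratio).

```python
# Hypothetical cost model for "B self-attention calls" vs
# "1 context pass + B cross-attention calls" (unitless score counts).

def self_attn_cost(B, L):
    # FlashAttention-baseline path: B independent self-attention calls,
    # each over the shared user history of length L plus one candidate token.
    return B * (L + 1) ** 2

def dcat_cost(B, L, C=1):
    # DCAT-style path: one context pass over the user history, then B
    # cheap cross-attention calls of C candidate tokens against it.
    return L ** 2 + B * C * L
```

With, say, B = 500 candidates and L = 100 history tokens, the per-candidate path dominates the baseline while DCAT pays the quadratic context cost only once per request, which is the source's point: the win comes from reshaping the computation, not from per-call IO savings.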

Use at Netflix

First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix as part of Netflix's internal optimised model definitions — along with systems/flex-attention, memory-efficient chunked cross-entropy, consistent MFU accounting, and uniform LoRA extensibility.
