
SYSTEM · Cited by 1 source

FlashAttention

FlashAttention is a family of IO-aware GPU attention kernels (Dao et al., 2022+) that compute softmax attention tile by tile in on-chip shared memory (SRAM), never materialising the full N × N attention matrix in HBM; a sketch of the tiling idea follows below. The family's varlen kernels also provide padding-removal / variable-length attention, so concurrent sequences of different lengths can share a batch without padding tokens wasting GPU cycles.
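To make the tiling idea concrete, here is a minimal NumPy sketch of the online-softmax loop the kernels implement; the block sizes, function names, and single-head (N, d) layout are illustrative assumptions, and the real CUDA kernels additionally fuse the backward pass via recomputation rather than storing score tiles.

```python
# A minimal NumPy sketch of the tiled, online-softmax loop that FlashAttention
# implements as a fused CUDA kernel. Block sizes, names, and the single-head
# (N, d) layout are illustrative assumptions; the real kernels keep each tile
# in SRAM and recompute tiles in the backward pass instead of storing them.
import numpy as np

def naive_attention(Q, K, V):
    """Reference implementation that materialises the full N x N score matrix."""
    s = Q @ K.T / np.sqrt(Q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Same result, computed one (block_q x block_k) tile of scores at a time,
    carrying only running row maxima and softmax denominators."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    for qs in range(0, N, block_q):
        q = Q[qs:qs + block_q]                       # query tile (stays "on chip")
        m = np.full(q.shape[0], -np.inf)             # running row max
        l = np.zeros(q.shape[0])                     # running softmax denominator
        acc = np.zeros_like(q)                       # unnormalised output accumulator

        for ks in range(0, N, block_k):
            k, v = K[ks:ks + block_k], V[ks:ks + block_k]
            s = (q @ k.T) * scale                    # score tile, never written to "HBM"

            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])           # this tile's softmax numerators
            correction = np.exp(m - m_new)           # rescale earlier partial results
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]
    return O

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 256, 64))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V), atol=1e-6))
```

Carrying the running maximum and denominator is what lets the softmax be normalised incrementally, so no tile of scores ever needs to be revisited or stored. The varlen kernels apply the same loop to sequences packed end to end in a single tensor and delimited by cumulative-length offsets, which is what removes padding tokens from the batch.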

First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix, where it appears as part of Netflix's internal optimised model definitions, alongside systems/flex-attention, memory-efficient chunked cross-entropy, consistent MFU accounting, and uniform LoRA extensibility.

Last updated · 550 distilled / 1,221 read