
FlashAttention-2

FlashAttention-2 (Dao 2023) is an IO-aware attention kernel that cuts transformer attention memory from O(N²) to O(N) by tiling the softmax and attention-output computation into on-chip SRAM blocks, avoiding materialisation of the full N×N attention matrix in HBM. Version 2 improves on the original FlashAttention through better GPU utilisation: parallelism across the sequence-length dimension, fewer non-matmul FLOPs, and better work partitioning across warps.
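The tiling trick rests on the "online softmax": the softmax over a row can be computed one key/value block at a time by carrying a running row-max and denominator, rescaling partial results as new blocks arrive. A minimal NumPy sketch (illustrative only — the real kernel runs fused in CUDA with SRAM-resident tiles, and all names here are hypothetical):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materialises the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # (N, N) lives in memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    """FlashAttention-style computation: iterate over K/V blocks,
    keeping only O(N) running statistics (row max m, denominator l),
    so the N x N score matrix is never formed."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)      # running row max
    l = np.zeros(N)              # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                       # only an (N, block) tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)             # rescale earlier blocks
        l = l * correction + P.sum(axis=-1)
        O = O * correction[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The rescaling by `exp(m - m_new)` is what lets each block be processed with only the statistics seen so far, which is exactly why the full matrix never needs to leave SRAM.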

Relevance to this wiki: standard integration in large-scale LLM training and long-context inference stacks, where attention would otherwise dominate memory pressure and limit context length.


Why it matters

  • Attention memory at long context is the bottleneck. Standard attention materialises an N×N score matrix in HBM; at realistic training context lengths this matrix dominates memory pressure. FlashAttention's blocked softmax keeps the computation tiled in SRAM and never forms the full matrix.
  • Matters most at scale. The longer the context and the bigger the model, the bigger the FlashAttention-2 win in both wall-clock and feasibility.
  • Drops into training + inference frameworks. Megatron-LM, DeepSpeed, vLLM, TensorRT-LLM, SGLang, and others all ship FlashAttention-2 integrations. For eBay's purposes, it arrives transparently via Megatron-LM.
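To make the "matters most at scale" point concrete, here is a back-of-envelope calculation (illustrative arithmetic, not measured numbers; the head dimension d = 128 is an assumption):

```python
# Memory for the attention score matrix at a 32k-token context,
# per head, in fp16: standard attention stores all N*N scores.
N = 32_768
bytes_per_score = 2                      # fp16
scores_bytes = N * N * bytes_per_score
print(f"{scores_bytes / 2**30:.1f} GiB per head")   # 2.0 GiB per head

# FlashAttention keeps only O(N) softmax statistics plus the O(N*d)
# output tile; e.g. with an assumed head dim d = 128:
d = 128
flash_bytes = N * (2 * 4 + d * 2)        # two fp32 stats + fp16 output row
print(f"{flash_bytes / 2**20:.2f} MiB per head")    # 8.25 MiB per head
```

The quadratic term is what makes long context infeasible without tiling: doubling N quadruples the standard-attention score memory but only doubles the FlashAttention footprint.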

Stub — expand as more sources cite the kernel (expected to appear in many forthcoming LLM training + inference posts).
