SYSTEM Cited by 1 source
FlashAttention-2¶
FlashAttention-2 (Dao 2023) is an IO-aware attention kernel that cuts transformer attention memory from O(N²) to O(N) by tiling the softmax + attention-output computation into on-chip SRAM blocks, avoiding materialisation of the full N×N attention matrix in HBM. v2 improves on the original FlashAttention through better GPU utilisation: added parallelism across the sequence-length dimension, fewer non-matmul FLOPs, and improved work partitioning across warps.
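The core trick behind the tiling is an "online" softmax: process one key/value block at a time, keeping a running row-max and running denominator so earlier partial results can be rescaled as later blocks arrive. A minimal single-head NumPy sketch of that idea (not the CUDA kernel itself; `block` is a hypothetical tuning knob):

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """softmax(Q K^T / sqrt(d)) V computed one K/V block at a time,
    never materialising the full N x N score matrix."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(V)
    m = np.full(N, -np.inf)   # running row-max of the scores
    l = np.zeros(N)           # running softmax denominator
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                 # N x block tile of scores
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])         # tile of unnormalised probs
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

def naive_attention(Q, K, V):
    """Reference implementation that materialises the full score matrix."""
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V
```

The two functions agree to floating-point tolerance; the tiled version only ever holds an N×`block` tile of scores, which is what lets the real kernel keep its working set in SRAM.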
Relevance to this wiki: standard integration in large-scale LLM training and long-context inference stacks, where attention would otherwise dominate memory pressure and limit context length.
Seen in (wiki)¶
- eBay e-Llama training. FlashAttention-2 is explicitly named as one of the optimizations wired into the Megatron-LM 3D-parallel training of e-Llama 8B + 70B on 480 H100 GPUs. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
Why it matters¶
- Attention memory at long context is the bottleneck. Standard attention materialises an N×N score matrix in HBM; at realistic training context lengths this quadratic term dominates memory pressure. FlashAttention's blocked softmax processes the matrix tile by tile in SRAM and never writes it out in full.
- Matters most at scale. The longer the context and the bigger the model, the larger the FlashAttention-2 win, both in wall-clock time and in whether the run fits in memory at all.
- Drops into training + inference frameworks. Megatron-LM, DeepSpeed, vLLM, TensorRT-LLM, SGLang, etc. all ship FlashAttention-2 integrations. For eBay's purposes, it arrives transparently via Megatron-LM.
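A back-of-envelope calculation (plain Python; the model shape is a hypothetical example, not taken from any source above) makes the first bullet concrete: fp16 attention scores cost 2·N² bytes per head per layer, so the full matrix quickly outgrows HBM as context length doubles.

```python
def score_matrix_bytes(seq_len, n_heads, dtype_bytes=2):
    """Bytes needed to materialise one layer's N x N fp16 attention
    scores across all heads (the term FlashAttention avoids storing)."""
    return dtype_bytes * seq_len * seq_len * n_heads

# Hypothetical 32-head model at a 32k context:
# 2 * 32768^2 * 32 bytes = 64 GiB of scores for a single layer,
# i.e. more than an H100's 80 GB of HBM within two layers.
gib = score_matrix_bytes(32_768, 32) / 2**30
```

Quadrupling every time the context doubles, this is the term that makes long-context training infeasible without a tiled kernel.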
Stub — expand as more sources cite the kernel (expected to appear in many forthcoming LLM training + inference posts).
Related¶
- systems/megatron-lm — integrates FlashAttention-2 as the attention kernel for LLM training at scale.
- systems/e-llama — uses FlashAttention-2 via Megatron-LM for continued pretraining.
- systems/nvidia-h100 — the GPU target FlashAttention-2 is tuned for.
- concepts/continued-pretraining — scale-sensitive technique that benefits from the kernel.