
FlashAttention-2

FlashAttention-2 (Dao 2023) is an IO-aware attention kernel that cuts transformer attention memory from O(N²) to O(N) by tiling the softmax and attention-output computation into on-chip SRAM blocks, avoiding materialisation of the full N×N attention matrix in HBM. Version 2 improves on the original FlashAttention through better GPU utilisation: parallelism across the sequence-length dimension, fewer non-matmul FLOPs, and better work partitioning across warps.
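The tiling trick rests on the "online softmax": the softmax over a row can be computed one key/value block at a time by carrying a running row-max and denominator, rescaling partial results as new blocks arrive. A minimal NumPy sketch (illustrative only — the real kernel runs fused in CUDA with SRAM-resident tiles, and all names here are hypothetical):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materialises the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # (N, N) lives in memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    """FlashAttention-style computation: iterate over K/V blocks,
    keeping only O(N) running statistics (row max m, denominator l),
    so the N x N score matrix is never formed."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)      # running row max
    l = np.zeros(N)              # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                       # only an (N, block) tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)             # rescale earlier blocks
        l = l * correction + P.sum(axis=-1)
        O = O * correction[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The rescaling by `exp(m - m_new)` is what lets each block be processed with only the statistics seen so far, which is exactly why the full matrix never needs to leave SRAM.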

Relevance to this wiki: standard integration in large-scale LLM training and long-context inference stacks, where attention would otherwise dominate memory pressure and limit context length.


Why it matters

  • Attention memory at long context is the bottleneck. Standard attention materialises an N×N score matrix in HBM; at realistic training context lengths this matrix dominates memory pressure. FlashAttention's blocked softmax keeps the computation tiled in SRAM and never forms the full matrix.
  • Matters most at scale. The longer the context and the bigger the model, the bigger the FlashAttention-2 win in both wall-clock and feasibility.
  • Drops into training + inference frameworks. Megatron-LM, DeepSpeed, vLLM, TensorRT-LLM, SGLang, and others all ship FlashAttention-2 integrations. For eBay's purposes, it arrives transparently via Megatron-LM.
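To make the "matters most at scale" point concrete, here is a back-of-envelope calculation (illustrative arithmetic, not measured numbers; the head dimension d = 128 is an assumption):

```python
# Memory for the attention score matrix at a 32k-token context,
# per head, in fp16: standard attention stores all N*N scores.
N = 32_768
bytes_per_score = 2                      # fp16
scores_bytes = N * N * bytes_per_score
print(f"{scores_bytes / 2**30:.1f} GiB per head")   # 2.0 GiB per head

# FlashAttention keeps only O(N) softmax statistics plus the O(N*d)
# output tile; e.g. with an assumed head dim d = 128:
d = 128
flash_bytes = N * (2 * 4 + d * 2)        # two fp32 stats + fp16 output row
print(f"{flash_bytes / 2**20:.2f} MiB per head")    # 8.25 MiB per head
```

The quadratic term is what makes long context infeasible without tiling: doubling N quadruples the standard-attention score memory but only doubles the FlashAttention footprint.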

Stub — expand as more sources cite the kernel (expected to appear in many forthcoming LLM training + inference posts).
