FlexAttention¶
FlexAttention is PyTorch's API for customisable-yet-efficient attention: users write a Python score_mod function (attention-bias logic, masking, etc.) and PyTorch compiles it into a fused, performance-competitive GPU kernel. First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — Netflix cites FlexAttention integration into its internal optimised model definitions as one of the framework-level benefits of owning its own model-implementation layer (rather than training directly on transformers classes).
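To make the score_mod idea concrete, here is a minimal pure-Python sketch of its semantics: a user callback that rewrites each raw attention score for a (query index, key index) pair before the softmax. This is only an illustration of the math, not the real API — PyTorch's actual flex_attention (torch.nn.attention.flex_attention, torch ≥ 2.5) also passes batch and head indices to score_mod and compiles the whole thing into a fused kernel via torch.compile.

```python
import math

def attention_with_score_mod(q, k, v, score_mod):
    """Reference semantics of a FlexAttention-style score_mod hook.

    q, k, v are lists of vectors (lists of floats). score_mod takes
    (score, q_idx, kv_idx) and returns a modified score; returning
    -inf masks that position out entirely. Plain-Python sketch only:
    the real kernel fuses these steps on the GPU.
    """
    out = []
    for qi, qv in enumerate(q):
        # raw dot-product scores for this query against all keys
        scores = [sum(a * b for a, b in zip(qv, kv)) for kv in k]
        # the user-supplied score_mod rewrites each score pointwise
        scores = [score_mod(s, qi, ki) for ki, s in enumerate(scores)]
        # numerically stable softmax; exp(-inf) -> 0 removes masked keys
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of value vectors
        out.append([sum(w * vv[d] for w, vv in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    # example score_mod: mask future positions, as in a causal LM
    return score if kv_idx <= q_idx else float("-inf")
```

Masking, relative-position biases, ALiBi-style slopes, and so on are all just different score_mod bodies, which is why one fused kernel can serve many attention variants.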
Why it matters in Netflix's framework¶
FlexAttention unlocks framework-level optimisations across all supported model families without requiring each family to re-implement attention from scratch. When Netflix ports a new model family into its framework (via AI-agent bridges gated by logit verification), FlexAttention is part of what the ported model inherits — along with memory-efficient chunked cross-entropy, MFU accounting, and uniform LoRA extensibility.