Model-system co-design for ranking¶
Pattern¶
For LLM-scale ranking inference, raise Model FLOPs Utilisation (MFU) — the fraction of hardware-peak FLOPs the model actually realises — far above naive baselines by co-designing the model with the underlying GPU hardware:
- Selective low-precision quantisation (FP8 only where layers tolerate it).
- Operator fusion for shared inputs — minimise HBM ↔ SRAM traffic when multiple operators read the same tensor.
- Small-op consolidation — convert thousands of tiny kernel launches into compute-dense kernels via Grouped General Matrix Multiply and horizontal fusion.
- Graph alignment with Tensor Core tile shapes — so nominal model FLOPs translate into realised Tensor Core throughput.
Meta's outcome: 35% Model FLOPs Utilisation across multiple hardware types inside Meta Adaptive Ranking Model (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
Problem¶
A model's nominal FLOPs count and the realised FLOPs throughput on the GPU often diverge by an order of magnitude. MFU of 5-10% is common on naive deployments; the hardware is mostly idle waiting for memory, launching kernels, or running low-density ops that don't fill the Tensor Cores.
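The gap is easiest to see as arithmetic. A minimal sketch of the MFU calculation, with illustrative figures that are not from the source:

```python
def mfu(model_flops_per_request: float, requests_per_s: float,
        peak_flops_per_s: float) -> float:
    # Model FLOPs Utilisation: realised FLOP throughput / hardware peak.
    return (model_flops_per_request * requests_per_s) / peak_flops_per_s

# Hypothetical numbers: a 1-GFLOP-per-request ranking model serving
# 50k requests/s on an accelerator with a 1 PFLOP/s peak.
naive = mfu(1e9, 50_000, 1e15)
print(f"{naive:.0%}")  # 5% — the other 95% of peak compute sits idle
```

At 5% MFU, the same request volume consumes 7× the accelerators that a 35%-MFU deployment would.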
For LLM-scale ranking at Meta's serving volume, low MFU means:
- Latency budget blown — compute that should fit in the sub-second budget doesn't.
- Cost per request balloons — the GPU is billed at peak regardless of how much work it actually does.
Without co-design, LLM-scale complexity is architecturally infeasible.
Solution¶
Pair the model and the GPU system design tightly:
1. Selective FP8 quantisation¶
Meta applies selective FP8 only to layers with micro-benchmark-verified precision-loss tolerance. FP8 doubles Tensor Core throughput over BF16 where it applies, without the quality regression a blanket FP8 cast would cause in a ranking-sensitive domain.
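The selection logic can be sketched as a per-layer micro-benchmark gate. The quantiser below is a crude symmetric uniform (int8-style) stand-in — the source does not detail Meta's actual FP8 cast — and the layer names, weights, and tolerance are hypothetical:

```python
import numpy as np

def quant_error(w: np.ndarray, bits: int = 8) -> float:
    # Relative L2 error of a crude symmetric uniform quantiser; a stand-in
    # for FP8 casting, which behaves differently but fails on similar
    # (outlier-dominated) weight distributions.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale) * scale
    return float(np.linalg.norm(w - q) / np.linalg.norm(w))

def select_low_precision(layers: dict, tol: float = 0.01) -> list:
    # Keep low precision only where the micro-benchmarked error is tolerable;
    # everything else stays at the baseline precision (e.g. BF16).
    return [name for name, w in layers.items() if quant_error(w) <= tol]

layers = {
    "mlp_proj": np.arange(-8.0, 8.0),            # well-conditioned weights
    "gate":     np.array([100.0] + [0.5] * 15),  # outlier-dominated weights
}
print(select_low_precision(layers))  # → ['mlp_proj']
```

The outlier-dominated layer fails the gate because one large weight stretches the quantisation scale, crushing the resolution available to the rest of the tensor.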
2. Operator fusion for shared inputs¶
"We fuse operators that share inputs to minimize data movement between high-bandwidth memory and on-chip SRAM."
Fusing operators that read the same tensor means the tensor is loaded from HBM once, kept in on-chip SRAM across the fused operations, and results are written back once. HBM read traffic for that tensor drops roughly in proportion to the number of fused consumers.
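An idealised traffic model makes the saving concrete. This sketch ignores SRAM capacity limits and output writes; the tensor size is illustrative:

```python
def hbm_read_bytes(tensor_bytes: int, n_consumers: int, fused: bool) -> int:
    # Unfused: every consuming operator reloads the tensor from HBM.
    # Fused: the tensor is loaded once and stays resident in SRAM for
    # all consumers in the fused kernel.
    return tensor_bytes * (1 if fused else n_consumers)

t = 256 * 1024 * 1024  # a 256 MiB activation tensor (illustrative)
unfused = hbm_read_bytes(t, n_consumers=4, fused=False)
fused = hbm_read_bytes(t, n_consumers=4, fused=True)
print(unfused // fused)  # 4 — read traffic scales with the fusion depth
```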
3. Grouped GEMM + horizontal fusion¶
"Thousands of small operations are consolidated into compute- dense kernels using techniques like Grouped General Matrix Multiply and horizontal fusion."
Grouped GEMM batches many small matrix multiplications into one kernel launch — critical for the wide mix of small linear layers + embedding-aggregation ops typical in ranking models. Horizontal fusion executes independent ops in the same kernel so they share launch overhead and fill the SMs together.
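A NumPy analogue of the batching idea, for a shape-homogeneous group of small linear layers. Real Grouped GEMM (e.g. as implemented in GPU kernel libraries) additionally handles groups with *different* shapes in a single launch, which NumPy cannot express:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1,000 small linear layers that happen to share one shape: a "group".
A = rng.normal(size=(1000, 16, 32))
B = rng.normal(size=(1000, 32, 8))

# Naive: one call (on GPU, one kernel launch) per small matmul.
naive = np.stack([a @ b for a, b in zip(A, B)])

# Grouped: a single batched call; on GPU this amortises launch overhead
# across the whole group and keeps the SMs fed with one dense kernel.
grouped = np.matmul(A, B)

assert np.allclose(naive, grouped)  # same math, one launch instead of 1,000
```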
4. Graph alignment with hardware¶
"This precise alignment between the computation graph and modern GPU architectures significantly reduces the memory footprint and increases effective hardware utilization, ensuring that LLM-scale model complexity translates directly into performance."
The model architecture itself is chosen with the hardware in mind — so shape decisions (matrix dimensions, attention patterns, layer structure) hit Tensor Core paths cleanly.
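The cost of misalignment is easy to quantify: a GEMM dimension that is not a multiple of the Tensor Core tile edge gets padded up, and the padding is wasted work. A sketch with an illustrative tile edge of 16 (the efficient multiple varies by GPU generation and dtype):

```python
def pad_to_tile(dim: int, tile: int = 16) -> int:
    # Round a matrix dimension up to the next tile multiple (ceil division).
    return -(-dim // tile) * tile

def tile_utilisation(m: int, n: int, tile: int = 16) -> float:
    # Fraction of the padded tile grid doing useful work.
    return (m * n) / (pad_to_tile(m, tile) * pad_to_tile(n, tile))

print(tile_utilisation(100, 100))  # misaligned: 100x100 padded to 112x112, ~0.797
print(tile_utilisation(128, 128))  # tile-aligned by construction: 1.0
```

Choosing dimensions that are tile multiples at model-design time removes this waste entirely, rather than recovering it downstream.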
Forces¶
- Serving latency is pinned by user-facing SLOs (sub-second for ads).
- Serving cost is pinned by finance — "add more hardware" is not a viable answer at Meta-scale QPS.
- Model complexity is pinned by quality targets — the ML team won't accept a quality regression.
- MFU is the lever that reclaims latency and cost from the hardware without touching the other three constraints.
Consequences¶
Positive:
- Realises the nominal compute of the model at ~35% of hardware peak — roughly 3.5-7× better than the 5-10% MFU typical of naive deployments.
- Reduces latency + per-request cost without reducing model complexity.
- Works across heterogeneous hardware (Meta states "across multiple hardware types").
Negative / tradeoffs:
- Heavy engineering cost — model and systems engineers co-author the model architecture, kernels, and graph compilation together, rather than handing off between teams.
- Per-hardware tuning — kernel-level optimisations have portability limits; heterogeneous fleets require per-target work.
- Harder to evolve the model — architectural changes now require re-verification of MFU impact, not just quality impact.
Canonical industrial instance¶
- Meta Adaptive Ranking Model (2026-03-31) — canonical recsys instance. Named outcome: 35% MFU across heterogeneous hardware via selective FP8 + operator fusion + Grouped GEMM + horizontal fusion + graph alignment.
Related patterns¶
- patterns/request-centric-inference-architecture — the pipeline-shape pattern co-deployed in Adaptive Ranking Model; co-design works best when the architecture has already reduced redundant computation.
- patterns/multi-card-sharded-embedding-serving — the memory-side sibling; addresses embedding-scale rather than compute-scale.
Seen in¶
- 2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).