Weight-only vs activation quantization¶
Pattern¶
When deploying quantized attention models on GPU Tensor Cores, choose between weight-only quantization (A16W4-style: high-precision activations, low-bit weights) and activation quantization (A8W8-style: both activations and weights at 8-bit) based on whether the inference workload is memory-bound or compute-bound. Neither is a universal win (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Why¶
Under pre-MXFP formats, when activations and weights use different bit-widths, weights must be explicitly dequantized up to activation precision before the MMA. The MMA still runs at the higher activation precision — the dequant is pure arithmetic overhead on the MMA path.
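The explicit dequant step can be sketched in NumPy (a minimal illustration, not actual Tensor Core kernel code; the single per-tensor `scale` and `zero_point` are a simplifying assumption, since real W4 kernels use per-group scales):

```python
import numpy as np

def dequantize_w4(q, scale, zero_point):
    # Expand 4-bit integer weights up to 16-bit floats.
    # This expansion is the "explicit dequant" arithmetic
    # that sits on the MMA path under pre-MXFP formats.
    return scale * (q.astype(np.float16) - zero_point)

def a16w4_matmul(x_fp16, q_w4, scale, zero_point):
    # Activations stay at 16-bit; weights are dequantized to match,
    # so the matmul itself still runs at the higher 16-bit precision.
    w_fp16 = dequantize_w4(q_w4, scale, zero_point)
    return x_fp16 @ w_fp16
```

When the matrix unit would otherwise sit idle waiting on HBM, this extra arithmetic is effectively free; when the matrix unit is the bottleneck, it is pure overhead.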
This cost has opposite sign in different regimes:
| Regime | Dominant constraint | Winning strategy | Why |
|---|---|---|---|
| Memory-bound (small batch, reasoning, decoding) | HBM bandwidth moving weights | A16W4 (weight-only) | Weights are what's moving; 4× less data movement; dequant arithmetic is free because the matrix unit would otherwise be idle |
| Compute-bound (large-context prefill, high-throughput serving) | Tensor Core FLOPS | A8W8 (activation quantization) | MMA itself is the bottleneck; dropping activation precision from 16-bit to 8-bit ≈ doubles MMA throughput; explicit dequant overhead eats into this and flips the economics |
Dropbox's empirical finding: A16W4 often performs worse than 16-bit matmul under compute-bound conditions (Figure 2 of the post) because the dequant cost outweighs the memory-bandwidth savings when bandwidth isn't the bottleneck (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
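The regime split above can be made concrete with a simplified roofline check that compares a GEMM's arithmetic intensity against the hardware's ridge point (a sketch that counts only weight traffic from HBM; the peak-FLOPS and bandwidth figures in the test values are illustrative H100-class numbers, not taken from the post):

```python
def gemm_regime(m, n, k, bytes_per_weight, peak_flops, mem_bw):
    # FLOPs for an (m x k) @ (k x n) matmul.
    flops = 2 * m * n * k
    # Bytes of weights streamed from HBM (activation traffic ignored
    # for simplicity -- weights dominate at small batch).
    weight_bytes = k * n * bytes_per_weight
    intensity = flops / weight_bytes   # FLOPs per weight byte moved
    ridge = peak_flops / mem_bw        # roofline ridge point
    return "compute-bound" if intensity > ridge else "memory-bound"
```

For 16-bit weights the intensity reduces to roughly `m` FLOPs per byte, which is why batch size (or prefill length) is the variable that flips a workload between the two rows of the table.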
When to use which¶
- A16W4 wins — local deployments, single-user serving, small-batch inference, reasoning-heavy decoding, KV-cache-light workloads. Bandwidth-wise, the model's weights are the payload; shrinking them 4× is the main lever.
- A8W8 wins — multi-tenant API serving, large-context prefills, batch inference, long-document summarization. Tensor Cores are the bottleneck; narrowing activations roughly doubles throughput.
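As a rule of thumb, the mapping above can be scripted (a heuristic sketch; the token threshold is a hypothetical placeholder to be calibrated per deployment, not a number from the post):

```python
def pick_quant_strategy(batch_size, seq_len, threshold_tokens=256):
    # Tokens per forward pass is a rough proxy for arithmetic intensity:
    # few tokens -> weight movement dominates -> weight-only quantization;
    # many tokens -> Tensor Core FLOPS dominate -> activation quantization.
    tokens = batch_size * seq_len
    return "A8W8" if tokens >= threshold_tokens else "A16W4"
```

In practice the crossover point should come from profiling the actual fleet, not a fixed constant.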
Why Dropbox runs both¶
Dropbox Dash is not one workload shape — the post names several distinct workloads (conversational AI, multimodal search, document understanding, speech processing) each with different latency-vs-throughput profiles. Running multiple quantization strategies on the same model-serving fleet lets Dropbox match the strategy to each workload's operating point rather than picking one for everything (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
How MXFP changes the trade-off¶
Once quantization runs under patterns/hardware-native-quantization via MXFP/NVFP, the software dequant step disappears — activations and weights at different precisions (e.g. MXFP8 × MXFP4) can be consumed by one fused MMA instruction. The compute-bound loss of A16W4 was fundamentally a dequant-overhead cost, not an inherent property of asymmetric precision, so mixed-precision MMAs on MXFP-capable hardware change the calculus.
The pattern stays useful even post-MXFP as a diagnostic frame ("am I memory-bound or compute-bound?") because it maps directly onto which axis to compress.
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — Dropbox's landscape survey. Figure 2 is the canonical visualization of A16W4 underperforming 16-bit MM under compute-bound regimes; the text explicitly ties the strategy choice to workload shape.