
Weight-only vs activation quantization

Pattern

When deploying quantized attention models on GPU Tensor Cores, choose between weight-only quantization (A16W4-style: high-precision activations, low-bit weights) and activation quantization (A8W8-style: both activations and weights at 8-bit) based on whether the inference workload is memory-bound or compute-bound. Neither is a universal win (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Why

Under pre-MXFP formats, when activations and weights use different bit-widths, the weights must be explicitly dequantized to activation precision before the matrix multiply-accumulate (MMA) instruction can consume them. The MMA still runs at the higher activation precision; the dequant step is pure arithmetic overhead on the MMA path.
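A minimal numpy sketch of this path, assuming a simple per-group symmetric int4 scheme (the group size of 128 and the function names are illustrative, not Dropbox's implementation): the 4-bit weights are expanded back to fp16 before the matmul, so the matmul itself does no less work than in the unquantized case.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_w4(w, group=128):
    """Per-group symmetric int4 quantization (integer values in [-8, 7])."""
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q, scale, shape):
    """Explicit dequant back to fp16 -- the extra arithmetic on the MMA path."""
    return (q.astype(np.float16) * scale.astype(np.float16)).reshape(shape)

w = rng.standard_normal((256, 256)).astype(np.float16)
x = rng.standard_normal((8, 256)).astype(np.float16)   # small-batch activations

q, s = quantize_w4(w)
w_hat = dequantize_w4(q, s, w.shape)   # overhead step: int4 -> fp16
y = x @ w_hat.T                        # the matmul still runs at fp16
```

Only the storage and data movement of `w` shrink to 4 bits; the matmul's FLOP count is unchanged, which is why this only pays off when moving `w` is the bottleneck.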

This cost has opposite sign in different regimes:

| Regime | Dominant constraint | Winning strategy | Why |
|---|---|---|---|
| Memory-bound (small batch, reasoning, decoding) | HBM bandwidth moving weights | A16W4 (weight-only) | Weights are what's moving; 4× less data movement; dequant arithmetic is effectively free because the matrix unit would otherwise sit idle |
| Compute-bound (large-context prefill, high-throughput serving) | Tensor Core FLOPS | A8W8 (activation quantization) | The MMA itself is the bottleneck; dropping activation precision from 16-bit to 8-bit roughly doubles MMA throughput; explicit dequant overhead eats into this and flips the economics |
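The memory-bound/compute-bound question can be answered with a back-of-the-envelope roofline check. The sketch below is illustrative (the peak-FLOPS and bandwidth figures are placeholders, not any specific GPU): a GEMM is memory-bound when its arithmetic intensity falls below the machine's balance point.

```python
def gemm_regime(m, n, k, bytes_per_weight, bytes_per_act,
                peak_flops=1.0e15, hbm_bw=3.0e12):
    """Classify an (m,k) x (k,n) matmul as memory- or compute-bound.

    peak_flops and hbm_bw are placeholder hardware numbers; the ratio
    peak_flops / hbm_bw is the roofline ridge point (FLOPs per byte).
    """
    flops = 2 * m * n * k
    traffic = (k * n * bytes_per_weight    # weight matrix
               + m * k * bytes_per_act     # input activations
               + m * n * bytes_per_act)    # output activations
    intensity = flops / traffic            # FLOPs per byte moved
    ridge = peak_flops / hbm_bw
    return 'memory-bound' if intensity < ridge else 'compute-bound'

# Decode step (batch 1, 4096x4096 layer, fp16): weights dominate traffic
print(gemm_regime(1, 4096, 4096, 2, 2))     # -> memory-bound, favors A16W4
# Long-context prefill (m = 8192): weight traffic is amortized across rows
print(gemm_regime(8192, 4096, 4096, 2, 2))  # -> compute-bound, favors A8W8
```

The same weight matrix lands on opposite sides of the ridge purely as a function of batch/sequence shape, which is why no single quantization strategy wins everywhere.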

Dropbox's empirical finding: A16W4 often performs worse than 16-bit matmul under compute-bound conditions (Figure 2 of the post) because the dequant cost outweighs the memory-bandwidth savings when bandwidth isn't the bottleneck (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

When to use which

  • A16W4 wins — local deployments, single-user serving, small-batch inference, reasoning-heavy decoding, KV-cache-light workloads. The model's weights are the payload bandwidth-wise; shrinking them 4× is the main lever.
  • A8W8 wins — multi-tenant API serving, large-context prefills, batch inference, long-document summarization. Tensor Cores are the bottleneck; narrowing activations roughly doubles throughput.

Why Dropbox runs both

Dropbox Dash is not one workload shape — the post names several distinct workloads (conversational AI, multimodal search, document understanding, speech processing) each with different latency-vs-throughput profiles. Running multiple quantization strategies on the same model-serving fleet lets Dropbox match the strategy to each workload's operating point rather than picking one for everything (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

How MXFP changes the trade-off

Once quantization runs under patterns/hardware-native-quantization via MXFP/NVFP, the software dequant step disappears: activations and weights at different precisions (e.g. MXFP8 × MXFP4) can be consumed by one fused MMA instruction. The compute-bound loss of A16W4 was fundamentally a dequant-overhead cost, not an inherent property of asymmetric precision, so mixed-precision MMAs on MXFP-capable hardware change the calculus.
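To make the format concrete, here is a hedged fake-quantization sketch of an MXFP4-style microscaling layout: each block of 32 values shares one power-of-two scale, and elements are snapped to the FP4 (E2M1) value grid. This models only the numerics of the format; on MXFP-capable hardware the scale and elements stay packed and are consumed directly by the MMA, with no software expansion step.

```python
import numpy as np

# Magnitudes representable by FP4 (E2M1)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_mxfp4(x, block=32):
    """Quantize-dequantize round trip through an MXFP4-style format:
    a shared power-of-two scale per 32-element block, elements rounded
    to the nearest FP4 (E2M1) value."""
    g = x.reshape(-1, block)
    amax = np.abs(g).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)
    # choose the scale so the block max cannot overflow the grid max (6)
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = g / scale
    # nearest-value rounding onto the FP4 grid, sign handled separately
    idx = np.argmin(np.abs(np.abs(scaled)[..., None] - FP4_GRID), axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(x.shape)
```

The per-block power-of-two scale is what makes the format hardware-friendly: applying it is an exponent adjustment rather than a multiply, so mixed-precision operands need no separate dequant pass.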

The pattern stays useful even post-MXFP as a diagnostic frame ("am I memory-bound or compute-bound?") because it maps directly onto which axis to compress.
