NVIDIA Tensor Core¶
Definition¶
NVIDIA Tensor Cores are dedicated matrix multiply-accumulate
(MMA) units embedded in NVIDIA GPUs, accessed through MMA
instructions (mma.sync, tcgen05.mma, and variants). They
accelerate the dense matrix math that dominates attention-based
model inference, delivering substantial throughput gains over
general-purpose CUDA Cores. AMD's architectural counterpart is
Matrix Cores (Source:
sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Why they matter¶
In attention-based models, most compute lives in repeated matrix multiplications in two places (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai):
- Linear layers — projections inside attention blocks + MLP layers + the final output stage.
- Attention mechanism — pairwise token-relationship computation; cost scales with context length.
Both are Tensor Core workloads via MMA instructions. A Dropbox-scale inference deployment (Dash, multimodal search, conversational AI) is effectively bottlenecked by Tensor Core throughput.
Precision scaling — the core economic engine¶
A notable property of Tensor Cores is throughput scaling with precision: as numerical precision halves, these cores can perform roughly twice as many matrix operations per second. The same silicon delivers (approximately):
- FP16 → baseline
- FP8 → ~2× FLOPS
- FP4 → ~4× FLOPS
Figure 1 of the Dropbox post plots this across NVIDIA RTX 6000 variants. This is the economic engine behind concepts/low-bit-inference: lower precision → more matrix ops per second → more tokens per dollar and more tokens per watt.
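The ~2×-per-halving rule can be sketched as a back-of-the-envelope calculation (the FP16 baseline figure below is hypothetical, not a real RTX 6000 spec):

```python
# Hedged sketch: effective Tensor Core throughput scales roughly
# inversely with bit width. Baseline number is illustrative only.
FP16_TFLOPS = 100.0  # hypothetical FP16 Tensor Core baseline


def effective_tflops(bits: int, fp16_baseline: float = FP16_TFLOPS) -> float:
    """Approximate throughput at a given precision, assuming the
    ~2x-per-halving rule described above."""
    return fp16_baseline * (16 / bits)


for bits in (16, 8, 4):
    print(f"FP{bits}: ~{effective_tflops(bits):.0f} TFLOPS")
```

The same scaling factor carries through to tokens per dollar and per watt, since the matrix math dominates the inference cost.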
Instruction-level variants¶
Different compute capabilities expose MMA through different instructions:
| Architecture | MMA instruction | MXFP / NVFP support |
|---|---|---|
| Hopper (sm_90) | wgmma | FP8 via e4m3 / e5m2 |
| Blackwell datacenter (sm_100) | tcgen05.mma | MXFP via block_scale modifier |
| sm_120 | mma.sync with block_scale | MXFP/NVFP support added via Triton |
Kernels compiled for sm_100 are not portable to sm_120
because of instruction-level differences — a real engineering
constraint for cross-device support
(Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
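The portability constraint amounts to a per-architecture dispatch: there is no single MMA path that covers all three compute capabilities. A minimal sketch (instruction names from the table above; the selection logic itself is hypothetical, not a real Triton or CUDA API):

```python
# Illustrative dispatch table only -- real kernel selection happens
# at compile time, not via a Python lookup.
MMA_INSTRUCTION = {
    "sm_90": "wgmma",        # Hopper
    "sm_100": "tcgen05.mma",  # Blackwell datacenter
    "sm_120": "mma.sync",     # with block_scale, via Triton
}


def select_mma(sm: str) -> str:
    """Pick the MMA instruction family for a compute capability.
    A kernel built for one entry does not run on another."""
    try:
        return MMA_INSTRUCTION[sm]
    except KeyError:
        raise ValueError(f"no known MMA path for {sm}") from None
```

A kernel emitted for sm_100 encodes tcgen05.mma, which sm_120 does not implement, hence the need to build and ship per-architecture binaries.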
The block_scale modifier¶
For MXFP and NVFP4 formats, the MMA instructions gain a block_scale modifier that makes the instruction:
- Operate directly on packed low-bit operands (bitpacked)
- Consume per-block shared-scaling metadata (E8M0 for MXFP, E4M3 for NVFP4)
- Perform the multiply-accumulate in place
No software dequant step, no per-element scale multiply — patterns/hardware-native-quantization in action.
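What the hardware fuses can be emulated in software to make the mechanism concrete (a toy sketch: block size and integer codes are illustrative, not the real MXFP bit layouts):

```python
# Software emulation of what block_scale does in hardware: each block
# of low-bit codes shares one exponent-only (E8M0-style) scale, and the
# scale is applied once per block, not once per element.
BLOCK = 32  # MXFP block size


def dequant_block(codes, e8m0_exponent):
    """Expand one block of low-bit codes using its shared scale
    2**(exponent - 127), the E8M0 convention."""
    scale = 2.0 ** (e8m0_exponent - 127)
    return [c * scale for c in codes]


def block_scaled_dot(codes_a, exp_a, codes_b, exp_b):
    """Multiply-accumulate over two blocks. The two shared scales
    factor out of the sum -- the reason no per-element scale multiply
    (and no software dequant pass) is needed."""
    scale = 2.0 ** (exp_a - 127) * 2.0 ** (exp_b - 127)
    return scale * sum(a * b for a, b in zip(codes_a, codes_b))
```

In hardware the integer products and the single rescale happen inside one MMA issue; the sketch only shows why the math permits that fusion.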
Attention-side quantization¶
Tensor Cores aren't just used for linear layers. Methods like Flash Attention 3 and Sage Attention push 8-bit quantization into the attention mechanism itself, using Tensor Cores for the attention-side matrix multiplications with minimal accuracy impact. This matters for long-context workloads like Dash document understanding, where attention compute grows with input length (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
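The attention-side idea can be illustrated with a toy sketch (this is not Flash Attention 3 or Sage Attention themselves, just the underlying pattern: quantize Q and K to int8 with per-tensor scales, run the integer dot products, and rescale the logits once):

```python
# Toy sketch of 8-bit attention-score computation. Real kernels use
# per-block/per-channel scales and fused softmax; this only shows the
# quantize -> integer MMA -> single rescale pattern.
def quantize_int8(xs):
    """Symmetric per-tensor int8 quantization: scale maps amax to 127."""
    amax = max(abs(x) for x in xs) or 1.0
    scale = amax / 127.0
    return [round(x / scale) for x in xs], scale


def int8_score(q, k):
    """Attention logit q . k computed on int8 codes, with the two
    float scales applied once at the end."""
    qi, sq = quantize_int8(q)
    ki, sk = quantize_int8(k)
    return sq * sk * sum(a * b for a, b in zip(qi, ki))
```

Because the error of per-tensor int8 is small relative to the softmax that follows, accuracy impact stays minimal while the score matmuls run on Tensor Cores at int8 rates.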
Limits¶
- Precision scaling doesn't extend to binary/ternary. BitNet-class weights (2–3 levels) don't map onto Tensor Cores — they need specialized accelerators; the Dropbox post cites experimental efforts but notes this path hasn't seen broad industry adoption due to limited ecosystem support and model quality concerns.
- Kernel portability is architecture-specific. Cross-sm compilation is a real operational cost; Triton's recent sm_120 MXFP support helps, but ecosystem-wide FP4 support is still uneven.
- Format support ≠ widely-available models. FP4-trained models are still scarce, so even when Tensor Cores support FP4 natively, the inputs to run at FP4 may not exist for a given model family.
Relationship to Dropbox's hardware¶
Gumby (flexible GPU tier) and Godzilla (dense multi-GPU tier) in Dropbox's 7th-gen hardware rollout are the production substrates where Tensor Core capability directly translates into Dash's latency/cost targets. The quantization strategy stack discussed in the 2026-02-12 post sits on top of this hardware.
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — canonical treatment of Tensor Cores as the hardware substrate for attention-model inference; Figure 1 showing FLOPS scaling across precisions on RTX 6000 variants; the sm_100 vs sm_120 instruction-level portability caveat; the block_scale modifier as the enabling primitive for MXFP / NVFP adoption.
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — canonical instance of the "Tensor Cores ~600× faster than HBM" bandwidth-bound framing; Hopper wgmma + TMA + 228 KB SMEM budget are the concrete primitives Unweight's reconstructive matmul uses. A sibling lossless path to the lossy hardware-native quantization direction.