
NVIDIA Tensor Core

Definition

NVIDIA Tensor Cores are dedicated matrix multiply-accumulate (MMA) units embedded in NVIDIA GPUs, accessed through MMA instructions (mma.sync, tcgen05.mma, and variants). They accelerate the dense matrix math that dominates attention-based model inference, delivering substantial throughput gains over general-purpose CUDA Cores. AMD's architectural counterpart is Matrix Cores (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Why they matter

In attention-based models, most compute lives in repeated matrix multiplications in two places (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai):

  1. Linear layers — projections inside attention blocks, the MLP layers, and the final output stage.
  2. Attention mechanism — pairwise token-relationship computation; cost scales with context length.

Both are Tensor Core workloads via MMA instructions. A Dropbox-scale inference deployment (Dash, multimodal search, conversational AI) is effectively bottlenecked by Tensor Core throughput.
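The split above can be made concrete with a rough per-token FLOP count. A minimal sketch, using made-up example dimensions (d_model, d_ff, n_ctx are illustrative, not from the source):

```python
# Rough per-token FLOP count for one transformer layer, showing where the
# matrix math lives. A multiply-accumulate is counted as 2 FLOPs.

def linear_flops(d_model: int, d_ff: int) -> int:
    """FLOPs per token in the projection-heavy parts: the four attention
    projections (Q, K, V, output: 4 * d^2) plus the two MLP matmuls
    (2 * d * d_ff). Independent of context length."""
    return 2 * (4 * d_model * d_model + 2 * d_model * d_ff)

def attention_flops(d_model: int, n_ctx: int) -> int:
    """FLOPs per token in the pairwise attention matmuls (QK^T and the
    attention-weighted V product). Grows linearly with context length."""
    return 2 * (2 * n_ctx * d_model)

d_model, d_ff = 4096, 16384  # hypothetical example sizes
for n_ctx in (1024, 32768):
    lin = linear_flops(d_model, d_ff)
    att = attention_flops(d_model, n_ctx)
    print(f"n_ctx={n_ctx}: attention share = {att / (lin + att):.0%}")
```

At short context the linear layers dominate; at long context the attention-side matmuls take over — both land on Tensor Cores either way.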

Precision scaling — the core economic engine

A notable property of Tensor Cores is throughput scaling with precision: as numerical precision halves, these cores can perform roughly twice as many matrix operations per second. The same silicon delivers (approximately):

  • FP16 → baseline
  • FP8 → ~2× FLOPS
  • FP4 → ~4× FLOPS

Figure 1 of the Dropbox post plots this across NVIDIA RTX 6000 variants. This is the economic engine behind concepts/low-bit-inference: lower precision → more matrix ops per second → more tokens per dollar and more tokens per watt.
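The scaling rule is simple enough to state as code — an idealized model of "half the bits, twice the ops," not a real GPU spec:

```python
# Idealized precision-scaling model for Tensor Core throughput:
# halving operand width roughly doubles matrix ops per second on the
# same silicon. Baseline is FP16; numbers are relative, not measured.

def relative_flops(bits: int, baseline_bits: int = 16) -> float:
    """Relative matrix throughput vs. the FP16 baseline under the
    idealized inverse-width rule."""
    return baseline_bits / bits

for fmt, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{fmt}: ~{relative_flops(bits):.0f}x FLOPS")
```

Real hardware only approximates this (memory bandwidth and scale handling eat into the gain), which is why the source hedges with "approximately."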

Instruction-level variants

Different compute capabilities expose MMA through different instructions:

  • Hopper (sm_90): wgmma; FP8 via e4m3 / e5m2
  • Blackwell datacenter (sm_100): tcgen05.mma; MXFP via the block_scale modifier
  • sm_120: mma.sync with block_scale; MXFP/NVFP support added via Triton

Kernels compiled for sm_100 are not portable to sm_120 because of instruction-level differences — a real engineering constraint for cross-device support (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
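In practice this constraint surfaces as an explicit dispatch step: a serving stack has to pick (or have compiled) a kernel per architecture rather than assume one binary runs everywhere. A minimal sketch — the table mirrors this note, but the function and dict names are hypothetical, not a real runtime API:

```python
# Hypothetical kernel-selection sketch illustrating why sm_100 kernels
# can't silently run on sm_120: each architecture maps to a different
# MMA instruction family, so dispatch must be explicit.

MMA_INSTRUCTION = {
    "sm_90": "wgmma",          # Hopper
    "sm_100": "tcgen05.mma",   # Blackwell datacenter
    "sm_120": "mma.sync",      # with block_scale, via Triton
}

def select_mma(arch: str) -> str:
    """Return the MMA instruction family for a target architecture,
    failing loudly when no kernel path exists for it."""
    try:
        return MMA_INSTRUCTION[arch]
    except KeyError:
        raise ValueError(f"no MMA kernel path compiled for {arch}") from None

print(select_mma("sm_120"))
```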

The block_scale modifier

For MXFP and NVFP4 formats, the MMA instructions gain a block_scale modifier that lets a single instruction:

  • Consume packed (bitpacked) low-bit operands directly
  • Apply per-block shared-scale metadata (E8M0 for MXFP, E4M3 for NVFP4)
  • Perform the multiply-accumulate in place

No software dequant step, no per-element scale multiply — patterns/hardware-native-quantization in action.
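What the hardware fuses can be emulated in software to see the arithmetic. A toy sketch, assuming the standard FP4 (E2M1) value set and a power-of-two (E8M0-style) shared scale — simplified stand-ins, not the exact PTX bit layouts:

```python
# Software emulation of a block_scale dot product: packed FP4 operands
# plus one shared power-of-two scale per block. Hardware does this
# inside the MMA; here it's unrolled for illustration.

# The 16 FP4 (E2M1) values, indexed by 4-bit code (sign bit on top).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
              -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def unpack_fp4(byte: int) -> tuple:
    """Two FP4 values bitpacked into one byte (low nibble first)."""
    return FP4_VALUES[byte & 0xF], FP4_VALUES[byte >> 4]

def block_dot(packed_a: bytes, scale_exp_a: int,
              packed_b: bytes, scale_exp_b: int) -> float:
    """Dot product of two FP4 blocks, each with an E8M0-style shared
    scale (a pure power of two) applied once per block."""
    a = [v for byte in packed_a for v in unpack_fp4(byte)]
    b = [v for byte in packed_b for v in unpack_fp4(byte)]
    scale = 2.0 ** scale_exp_a * 2.0 ** scale_exp_b
    return scale * sum(x * y for x, y in zip(a, b))

# Blocks [1.0, 2.0] and [1.5, 0.5], with scales 2^1 and 2^0:
pa = bytes([0x42])  # low nibble 0x2 -> 1.0, high nibble 0x4 -> 2.0
pb = bytes([0x13])  # low nibble 0x3 -> 1.5, high nibble 0x1 -> 0.5
print(block_dot(pa, 1, pb, 0))  # 2 * (1.0*1.5 + 2.0*0.5) = 5.0
```

The key point the note makes: there is no separate dequantize pass or per-element scale multiply — one scale per block, folded into the accumulate.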

Attention-side quantization

Tensor Cores aren't just used for linear layers. Methods like Flash Attention 3 and Sage Attention push 8-bit quantization into the attention mechanism itself, using Tensor Cores for the attention-side matrix multiplications with minimal accuracy impact. This matters for long-context workloads like Dash document understanding, where attention compute grows with input length (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
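The idea behind attention-side 8-bit quantization can be sketched in miniature: quantize Q and K rows to int8 with per-row scales, accumulate in integers, rescale the result. A toy illustration in the spirit of these methods, not the actual Flash Attention 3 or Sage Attention kernels:

```python
# Toy sketch of 8-bit attention-side quantization: symmetric per-row
# int8 quantization of Q and K, integer multiply-accumulate for QK^T,
# then a single rescale. Illustrative only.

def quantize_row(row: list) -> tuple:
    """Symmetric per-row int8 quantization: x ~= q * scale.
    Falls back to scale 1.0 for an all-zero row."""
    scale = max(abs(x) for x in row) / 127 or 1.0
    return [round(x / scale) for x in row], scale

def int8_dot(q_row: list, k_row: list) -> float:
    """Approximate <q, k> with int8 operands and per-row scales,
    mirroring how Tensor Cores accumulate the attention matmul."""
    qi, qs = quantize_row(q_row)
    ki, ks = quantize_row(k_row)
    acc = sum(a * b for a, b in zip(qi, ki))  # integer accumulate
    return acc * qs * ks

q = [0.12, -0.5, 0.33, 0.9]
k = [0.4, 0.1, -0.2, 0.7]
exact = sum(a * b for a, b in zip(q, k))
print(f"exact={exact:.4f} int8={int8_dot(q, k):.4f}")
```

The approximation error stays small because attention scores are bounded before the softmax — the intuition behind "minimal accuracy impact" in the source.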

Limits

  • Precision scaling doesn't extend to binary/ternary. BitNet-class weights (2–3 levels) don't map onto Tensor Cores — they need custom, specialized accelerators; the Dropbox post cites experimental efforts but notes this path hasn't seen broad industry adoption due to limited ecosystem support and model quality concerns.
  • Kernel portability is architecture-specific. Cross-sm compilation is a real operational cost; Triton's recent sm_120 MXFP support helps, but ecosystem-wide FP4 support is still uneven.
  • Format support ≠ widely-available models. FP4-trained models are still scarce, so even when Tensor Cores support FP4 natively, the inputs to run at FP4 may not exist for a given model family.

Relationship to Dropbox's hardware

Gumby (flexible GPU tier) and Godzilla (dense multi-GPU tier) in Dropbox's 7th-gen hardware rollout are the production substrates where Tensor Core capability directly translates into Dash's latency/cost targets. The quantization strategy stack discussed in the 2026-02-12 post sits on top of this hardware.
