Autotuned execution pipeline selection¶
Statement¶
When a workload has multiple viable execution pipelines with orthogonal performance trade-offs (e.g. preprocessing cost vs per-element work), don't hard-code one. Offer the spectrum and let a runtime autotuner sweep candidate configurations against measured end-to-end throughput on the target hardware — per workload unit, per input shape, per batch size.
When it applies¶
- Candidate pipelines exist, each with different bottleneck profiles across input size / batch size / weight shape.
- The cost model is non-trivial — heuristics fail because real performance depends on cross-cutting factors (cache behaviour, SMEM budget collisions, kernel launch overhead, compute-unit occupancy) that don't compose from first principles.
- End-to-end throughput is measurable at the target deployment (real GPU, real model, realistic workload).
Problem¶
Hard-coding one strategy — "always use cuBLAS for the matmul", "always decompress everything before the matmul", "always use the custom reconstructive kernel" — loses on the workloads that strategy is wrong for. But hand-tuning per workload doesn't scale across the matrix of (model architecture) × (weight shape) × (batch size) × (hardware generation).
Solution¶
Enumerate a small number of pipelines covering the trade-off spectrum; build a runtime that can select dynamically; add an autotuner that measures actual end-to-end throughput on-target and picks the best configuration per work unit.
Unweight's four-pipeline spectrum¶
Four pipelines spanning the preprocessing-cost-vs-matmul-work trade-off:
| Pipeline | Preprocess | Matmul kernel | Preprocess HBM writes | Matmul work |
|---|---|---|---|---|
| Full decode | Huffman → full BF16 in HBM | stock cuBLAS | most | least |
| Exponent-only | decode exponent → HBM | reconstructive | ½ of full | medium |
| Palette transcode | transcode to 4-bit palette → HBM | reconstructive | ¼ of full | medium |
| Direct palette | none (palette pre-baked at load) | reconstructive | 0 | most |
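As a sketch, the spectrum above can be encoded as a small static table the runtime carries around — the names and the `PipelineProfile` fields here are illustrative, not Unweight's actual types; only the relative HBM-write fractions come from the table:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Pipeline(Enum):
    FULL_DECODE = auto()        # Huffman -> full BF16 in HBM, stock cuBLAS matmul
    EXPONENT_ONLY = auto()      # decode exponents to HBM, reconstructive matmul
    PALETTE_TRANSCODE = auto()  # transcode to 4-bit palette in HBM, reconstructive matmul
    DIRECT_PALETTE = auto()     # no preprocess: palette pre-baked at load time

@dataclass(frozen=True)
class PipelineProfile:
    pipeline: Pipeline
    hbm_write_fraction: float   # preprocess HBM writes relative to full decode
    uses_cublas: bool           # stock cuBLAS vs custom reconstructive kernel

# Relative HBM-write costs from the table above (full decode = 1.0).
SPECTRUM = [
    PipelineProfile(Pipeline.FULL_DECODE, 1.0, True),
    PipelineProfile(Pipeline.EXPONENT_ONLY, 0.5, False),
    PipelineProfile(Pipeline.PALETTE_TRANSCODE, 0.25, False),
    PipelineProfile(Pipeline.DIRECT_PALETTE, 0.0, False),
]
```

The point of making the spectrum a value rather than a branch is that the autotuner can iterate over it mechanically.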
Autotuner procedure¶
"Sweep candidate configurations for the gate projection while holding up and down fixed, then sweep up, then down, repeating until no further improvement is found. The result is a per-model configuration file that tells the runtime exactly which pipeline, matmul variant, and SM allocation to use for each projection at each batch size — all driven by measured performance rather than heuristics."
Sweep dimensions:
- Which of the four pipelines
- Matmul-kernel variant (output tile width, circular-buffer depth)
- SM partition between decode and matmul
Output: a per-(model, hardware) configuration file that the runtime dispatches against at serving time.
Why no single pipeline wins¶
"There's no single best way to use compressed weights during inference. The right approach depends on the workload — the batch size, the shape of the weight matrix, and how much GPU time is available for decompression."
- Small batch (1–64 tokens) → matmul is small, kernel-launch overhead dominates, full decode + cuBLAS usually wins.
- Large batch (256+ tokens) → the matmul runs long enough to absorb reconstruction work; palette / exponent pipelines pull ahead because the preprocess overhead fades and freeing HBM bandwidth sooner matters.
- Different weight matrices within the same layer (gate / up / down) — different shapes → different matmul-tile geometries → different optimal pipeline.
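At serving time, all of this collapses into a lookup keyed by projection and batch-size bucket. A sketch of that dispatch, where the bucket boundaries and pipeline assignments are made-up illustrations (not Unweight's measured results), though the small-batch and large-batch entries follow the tendencies listed above:

```python
import bisect

# Batch-size bucket upper bounds: <=64 is bucket 0, 65-256 is bucket 1, >256 is bucket 2.
BUCKETS = [64, 256]

# (projection, bucket index) -> pipeline, as an autotuner might emit it.
AUTOTUNED = {
    ("gate", 0): "full_decode",        # small batch: launch overhead dominates
    ("gate", 1): "exponent_only",
    ("gate", 2): "palette_transcode",  # large batch: matmul absorbs reconstruction
    ("up",   0): "full_decode",
    ("up",   1): "exponent_only",
    ("up",   2): "direct_palette",
    ("down", 0): "full_decode",
    ("down", 1): "palette_transcode",
    ("down", 2): "direct_palette",
}

def select_pipeline(projection: str, batch_size: int) -> str:
    """Pure table lookup at serving time; all measurement happened offline."""
    bucket = bisect.bisect_left(BUCKETS, batch_size)
    return AUTOTUNED[(projection, bucket)]
```

Different projections legitimately land in different rows at the same batch size — that is the per-shape effect the third bullet describes.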
Canonical wiki instances¶
- Unweight (2026-04-17) — four pipelines, empirical autotuning, per-projection-per-batch-size config file.
- Infire (Cloudflare, 2026-04-16) — analogous runtime shape: different inference strategies selected per workload via measurement; sibling in the Workers AI stack.
Sibling patterns¶
- patterns/measurement-driven-micro-optimization — same discipline at a smaller granularity (single function / single kernel); autotuned-pipeline-selection is measurement-driven optimization applied at the pipeline grain.
- patterns/fused-decompress-tensor-core-matmul — the kernel shape that three of Unweight's four pipelines share.
- patterns/sm-partitioning-producer-consumer — one of the autotuned knobs inside those three pipelines.
Trade-offs¶
- Autotuning is expensive — sweeping the full (pipeline × variant × SM split) × (model × weight matrix × batch size) grid takes real GPU time. Unweight amortises by producing a static config file per (model, hardware) rather than re-autotuning per request.
- Empirical results don't generalise across hardware generations — a Hopper autotune doesn't carry to Blackwell due to different MMA instructions + different SMEM budget + different bandwidth ratios.
- Honest framing required — autotuner finds the best of the offered pipelines, not globally optimal. Adding a fifth pipeline may unlock additional headroom that no autotune over four could reach.
Seen in¶
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — canonical wiki instance.
Related¶
- systems/unweight — production deployment.
- patterns/measurement-driven-micro-optimization — sibling pattern at smaller grain.
- patterns/fused-decompress-tensor-core-matmul, patterns/sm-partitioning-producer-consumer — patterns composed under this one.
- concepts/memory-bandwidth-bound-inference — the regime where the pipeline trade-offs are load-bearing.