
PATTERN Cited by 1 source

Hardware-software codesign for ML serving

Hardware-software codesign for ML serving is the design discipline of choosing or tuning ML-serving algorithms (decoding strategies, batching policies, attention implementations, quantisation choices) jointly with the hardware substrate's characteristics (matrix-engine shape, on-chip-memory size and bandwidth, parallel-compute width, interconnect topology) rather than treating algorithm and hardware as orthogonal layers.

Why it matters

A serving algorithm chosen for its theoretical acceptance rate or throughput, then ported naively to a substrate, almost always leaves performance on the table. The substrate has its own structural preferences:

  • TPU-style architectures prefer regular dense matmul shapes with predictable memory-access patterns.
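  • GPU-style architectures with attention-specialised kernels (FlashAttention, etc.) prefer tiled access patterns that keep the working set in on-chip memory, so the shapes that run well differ from the TPU case.
  • HBM-bandwidth-bound workloads reward algorithms that amortise weight loads across more useful work per pass. LLM decoding is typically in this regime: each decode step re-reads the model weights and the KV cache to produce only a few tokens, so it is memory-bound rather than compute-bound, and more so on long contexts.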

The codesign discipline asks: "given this substrate's compute / memory / interconnect profile, which algorithmic shape extracts the most useful tokens per second?" — and tunes the algorithm to fit, rather than choosing the algorithm in substrate-blind isolation.
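A minimal roofline-style sketch of that question follows. Everything in it is an assumption made for illustration: the Substrate class, the function names, and the hardware numbers are hypothetical, and the model ignores KV-cache reads, interconnect, and drafter cost. It only shows why processing more useful tokens per weight load raises throughput until the pass becomes compute-bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substrate:
    """Hypothetical accelerator profile (illustrative numbers, not vendor specs)."""
    peak_flops: float     # peak matrix throughput, FLOP/s
    hbm_bandwidth: float  # sustained HBM bandwidth, bytes/s


def pass_time(sub: Substrate, weight_bytes: float,
              flops_per_token: float, tokens_per_pass: int) -> float:
    """Time for one forward pass: the slower of weight streaming and compute."""
    memory_time = weight_bytes / sub.hbm_bandwidth
    compute_time = flops_per_token * tokens_per_pass / sub.peak_flops
    return max(memory_time, compute_time)


def tokens_per_second(sub: Substrate, weight_bytes: float,
                      flops_per_token: float, tokens_per_pass: int) -> float:
    """Throughput if every token processed in the pass is useful."""
    return tokens_per_pass / pass_time(sub, weight_bytes,
                                       flops_per_token, tokens_per_pass)


if __name__ == "__main__":
    # Assumed example: an 8B-parameter model in 2-byte weights,
    # roughly 2 FLOPs per parameter per token.
    sub = Substrate(peak_flops=4e14, hbm_bandwidth=1e12)
    weight_bytes = 16e9
    flops_per_token = 16e9
    for n in (1, 8, 64, 800):
        rate = tokens_per_second(sub, weight_bytes, flops_per_token, n)
        print(f"{n:4d} tokens/pass -> {rate:10.1f} tokens/s")
```

In this toy model throughput grows linearly with tokens per pass while the pass is memory-bound, then flattens once compute time exceeds weight-streaming time. Where that crossover sits is a property of the substrate, which is why the same algorithmic shape does not transfer unchanged between substrates.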

Canonical wiki instance

The 2026-05-28 Google Research I/O 2026 roundup post is the wiki's canonical instance of this pattern at the LLM-serving layer. Google's speculative-decoding extensions (block verification and tree-structured drafting) are described as optimised specifically for Google's TPU architecture:

"Our implementation is highly optimized for Google's TPU architecture, maximizing hardware utilization to deliver substantially faster responses with no loss in quality. This work enabled the current speed of Gemini 3.5 Flash." (Source: sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026)

The codesign claim has two structural moves:

  • Algorithm-side: choose tree-shaped drafting + block-level verification because these algorithmic shapes map cleanly to TPU matrix-multiply hardware over shared-prefix KV-cache state.
  • Substrate-side: tune kernel layouts, pod sharding, and compiler scheduling for the resulting tree+block computation pattern.
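To make the algorithm-side move concrete, here is a minimal sketch of one common way to verify a draft tree in a single pass: an ancestor-only attention mask over the drafted tokens, with the shared prefix attended to by every node. The source does not say that Google's implementation works this way; the function name and the parent-pointer encoding are assumptions made for illustration.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Build an ancestor-only attention mask for a draft tree.

    parents[i] is the index of node i's parent among the drafted tokens,
    or -1 if node i hangs directly off the shared prefix. Nodes must be
    listed in topological order (parents[i] < i).

    mask[i][j] is True when drafted token i may attend to drafted token j,
    that is, when j is i itself or one of i's ancestors. Attention to the
    shared prefix is allowed for every node and is not represented here.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        if not -1 <= p < i:
            raise ValueError(f"node {i} has invalid parent {p}")
        if p >= 0:
            mask[i] = list(mask[p])  # inherit the parent's ancestor set
        mask[i][i] = True            # every token attends to itself
    return mask


if __name__ == "__main__":
    # Draft tree: node 0 is a root; nodes 1 and 2 branch from 0; node 3 extends 1.
    for row in tree_attention_mask([-1, 0, 0, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because sibling branches never attend to each other, all candidate continuations can be packed into one dense batch of positions and scored in a single matmul-heavy pass, which is the kind of regular shape the substrate-side tuning then targets.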

The result is a Gemini 3.5 Flash serving stack whose speed claim depends on both the algorithm and the substrate — neither in isolation.

Sibling instances on the wiki

  • systems/eagle-3 as drafter for Kimi K2.5 on Cloudflare Workers AI is a codesign at the drafter-target-vocabulary axis — EAGLE-3 is trained per target model precisely so the vocabularies match and parallel verification can run in one pass on the available GPU substrate.
  • systems/coral-npu is a codesign instance one layer below — the silicon itself is designed "reversing traditional chip design… prioritizing the ML matrix engine over scalar compute" (ML-first architecture) so that the resulting chip and the ML workloads it runs match by construction rather than by retrofit.

Failure modes

  • Algorithm-substrate mismatch — choosing an algorithm optimal on substrate A and shipping it on substrate B; the speed-up vanishes or reverses.
  • Premature codesign lock-in — over-tuning to the current substrate generation makes migration to next-generation hardware expensive.
  • Codesign opacity — the "X% faster" claim is not portable without disclosing the substrate; reproducing the speed-up on different hardware requires re-tuning.
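The first failure mode can be illustrated with a small model of speculative decoding. It uses the standard simplification that each drafted token is accepted independently with probability alpha, which gives an expected (1 - alpha^(k+1)) / (1 - alpha) tokens per verification pass for a chain of k drafted tokens. The relative cost model, the ridge_tokens parameter (how many positions a substrate can verify per pass before it becomes compute-bound), and all numbers are hypothetical, and drafter cost is ignored.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass for draft length k,
    assuming independent per-token acceptance with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def relative_pass_cost(k: int, ridge_tokens: float) -> float:
    """Cost of verifying k + 1 positions relative to a one-token decode pass.

    Below ridge_tokens the pass is memory-bound and costs the same as a
    single-token pass; above it the cost grows with the number of positions.
    """
    return max(1.0, (k + 1) / ridge_tokens)


def speedup(alpha: float, k: int, ridge_tokens: float) -> float:
    """Speed-up over plain decoding under the assumptions above."""
    return expected_tokens(alpha, k) / relative_pass_cost(k, ridge_tokens)


if __name__ == "__main__":
    alpha, k = 0.7, 15
    print("substrate A (ridge 64):", round(speedup(alpha, k, 64), 2))
    print("substrate B (ridge 4): ", round(speedup(alpha, k, 4), 2))
```

With these assumed numbers, the draft length that yields a speed-up above 3 on the substrate with a high ridge point yields a slowdown on the one with a low ridge point. The algorithm is unchanged; only the substrate profile differs.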

Seen in

  • sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026
