
EAGLE-3 (drafter model)

Overview

EAGLE-3 (Extrapolation Algorithm for Greater Language-model Efficiency, version 3) is NVIDIA's drafter model for speculative decoding, packaged per target model as a small companion network that produces candidate next-token sequences for the expert model to verify in parallel. Cloudflare Workers AI uses the Kimi-K2.5-specific variant nvidia/Kimi-K2.5-Thinking-Eagle3 as the drafter for Kimi K2.5. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Role

Speculative decoding needs a small, fast drafter that shares the target model's tokenizer + vocabulary so draft tokens can be parallel-verified in one expert forward pass. EAGLE is the lineage of drafters that NVIDIA trains and distributes for popular target models. "To do this with Kimi K2.5, we leverage NVIDIA's EAGLE-3 (Extrapolation Algorithm for Greater Language-model Efficiency) draft model."
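The draft-verify loop this describes can be sketched with toy stand-ins. The `drafter`/`expert` functions and the greedy fallback below are purely illustrative, not Cloudflare's or NVIDIA's implementation; in the real system the drafter is a small network and verification is one batched expert forward pass.

```python
import random

random.seed(1)

def drafter(context):
    # Cheap, mostly-right guess at the next token (hypothetical).
    return (context[-1] + 1) % 10

def expert(context):
    # Occasionally disagrees, simulating a harder next token (hypothetical).
    nxt = (context[-1] + 1) % 10
    return nxt if random.random() < 0.8 else (nxt + 1) % 10

def speculative_step(context, n_draft):
    """Draft n_draft tokens, then verify them against the expert.

    The accepted prefix is kept; at the first mismatch the expert's own
    token replaces the rejected draft and the step ends.
    """
    drafts, ctx = [], list(context)
    for _ in range(n_draft):
        t = drafter(ctx)
        drafts.append(t)
        ctx.append(t)

    # "Parallel" verification: one expert pass scores every draft
    # position; with a toy expert we simply loop over positions.
    accepted, ctx = [], list(context)
    for t in drafts:
        e = expert(ctx)
        if e == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(e)  # expert's correction, then stop
            break
    return accepted

out = speculative_step([3], n_draft=4)
print(out)
```

Every accepted draft token is a token the expert emitted without its own sequential decode step, which is where the latency win comes from.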

EAGLE's design includes (from the broader literature, not all of it in the Cloudflare post):

  • Layer-aware draft generation — the drafter taps features from early/mid layers of the target model rather than running as a fully independent model.
  • Training-free sampling discipline compatible with the target's rejection-sampling rules.
  • High acceptance rate at modest N (number of tokens drafted per verify-pass) on typical instruction-following / chat traffic.

The raw post does not reproduce the internal architecture beyond the link to the Hugging Face model card; this wiki page does not reconstruct it from external sources.

Why it shines on agentic workloads (per the post)

"In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope."

The drafter's acceptance rate is high on structurally predictable generations: JSON envelopes, tool-call schemas, MCP response formats. Each accepted token is a full expert-pass avoided, so high-acceptance regions compound into large end-to-end speedups.
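The compounding can be made concrete with the standard idealised model from the speculative-decoding literature: if each draft token is accepted independently with per-token probability alpha, one verify-pass yields a geometric-series expected token count. The alpha values below are illustrative, not Cloudflare's measurements.

```python
def expected_tokens_per_pass(alpha, n):
    """Expected tokens emitted per expert verify-pass when each of n
    drafted tokens is accepted independently with probability alpha
    (idealised model; Cloudflare discloses no acceptance numbers)."""
    # Geometric series sum_{k=0}^{n} alpha^k: the pass always yields at
    # least one token (the expert's correction or bonus token).
    return (1 - alpha ** (n + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, 4), 2))
```

At alpha = 0.95 (structurally predictable JSON-like output) a 4-token draft yields roughly 4.5 tokens per expert pass, versus under 2 at alpha = 0.5, which is the compounding the post alludes to.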

Tuning knob

"The levers for tuning speculative decoding include the number of future tokens to generate."

N = draft length per verify-pass is the key tuning dial:

  • N too small → verify-pass overhead dominates, little win.
  • N too large → verify-pass cost grows and most drafts are rejected early, wasting compute.
  • Optimal N depends on the drafter/expert pair's empirical acceptance distribution on the workload.
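A toy cost model makes the trade-off visible. Assuming drafting costs a small fixed amount per token, one verify-pass has a fixed cost regardless of N (parallel verification), and each draft token is accepted independently with probability alpha (all numbers hypothetical), the throughput-optimal N can be swept directly:

```python
def throughput(n, alpha, draft_cost=0.05, verify_cost=1.0):
    """Tokens per unit time for draft length n under a hypothetical
    cost model: drafting costs draft_cost per token, one verify-pass
    costs verify_cost, and each draft token is accepted independently
    with probability alpha."""
    expected_tokens = (1 - alpha ** (n + 1)) / (1 - alpha)
    return expected_tokens / (n * draft_cost + verify_cost)

# Sweep N for two workloads: free-form chat vs. structured output.
for alpha in (0.6, 0.9):
    best_n = max(range(1, 17), key=lambda n: throughput(n, alpha))
    print(alpha, best_n, round(throughput(best_n, alpha), 2))
```

The sweep shows the optimum shifting: low-acceptance traffic favours short drafts, while high-acceptance (structured) traffic rewards much longer ones, matching the post's point that the right N is workload-dependent.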

Cloudflare does not disclose their chosen N or observed acceptance rates. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Caveats

  • No acceptance rate disclosed. "Shines on agentic workloads" is qualitative.
  • No end-to-end speedup number isolated to EAGLE-3. The 3× Kimi K2.5 speedup post-launch combines PD disaggregation + session affinity + speculative decoding + configuration tuning.
  • EAGLE internal design not reproduced in this page — wiki bounds itself to what the source discloses + the Hugging Face model card link.
  • Only the Kimi-K2.5 variant is referenced. EAGLE-3 exists for other target models but this post is scoped to Cloudflare's use.

Relationship to other decoding-time primitives

  • Speculative decoding — the family of inference-time latency optimisations EAGLE-3 is a drafter for.
  • Drafter-expert split — the architectural primitive.
  • Token verification — the parallel-forward-pass primitive EAGLE-3 outputs feed.
  • Draft-verify inference — the generalised cheap-proposer + expensive-verifier pattern.
  • Speculative cascades (Google Research, 2025-09-11) — generalisation of the rejection rule from token-exact to probabilistic match; EAGLE-3 in Cloudflare's stack uses the canonical token-exact form as of 2026-04.
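The distinction the last bullet draws can be illustrated with two acceptance rules side by side. The distributions below are toy numbers; the probabilistic rule shown is the standard token-level rejection-sampling form (accept with probability min(1, p_target/p_draft)), which cascade-style methods further generalise.

```python
import random

random.seed(0)

# Toy next-token distributions (hypothetical values).
p_target = {"a": 0.6, "b": 0.3, "c": 0.1}  # expert's distribution
p_draft = {"a": 0.5, "b": 0.4, "c": 0.1}   # drafter's distribution

def accept_token_exact(draft_token, target_token):
    """Canonical token-exact rule: keep the draft only if it matches
    the token the expert itself would have emitted."""
    return draft_token == target_token

def accept_probabilistic(draft_token):
    """Rejection-sampling rule: accept with probability
    min(1, p_target/p_draft), which preserves the expert's output
    distribution while accepting more drafts than exact matching."""
    ratio = p_target[draft_token] / p_draft[draft_token]
    return random.random() < min(1.0, ratio)

print(accept_token_exact("b", "a"))  # False: exact match required
print(accept_probabilistic("a"))     # True: ratio > 1, always accepted
```

Under the exact rule a draft survives only on an identical token; under the probabilistic rule, drafts the expert merely finds plausible can survive, raising acceptance at the cost of a relaxed match criterion.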
