CONCEPT Cited by 1 source

Token-heavy system

Definition

A token-heavy system is an LLM-based application whose per-request token consumption is the load-bearing capacity and cost driver: large enough that feasibility, cost, and context-window fit must be modelled up front rather than discovered at runtime. The term is a self-label Expedia's STAR team uses for its service (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer).

The sibling concept concepts/context-window-as-token-budget is about allocation within a single call; token-heavy system is about aggregate token consumption across the multi-call workflow that makes up a single application request.

Why name it

Naming a system "token-heavy" forces three disciplines up front:

  1. Back-of-the-envelope estimation of the token bill. "We followed a systematic approach for back-of-the-envelope estimation, grounded in facts, assumptions, and enforced limits" — Expedia.
  2. Enforced limits that make the estimate finite. The load-bearing STAR assumption: per-response cap of 4k tokens. Without the cap the worst-case is unbounded; with the cap, the whole chain's token envelope is a finite sum.
  3. Per-model context-window fit check. Window size varies by model and "has been increasing over time" — the fit check is re-done per model evaluated.
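The three disciplines can be sketched as plain arithmetic. Every constant below except the 4k per-response cap (which comes from the STAR write-up) is an illustrative assumption, not an Expedia figure:

```python
# Back-of-the-envelope token estimation for a chained pipeline.
# Only PER_RESPONSE_CAP is sourced; the other numbers are assumptions.

PER_RESPONSE_CAP = 4_000      # enforced output cap (from the STAR write-up)
FIXED_PROMPT = 1_500          # system prompt + per-step chain prompt (assumed)
VARIABLE_PROMPT_MAX = PER_RESPONSE_CAP  # a variable prompt embeds at most one prior response
CHAIN_STEPS = 5               # calls per application request (assumed)

# Worst case for one step: fixed prompt + variable prompt + capped output.
per_step_envelope = FIXED_PROMPT + VARIABLE_PROMPT_MAX + PER_RESPONSE_CAP

# Worst case for the whole request: a finite sum, because of the cap.
request_envelope = per_step_envelope * CHAIN_STEPS

# Per-model context-window fit check (window sizes assumed for illustration,
# and re-checked per model because windows vary and grow over time).
context_windows = {"model-a": 128_000, "model-b": 32_000}
fits = {model: per_step_envelope <= window for model, window in context_windows.items()}

print(request_envelope)  # 47500
print(fits)              # {'model-a': True, 'model-b': True}
```

Note how the cap does double duty: it bounds both the output of each step and the variable prompt of the next, which is what makes the worst case finite.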

Typical shape

| Lever | What to size |
| --- | --- |
| Fixed prompts | System prompt + per-step chain prompts |
| Variable prompts | Prompts whose length depends on prior responses |
| Per-response output cap | Hard cap on model output length |
| Number of chain steps | Multiplies the per-step budget |
| Context window per model | Upper bound on cumulative context any single call can hold |

For a chained pipeline, the token envelope is computable statically once the per-step caps and the step count are fixed. That computability is what makes a token-heavy system feasible to ship without runtime surprises.
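Even in the accumulating case, where every prior response is carried forward in context, the worst case of each call is a closed-form function of the cap and the step index. A sketch under assumed constants (only the 4k cap is from the source):

```python
# Static worst-case context per call when all prior capped responses
# accumulate in context. CAP is sourced; FIXED and STEPS are assumed.
CAP, FIXED, STEPS = 4_000, 1_500, 5

# Call k holds the fixed prompt, k-1 prior capped responses, and its own capped output.
worst_context = [FIXED + (k - 1) * CAP + CAP for k in range(1, STEPS + 1)]

print(worst_context)       # [5500, 9500, 13500, 17500, 21500]
print(max(worst_context))  # 21500 -- the bound to check against each model's window
```

The maximum (the final call) is the number that has to clear the per-model context-window fit check.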

Cost implications

Token consumption is not just a context-window fit question — it is a cost question. Token-heavy systems have aggregate cost that scales linearly with traffic × chain steps × average output length. Per-response caps therefore protect both context-window fit and cost envelope with a single lever.
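That linear scaling makes the cost envelope just as computable as the token envelope. A sketch with wholly illustrative numbers (traffic, token averages, and per-token prices are all assumptions):

```python
# Aggregate cost scales with traffic x chain steps x tokens per call.
# Every number here is an illustrative assumption, including prices.
requests_per_day = 10_000
chain_steps = 5
avg_input_tokens = 2_000
avg_output_tokens = 1_000          # bounded above by the per-response cap
price_in, price_out = 2.5e-6, 10e-6  # $/token, assumed

daily_cost = requests_per_day * chain_steps * (
    avg_input_tokens * price_in + avg_output_tokens * price_out
)
print(f"${daily_cost:,.2f}/day")  # $750.00/day
```

Lowering the per-response cap shrinks both the cost term (fewer output tokens billed) and the context term (smaller variable prompts downstream), which is why one lever protects both envelopes.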

Seen in

  • Expedia STAR (2026-04-28) — canonical wiki instance. Expedia explicitly labels STAR a "token-heavy system" and walks their feasibility process: GPT-4o tokenizer for estimation, fixed + variable prompt lengths, 4k-token per-response cap as the anchor, per-model context-window re-check. First wiki canonicalisation of the token-heavy-system label as a first-class design category.