CONCEPT Cited by 1 source

Token-heavy system

Definition

A token-heavy system is an LLM-based application whose per-request token consumption is the load-bearing capacity and cost driver: large enough that feasibility, cost, and context-window fit must be modelled up front rather than discovered at runtime. The term is a self-label Expedia's STAR team uses for its service (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer).

The sibling concept concepts/context-window-as-token-budget is about allocation within a single call; token-heavy system is about aggregate token consumption across the multi-call workflow that makes up a single application request.

Why name it

Naming a system "token-heavy" forces three disciplines up front:

  1. Back-of-the-envelope estimation of the token bill. "We followed a systematic approach for back-of-the-envelope estimation, grounded in facts, assumptions, and enforced limits" — Expedia.
  2. Enforced limits that make the estimate finite. The load-bearing STAR assumption: per-response cap of 4k tokens. Without the cap the worst-case is unbounded; with the cap, the whole chain's token envelope is a finite sum.
  3. Per-model context-window fit check. Window size varies by model and "has been increasing over time" — the fit check is re-done per model evaluated.
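The three disciplines can be sketched as plain arithmetic. Every constant below except the 4k per-response cap (which comes from the STAR write-up) is an illustrative assumption, not an Expedia figure:

```python
# Back-of-the-envelope token estimation for a chained pipeline.
# Only PER_RESPONSE_CAP is sourced; the other numbers are assumptions.

PER_RESPONSE_CAP = 4_000      # enforced output cap (from the STAR write-up)
FIXED_PROMPT = 1_500          # system prompt + per-step chain prompt (assumed)
VARIABLE_PROMPT_MAX = PER_RESPONSE_CAP  # a variable prompt embeds at most one prior response
CHAIN_STEPS = 5               # calls per application request (assumed)

# Worst case for one step: fixed prompt + variable prompt + capped output.
per_step_envelope = FIXED_PROMPT + VARIABLE_PROMPT_MAX + PER_RESPONSE_CAP

# Worst case for the whole request: a finite sum, because of the cap.
request_envelope = per_step_envelope * CHAIN_STEPS

# Per-model context-window fit check (window sizes assumed for illustration,
# and re-checked per model because windows vary and grow over time).
context_windows = {"model-a": 128_000, "model-b": 32_000}
fits = {model: per_step_envelope <= window for model, window in context_windows.items()}

print(request_envelope)  # 47500
print(fits)              # {'model-a': True, 'model-b': True}
```

Note how the cap does double duty: it bounds both the output of each step and the variable prompt of the next, which is what makes the worst case finite.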

Typical shape

| Lever | What to size |
| --- | --- |
| Fixed prompts | System prompt + per-step chain prompts |
| Variable prompts | Prompts whose length depends on prior responses |
| Per-response output cap | Hard cap on model output length |
| Number of chain steps | Multiplies the per-step budget |
| Context window per model | Upper bound on cumulative context any single call can hold |

For a chained pipeline, the token envelope is computable statically once the per-step caps and the step count are fixed. That computability is what makes a token-heavy system feasible to ship without runtime surprises.
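Even in the accumulating case, where every prior response is carried forward in context, the worst case of each call is a closed-form function of the cap and the step index. A sketch under assumed constants (only the 4k cap is from the source):

```python
# Static worst-case context per call when all prior capped responses
# accumulate in context. CAP is sourced; FIXED and STEPS are assumed.
CAP, FIXED, STEPS = 4_000, 1_500, 5

# Call k holds the fixed prompt, k-1 prior capped responses, and its own capped output.
worst_context = [FIXED + (k - 1) * CAP + CAP for k in range(1, STEPS + 1)]

print(worst_context)       # [5500, 9500, 13500, 17500, 21500]
print(max(worst_context))  # 21500 -- the bound to check against each model's window
```

The maximum (the final call) is the number that has to clear the per-model context-window fit check.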

Cost implications

Token consumption is not just a context-window fit question — it is a cost question. Token-heavy systems have aggregate cost that scales linearly with traffic × chain steps × average output length. Per-response caps therefore protect both context-window fit and cost envelope with a single lever.
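That linear scaling makes the cost envelope just as computable as the token envelope. A sketch with wholly illustrative numbers (traffic, token averages, and per-token prices are all assumptions):

```python
# Aggregate cost scales with traffic x chain steps x tokens per call.
# Every number here is an illustrative assumption, including prices.
requests_per_day = 10_000
chain_steps = 5
avg_input_tokens = 2_000
avg_output_tokens = 1_000          # bounded above by the per-response cap
price_in, price_out = 2.5e-6, 10e-6  # $/token, assumed

daily_cost = requests_per_day * chain_steps * (
    avg_input_tokens * price_in + avg_output_tokens * price_out
)
print(f"${daily_cost:,.2f}/day")  # $750.00/day
```

Lowering the per-response cap shrinks both the cost term (fewer output tokens billed) and the context term (smaller variable prompts downstream), which is why one lever protects both envelopes.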

Seen in

  • Expedia STAR (2026-04-28) — canonical wiki instance. Expedia explicitly labels STAR a "token-heavy system" and walks their feasibility process: GPT-4o tokenizer for estimation, fixed + variable prompt lengths, 4k-token per-response cap as the anchor, per-model context-window re-check. First wiki canonicalisation of the token-heavy-system label as a first-class design category.