CONCEPT
Token-heavy system¶
Definition¶
A token-heavy system is an LLM-based application whose per-request token consumption is the dominant capacity and cost driver: large enough that feasibility, cost, and context-window fit must be modelled up front rather than discovered at runtime. The term is a self-applied label that Expedia's STAR team uses for their service (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer).
The sibling concept concepts/context-window-as-token-budget is about allocation within a single call; token-heavy system is about aggregate token consumption across the multi-call workflow that makes up a single application request.
Why name it¶
Naming a system "token-heavy" forces three disciplines up front:
- Back-of-the-envelope estimation of the token bill. "We followed a systematic approach for back-of-the-envelope estimation, grounded in facts, assumptions, and enforced limits" — Expedia.
- Enforced limits that make the estimate finite. The load-bearing STAR assumption: per-response cap of 4k tokens. Without the cap the worst-case is unbounded; with the cap, the whole chain's token envelope is a finite sum.
- Per-model context-window fit check. Window size varies by model and "has been increasing over time" — the fit check is re-done per model evaluated.
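The first two disciplines reduce to straightforward arithmetic: with a hard per-response cap, the worst-case token bill of the whole chain is a finite sum. A minimal sketch — the 4k cap is from the source, but the prompt token counts below are illustrative assumptions, not Expedia's numbers:

```python
# Worst-case token envelope for a chained pipeline.
# PER_RESPONSE_CAP (4k) is the load-bearing STAR assumption; the
# prompt sizes passed in below are made-up illustrative values.

PER_RESPONSE_CAP = 4_000  # hard cap on model output per call


def worst_case_envelope(fixed_prompt_tokens, step_prompt_tokens):
    """Finite upper bound on total tokens consumed across the chain.

    fixed_prompt_tokens: system prompt sent with every step
    step_prompt_tokens: per-step prompt sizes, one entry per chain step
    """
    total = 0
    for step_prompt in step_prompt_tokens:
        # Each call pays its prompts plus, at worst, a fully capped response.
        total += fixed_prompt_tokens + step_prompt + PER_RESPONSE_CAP
    return total


# A 3-step chain; without the cap this bound would not exist.
print(worst_case_envelope(1_200, [800, 600, 1_500]))  # → 18500
```

Dropping the cap makes `PER_RESPONSE_CAP` unbounded and the sum meaningless, which is exactly the "enforced limits make the estimate finite" point above.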
Typical shape¶
| Lever | What to size |
|---|---|
| Fixed prompts | System prompt + per-step chain prompts |
| Variable prompts | Prompts whose length depends on prior responses |
| Per-response output cap | Hard cap on model output length |
| Number of chain steps | Multiplies the per-step budget |
| Context window per model | Upper bound on cumulative context any single call can hold |
For a chained pipeline, the token envelope is computable statically once the per-step caps and step count are fixed. That computability is what makes a token-heavy system feasible to operate without runtime surprises.
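The per-model fit check in the table's last row can be sketched the same way. Assuming (hypothetically) that the chain carries all prior capped responses forward as variable-prompt input, the worst-case context of a single call is bounded and can be compared against each candidate model's window. The window sizes and model names below are illustrative, not real model specs:

```python
PER_RESPONSE_CAP = 4_000  # hard cap on model output per call

# Illustrative context windows (tokens). Real windows vary by model
# and have been increasing over time, so this table is re-checked
# for each model evaluated.
CONTEXT_WINDOWS = {"model-a": 8_192, "model-b": 128_000}


def worst_case_call_context(fixed, step_prompts, step_index):
    """Largest context one call can need if every prior response
    came back at the full cap and is fed forward as input."""
    prior_outputs = step_index * PER_RESPONSE_CAP
    own_output = PER_RESPONSE_CAP  # window must also hold the generation
    return fixed + step_prompts[step_index] + prior_outputs + own_output


def fits(model, fixed, step_prompts):
    """True if every step of the chain fits the model's window."""
    window = CONTEXT_WINDOWS[model]
    return all(
        worst_case_call_context(fixed, step_prompts, i) <= window
        for i in range(len(step_prompts))
    )


print(fits("model-a", 1_200, [800, 600, 1_500]))  # False: step 2 needs 9800
print(fits("model-b", 1_200, [800, 600, 1_500]))  # True
```

The check is per-model because only the `CONTEXT_WINDOWS` entry changes; the envelope arithmetic stays fixed.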
Cost implications¶
Token consumption is not just a context-window fit question — it is a cost question. Token-heavy systems have aggregate cost that scales linearly with traffic × chain steps × average output length. Per-response caps therefore protect both context-window fit and cost envelope with a single lever.
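The linear scaling above can be made concrete with the same sizing inputs. A sketch, assuming an illustrative per-token price (not a quoted rate from any provider):

```python
def monthly_token_cost(requests_per_month, chain_steps,
                       avg_prompt_tokens, avg_output_tokens,
                       price_per_1k_tokens):
    """Aggregate cost scales with traffic x chain steps x tokens per step.

    price_per_1k_tokens is an illustrative assumption; the per-response
    cap bounds avg_output_tokens, so capping protects this number too.
    """
    tokens_per_request = chain_steps * (avg_prompt_tokens + avg_output_tokens)
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens / 1_000 * price_per_1k_tokens


# 1M requests/month, 3-step chain, outputs assumed at the 4k cap.
print(monthly_token_cost(1_000_000, 3, 2_000, 4_000, 0.01))  # → 180000.0
```

Because `avg_output_tokens` can never exceed the per-response cap, the cap bounds this cost estimate and the context-window envelope with the same lever, as the paragraph above notes.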
Seen in¶
- Expedia STAR (2026-04-28) — canonical wiki instance. Expedia explicitly labels STAR a "token-heavy system" and walks their feasibility process: GPT-4o tokenizer for estimation, fixed + variable prompt lengths, 4k-token per-response cap as the anchor, per-model context-window re-check. First wiki canonicalisation of the token-heavy-system label as a first-class design category.
Related¶
- concepts/context-window-as-token-budget — per-call allocation; sibling at a lower altitude.
- concepts/context-engineering — the discipline of allocating context against the budget.
- concepts/back-of-the-envelope-estimation — the sizing methodology token-heavy systems require up front.
- concepts/prompt-chaining — the orchestration technique that makes token envelopes statically computable.
- systems/expedia-star — canonical wiki consumer.