Back-of-the-envelope estimation¶
Definition¶
Back-of-the-envelope estimation (BOTE) is the engineering discipline of arriving at a rough-order-of-magnitude sizing number for a proposed system — capacity, cost, latency, token budget, storage footprint — from a structured combination of facts, assumptions, and enforced limits, before committing to the design.
The value is not the precision of the number; the value is making the assumptions explicit so they can be challenged, tracked, and revisited when the inputs change. A design that doesn't have an envelope calculation behind it is a design whose feasibility is being discovered at runtime.
Canonical shape¶
- Facts. Known quantities (per-request payload size, model tokenizer output rates, broker throughput per core, disk seek latency).
- Assumptions. Best-guess numbers marked as such (expected QPS, per-customer fanout, cache hit rate).
- Enforced limits. Design-time caps that convert unbounded terms into finite ones (per-response cap, request timeout, max items per page).
The third step is the load-bearing one: without enforced limits, a worst-case estimate is unbounded; with them, the envelope computation is a finite sum.
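A minimal sketch of that shape in Python. Every number below is a hypothetical placeholder, not a measurement from any real system; the point is only that the enforced limit is what lets the worst case collapse to finite arithmetic:

```python
# Envelope structure: facts, assumptions, enforced limits.
# All numbers are hypothetical placeholders for illustration.

FACTS = {
    "avg_prompt_tokens": 1_200,      # measured: tokenizer run on real templates
}
ASSUMPTIONS = {
    "chain_steps_per_request": 5,    # best guess, marked for later revision
    "peak_qps": 200,                 # best guess, marked for later revision
}
ENFORCED_LIMITS = {
    "max_response_tokens": 4_096,    # design-time cap; bounds the worst case
}

# With the cap in place, worst-case tokens per request is a finite sum:
# each chain step contributes one prompt plus one capped response.
worst_case_tokens_per_request = ASSUMPTIONS["chain_steps_per_request"] * (
    FACTS["avg_prompt_tokens"] + ENFORCED_LIMITS["max_response_tokens"]
)
print(f"{worst_case_tokens_per_request:,} tokens/request worst case")  # 26,480
```

Remove the cap and the same expression has no upper bound: the response term becomes whatever the model chooses to emit.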
Expedia STAR — canonical wiki instance¶
"Given STAR is a token-heavy system and in order to understand the feasibility and implications of this, we followed a systematic approach for back-of-the envelope estimation, grounded in facts, assumptions, and enforced limits. We estimated the number of tokens using OpenAI's GPT-4o tokenizer. ... To control the number of tokens we capped each response to 4k tokens. This number was then used for estimation purposes. Based on this analysis and the relatively static nature of the system, we concluded that we can accommodate the context window size." (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer)
Pulling the Expedia numbers apart:
- Facts — GPT-4o tokenizer output rates on Expedia's specific prompt templates.
- Assumptions — typical variable-prompt lengths conditional on expected prior responses.
- Enforced limits — 4k token cap per response.
The 4k cap is what made the estimate computable — it turns the "how much context will the chain accumulate?" question into a static arithmetic one.
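A sketch of that arithmetic, assuming hypothetical prompt sizes and step counts around the one number the write-up does give (the 4k response cap). The 128k window is GPT-4o's commonly cited figure and, per the write-up's own practice, should be re-checked per model:

```python
# Worst-case context accumulation in a chain, under a per-response cap.
# RESPONSE_CAP is from the Expedia write-up; everything else is assumed.

RESPONSE_CAP = 4_096       # enforced limit: 4k tokens per response
SYSTEM_PROMPT = 2_000      # assumed static prompt size
PER_STEP_PROMPT = 500      # assumed variable prompt added at each step
CONTEXT_WINDOW = 128_000   # GPT-4o's commonly cited window; re-check per model

def accumulated_context(steps: int) -> int:
    """Worst case: every prior prompt and capped response stays in context."""
    return SYSTEM_PROMPT + steps * (PER_STEP_PROMPT + RESPONSE_CAP)

for steps in (4, 8, 16):
    used = accumulated_context(steps)
    print(f"{steps:>2} steps: {used:>7,} tokens ({used / CONTEXT_WINDOW:.0%} of window)")
```

Without the cap, `accumulated_context` has no closed form; the answer depends on what each model response happens to contain.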
Why this concept is worth naming¶
Most production systems are sized empirically — ship, measure, grow. LLM applications are different:
- Token cost scales linearly with traffic × chain steps × output length — running experiments to discover the bill is expensive (see the cost sketch after this list).
- Context-window fit is a hard cliff — exceeding the window doesn't degrade, it fails.
- Models change — context windows vary per model and over time, so the estimate has to be re-done on every model swap.
So LLM-era BOTE is both a cost question and a feasibility question. A team that hasn't done it is deferring decisions they can't afford to defer.
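To make the linear cost scaling from the first point concrete, a sketch with assumed traffic, step count, and price. None of these figures come from the source; the structure is the point, since halving any one factor halves the bill:

```python
# Monthly token-cost envelope: traffic x chain steps x tokens x price.
# Every input is an assumption; re-run whenever any of them changes.

REQUESTS_PER_DAY = 100_000     # assumed traffic
CHAIN_STEPS = 5                # assumed LLM calls per request
TOKENS_PER_STEP = 5_300        # assumed prompt + capped response per step
PRICE_PER_1M_TOKENS = 5.00     # assumed blended $/1M tokens; varies by model

monthly_tokens = REQUESTS_PER_DAY * 30 * CHAIN_STEPS * TOKENS_PER_STEP
monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_1M_TOKENS
print(f"~{monthly_tokens:.2e} tokens/month -> ~${monthly_cost:,.0f}/month")
```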
Seen in¶
- Expedia STAR (2026-04-28) — canonical wiki instance. STAR's feasibility process is walked explicitly: facts + assumptions + enforced-limits framework, GPT-4o tokenizer as the unit of measure, 4k per-response cap as the anchor, per-model context-window fit re-check. First wiki canonicalisation of BOTE applied to LLM-era token sizing.
Related¶
- concepts/token-heavy-system — the class of system that makes BOTE on tokens non-optional.
- systems/expedia-star — canonical wiki consumer.