
CONCEPT

Context window as token budget

Definition

The context window supplied to an LLM call is a fixed token budget. Every input the program keeps in that window — user messages, assistant replies, tool descriptions (their JSON schemas), tool outputs, system prompts, summaries — competes for the same limited token space. Past a threshold, "the whole system begins getting nondeterministically stupider." (Source: sources/2025-11-06-flyio-you-should-write-an-agent.)

The Fly.io post's canonical phrasing: "You're allotted a fixed number of tokens in any context window. Each input you feed in, each output you save, each tool you describe, and each tool output eats tokens (that is: takes up space in the array of strings you keep to pretend you're having a conversation with a stateless black box)."
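The "array of strings you keep to pretend you're having a conversation" can be made concrete with a minimal sketch. Everything here is illustrative: `estimate_tokens` is a crude ~4-chars-per-token stand-in (real code would use the provider's tokenizer), and `CONTEXT_BUDGET` and the message/schema contents are made up.

```python
# Sketch: the "conversation" is just a list of dicts, and every entry --
# messages AND tool schemas -- draws down the same fixed token budget.

def estimate_tokens(text: str) -> int:
    # Rough stand-in: ~4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)

CONTEXT_BUDGET = 8_000  # hypothetical window size in tokens

messages = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Summarise the build failure."},
]
# Tool schemas are re-sent on every turn, so they cost tokens before
# any tool is ever called.
tool_schemas = ['{"name": "read_file", "parameters": {...}}']

used = sum(estimate_tokens(m["content"]) for m in messages)
used += sum(estimate_tokens(s) for s in tool_schemas)
remaining = CONTEXT_BUDGET - used
```

The point of the sketch is only that `remaining` shrinks with every message, tool schema, and tool output you keep, regardless of which one it was.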

Why the framing matters

Three practical consequences follow from treating the window as a budget rather than as a memory:

  • Tool descriptions are a line item. Every tool you expose sits in context on every turn, costing tokens before any tool is actually called. A bristling inventory of 50 tool schemas can leave no room to get work done — Fly.io names this directly as the driver of both the tool-selection-accuracy and tool-surface-minimization disciplines. Datadog's MCP-server retrospective independently confirmed the same failure mode (Source: sources/2026-03-04-datadog-mcp-server-agent-tools).
  • Tool outputs eat budget too. A large tool output (a dumped log file, a stack trace, a search-result JSON blob) pushed into context stays there for the rest of the session. This motivates summarisation-as-compression, output-to-file-not-to-context (patterns/untrusted-input-via-file-not-prompt at Datadog was partly motivated by the same budget concern), and per-agent context isolation via sub-agents (patterns/context-segregated-sub-agents).
  • The degradation is nondeterministic. There is no hard cliff at exactly N tokens. Quality degrades as more tokens pile up, and the failure mode looks like the model getting "stupider" — missing instructions, ignoring earlier context, garbling facts it was told. See concepts/context-rot for the related observation that accuracy decays well before the stated window limit.
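The second bullet — keep large tool outputs out of the window — can be sketched as a small helper. The helper name, the `max_inline` threshold, and the pointer format are all hypothetical; the idea is just that a big dump goes to disk and only a short pointer/preview enters the message list.

```python
# Sketch of "output to file, not to context" (hypothetical helper):
# a large tool result is written to a temp file and only a short pointer
# enters the message list, so it doesn't occupy budget for the whole session.
import os
import tempfile

def record_tool_output(messages: list, output: str, max_inline: int = 500) -> None:
    if len(output) <= max_inline:
        # Small outputs are cheap enough to keep inline.
        messages.append({"role": "tool", "content": output})
        return
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w") as f:
        f.write(output)
    # Only a pointer plus a short preview stays in the window.
    messages.append({
        "role": "tool",
        "content": f"[{len(output)} chars written to {path}; "
                   f"preview: {output[:200]!r}]",
    })

messages = []
record_tool_output(messages, "x" * 50_000)  # big dump -> pointer only
```

A sub-agent with its own context array (patterns/context-segregated-sub-agents) is the same move one level up: the raw transcript stays in the sub-agent's window and only a summary crosses back.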

Programming implications

Treating context as a budget turns "context engineering" into a legible programming problem (concepts/context-engineering):

  • Allocate explicitly. Decide up-front how many tokens go to system prompts, tool schemas, conversation history, and tool-output headroom. When the budget runs tight, compress the oldest slice (summarise a sub-conversation into a paragraph).
  • Keep the array small. The "conversation" is a Python list of strings (concepts/agent-loop-stateless-llm); you can filter, compress, re-order, truncate, and splice in summaries — it's just data.
  • Split contexts. Spawn a sub-agent with its own fresh context array for work that doesn't need to sit in the main agent's budget. Return a summary up, not the raw transcript.
  • Trim tool descriptions to what this turn needs. Some agents swap tool allowlists between planning and execution phases so only the relevant subset is in-window at a time.
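The first two bullets — allocate explicitly, and compress the oldest slice when the budget runs tight — can be sketched together. Both helpers are hypothetical: `summarise` stands in for an LLM summarisation call, and `estimate_tokens` is the same crude ~4-chars-per-token approximation as above.

```python
# Sketch: enforce an explicit token budget on the message list by
# compressing the oldest slice into a single summary message.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough stand-in for a real tokenizer

def summarise(messages: list) -> str:
    # Stand-in: a real implementation would call the model here.
    return f"[summary of {len(messages)} earlier messages]"

def enforce_budget(messages: list, budget: int, keep_recent: int = 3) -> list:
    """While over budget, fold everything but the most recent messages
    into one summary message. The list is just data: filter, splice, truncate."""
    while (sum(estimate_tokens(m["content"]) for m in messages) > budget
           and len(messages) > keep_recent + 1):
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        messages = [{"role": "system", "content": summarise(old)}] + recent
    return messages

history = [{"role": "user", "content": "word " * 500} for _ in range(10)]
history = enforce_budget(history, budget=2_000)
```

Tool-allowlist swapping is the same operation applied to the schema slice of the budget: between planning and execution phases, replace the list of tool descriptions rather than the list of messages.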
