CONCEPT Cited by 2 sources
Context rot¶
Context rot is the empirically observed degradation of LLM agent accuracy as the context window fills up over a long-running investigation — especially when much of the added content is tool-call noise rather than task-relevant signal. The term was popularized by TryChroma's research and cited in Dropbox's context-engineering post as the forcing function that motivated Dash's architectural redesign (Source: sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai).
Concrete observation (Dash)¶
"We noticed that the overall accuracy of Dash degraded for longer-running jobs. The tool calls were adding a lot of extra context. We were seeing similar patterns of what's been popularized as context rot."
The failure mode isn't a single hard edge (like exceeding the window). It's a gradient: accuracy slips as accumulated context grows, well before the token limit is hit.
Why it happens (hypotheses)¶
- Attention dilution. More tokens → each individual relevant token gets less attention weight.
- Distractor accumulation. Each extraneous tool output is a decoy the model can latch onto.
- Plan fragmentation. Long chains of tool calls mean the original intent is many turns back in the context; the model drifts.
- Format noise. Verbose, key-repeating JSON tool output burns token budget and confuses the model more than equivalent CSV/YAML.
Mitigations observed in this wiki¶
The antidote is an architectural discipline (concepts/context-engineering) across several axes:
- Shrink tool output. CSV over JSON for tabular data (roughly half the tokens per record, per Datadog), YAML over JSON for nested data (roughly 20% fewer), and trim rarely used fields by default (Source: sources/2026-03-04-datadog-mcp-server-agent-tools).
- Shrink tool inventory. Fewer, more flexible tools reduce both tool-description load and concepts/tool-selection-accuracy pressure (patterns/tool-surface-minimization, patterns/unified-retrieval-tool).
- Shrink retrieved-content breadth. Precomputed relevance (graph + rankings) so only the top slice enters context (patterns/precomputed-relevance-graph).
- Bound tool-call depth. Delegate complex sub-tasks to specialized agents with their own prompts + contexts, so the main agent's context doesn't grow linearly in sub-task verbosity (patterns/specialized-agent-decomposition).
- Paginate by token budget, not fixed record count. Fixed-record pages are unpredictable when record sizes vary wildly (Datadog: log records span 100 B to 1 MB; patterns/token-budget-pagination).
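The format lever above is easy to demonstrate: JSON repeats every key on every record, while CSV states the keys once in a header row. A minimal sketch (the sample records are invented for illustration; character count is used as a rough proxy for token count):

```python
import csv
import io
import json

# Invented sample of tabular tool output (e.g., log query results).
records = [
    {"ts": "2026-03-04T12:00:00Z", "service": "api", "status": 200, "latency_ms": 31},
    {"ts": "2026-03-04T12:00:01Z", "service": "api", "status": 500, "latency_ms": 870},
    {"ts": "2026-03-04T12:00:02Z", "service": "web", "status": 200, "latency_ms": 12},
]

# JSON: keys repeated on every record.
as_json = json.dumps(records)

# CSV: keys appear once, in the header.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=records[0].keys())
writer.writeheader()
writer.writerows(records)
as_csv = buf.getvalue()

print(len(as_json), len(as_csv))  # CSV is markedly smaller; the gap widens with more rows
```

The saving grows with row count, since the per-record key overhead in JSON is paid on every row but the CSV header is paid once.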
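The pagination lever can be sketched as a cursor that closes a page when its estimated token cost would exceed a budget, rather than after a fixed number of records. The 4-characters-per-token heuristic and the function names below are assumptions for the sketch, not Datadog's implementation:

```python
def estimate_tokens(record: str) -> int:
    # Crude heuristic: ~4 characters per token (assumption for this sketch).
    return max(1, len(record) // 4)

def paginate_by_token_budget(records: list[str], budget: int):
    """Yield pages whose estimated token cost stays within `budget`.

    A single oversized record still gets its own page, so the cursor
    always makes progress even when records span 100 B to 1 MB.
    """
    page, used = [], 0
    for rec in records:
        cost = estimate_tokens(rec)
        if page and used + cost > budget:
            yield page
            page, used = [], 0
        page.append(rec)
        used += cost
    if page:
        yield page

# Two small records fit one page; the huge third record forces a new page.
pages = list(paginate_by_token_budget(["a" * 40, "b" * 40, "c" * 4000], budget=25))
```

The design choice here is that the budget bounds what each tool call can inject into the context, independent of how skewed the record-size distribution is.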
Decaying constraint?¶
Some clients are experimenting with writing long tool results to disk rather than inlining them into the window (Cursor's dynamic context discovery, Claude Code's recent changes cited in the Datadog post). If that becomes widespread, format-level token efficiency matters less, but the structural levers (fewer tools, better ranking, specialized agents) don't decay: an agent still has to reason over the token-bearing artifacts.
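A minimal sketch of that handle-based approach (the file layout and function names are invented for illustration, not any client's actual mechanism): the tool writes its full output to local storage and returns only a small reference plus a preview, so the context window carries the handle rather than the payload.

```python
import hashlib
import tempfile
from pathlib import Path

# Local store for full tool results (invented layout for this sketch).
STORE = Path(tempfile.mkdtemp(prefix="tool-results-"))

def stash_tool_result(result: str, preview_chars: int = 200) -> dict:
    """Write a full tool result to disk; return a handle plus a short preview."""
    handle = hashlib.sha256(result.encode()).hexdigest()[:12]
    (STORE / handle).write_text(result)
    return {
        "handle": handle,                  # agent fetches more on demand
        "bytes": len(result.encode()),
        "preview": result[:preview_chars], # only this slice enters the context
    }

def fetch_tool_result(handle: str, start: int = 0, length: int = 1000) -> str:
    """Let the agent read a slice of a stashed result by handle."""
    return (STORE / handle).read_text()[start:start + length]

big = "x" * 50_000          # stands in for a 50 kB tool response
ref = stash_tool_result(big)
```

Only `ref` (a few hundred bytes) reaches the model; the 50 kB payload stays on disk until the agent explicitly pages a slice of it back in.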
Seen in¶
- sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai — named as the observed failure mode for Dash on long-running jobs; forcing function for context-engineering adoption.
- sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — Clemm's companion talk quantifies and reframes the same failure mode on the MCP side: "you're getting a lot of content back, and you're immediately going to fill up that context window. It's going to be very problematic. It's also incredibly slow. So, if you're using MCP with some agents today, even a simple query can take up to 45 seconds." The post also adds a new mitigation lever not previously in this wiki: store tool results locally, not in the context window (let the agent reference them by handle rather than inline them into the budget).
- TryChroma research — source of the term as used in the industry.