
PATTERN

Token-limit-aware feature prioritization

When the input fed to a language model can exceed the model's context window, order the input features by importance before serialisation so that if truncation happens, the least-important features are the ones dropped. Don't compress, don't summarise, don't hash — just order, then let the tokenizer's truncation be the filter.

Intent

Real-world LLM-over-structured-data applications routinely generate prompts that dwarf any plausible context window. The three standard responses are:

  1. Compress / summarise the oversized input with another model before feeding.
  2. Embed + retrieve (RAG) to pull only the top-k most similar chunks.
  3. Truncate — hit the window limit, drop whatever's past it.

Compression and retrieval both introduce a second model and a second failure mode (summariser hallucination, retriever misranking). Naïve truncation fails predictably: whatever the pipeline put last is what gets dropped.

The prioritisation pattern is pure truncation made safe — move truncation's cut point to where it does the least damage by putting the most important content first.

Mechanism

  1. Rank features by importance for the target task — classical feature-importance scoring, domain heuristics, or learned from validation-set ablations.
  2. Serialise the structured input (YAML / JSON / prose) in importance-descending order.
  3. Feed the whole serialised string to the tokenizer.
  4. Truncate at the context-window boundary — only low-importance tail features are lost.

No extra model, no index, no embedding pipeline — the tokenizer's built-in truncation does the work.
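The four steps above can be sketched in a few lines. This is a toy illustration, not the Google pipeline: the feature names, importance scores, and the whitespace "tokenizer" standing in for a real model tokenizer are all assumptions for the example.

```python
def serialize_by_importance(features, importance):
    """Serialise features as `key: value` lines, most important first."""
    ranked = sorted(features, key=lambda k: importance[k], reverse=True)
    return "\n".join(f"{k}: {features[k]}" for k in ranked)


def truncate_to_limit(text, max_tokens):
    """Stand-in for tokenizer truncation: keep the first max_tokens
    whitespace-separated tokens. A real pipeline would call the model
    tokenizer and let it clip at the context-window boundary."""
    return " ".join(text.split()[:max_tokens])


# Hypothetical cluster-state features and an importance ranking
# (step 1 could come from scoring, heuristics, or ablations).
features = {
    "hardware": "tpu-v4",
    "active_jobs": 1200,
    "exec_trace": "step1 step2 step3 step4",
    "notes": "low-priority free-text metadata",
}
importance = {"active_jobs": 0.9, "hardware": 0.7, "exec_trace": 0.4, "notes": 0.1}

prompt = serialize_by_importance(features, importance)
clipped = truncate_to_limit(prompt, max_tokens=6)
# Truncation drops only the low-importance tail: the trailing trace
# steps and the "notes" feature fall past the cut point.
```

Because the ordering is importance-descending, shrinking `max_tokens` never removes a feature before removing every feature ranked below it.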

Canonical wiki instance

Google's 2025-07-29 RLM post is the canonical wiki instance. Each Borg cluster-state data point (x) can carry up to 1M tokens of candidate features (active jobs, execution traces, textual metadata, hardware descriptors, config). The RLM's context window is 8k tokens. Google's pre-processing step reorders features so the most important ones come first — "when the string is truncated to fit the token limit, only the less important features are lost" (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).

When it beats compression / retrieval

  • When the important features are finite and knowable. If you can compute an importance ranking once, you don't need a second model to decide what's relevant per query.
  • When a full-string tokenizer view is what the model expects. Summaries break the LM's expectation about the input surface; truncation leaves the surface identical up to the cut point.
  • When determinism matters. Truncation at a fixed token offset is deterministic and reproducible; a summariser's or retriever's output is not.
  • When inference latency is load-bearing. No extra model call, no vector-DB query — just a substring operation.
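The determinism and latency points can be made concrete: truncating an importance-ordered string is a pure function of its input. The snippet below uses a toy whitespace tokenizer (an assumption; a real pipeline would use the model tokenizer's truncation) to show that repeated runs produce byte-identical prompts, with no model call or index query involved.

```python
def truncate(text: str, max_tokens: int) -> str:
    """Pure substring-style truncation on whitespace tokens (toy
    stand-in for a model tokenizer's built-in truncation)."""
    return " ".join(text.split()[:max_tokens])


# Hypothetical importance-ordered serialisation of a few features.
prompt = "cpu_util: 0.93 mem_util: 0.71 job_count: 1200 notes: misc"

# Same input, same output, every time: no sampling, no retrieval.
runs = {truncate(prompt, 6) for _ in range(1000)}
assert len(runs) == 1
assert truncate(prompt, 6) == "cpu_util: 0.93 mem_util: 0.71 job_count: 1200"
```

A summariser invoked with nonzero temperature, or a retriever whose index is being updated concurrently, gives no such guarantee.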

When retrieval or compression is better

  • Per-query relevance ranking varies sharply. If the most important 8k tokens are different for every query, a static importance ordering is the wrong choice; retrieval wins.
  • Features have nested / referential structure. Flattening a graph into a linear stream already loses information that a retrieval-over-graph approach could preserve.
  • Context ordering affects semantics. If the LM expects chronological or dependency-ordered content, importance-first ordering breaks that contract.
