

Logical component grouping for context budget

Definition

Logical component grouping for context budget is the discipline of partitioning a large transformation corpus into small, semantically coherent groups such that any single LLM invocation's prompt stays inside an empirically measured accuracy sweet-spot band of the context window: not the provider's hard maximum, but the point beyond which accuracy starts to decline.

The primitive

Zalando's Partner Tech team observed, during the toolkit's scale-up from hackathon to production, that "as the input prompt size grew, the transformation accuracy declined". Rather than shrinking the per-file prompt, they shrank the scope: components are partitioned into logical groups ("form, core, etc."), and each group's combined prompt (interface + mapping + examples for every component in the group) is kept between 40K and 50K tokens. The toolkit is invoked once per group per file.
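The "one invocation per group per file" pattern can be sketched as follows. Everything here is an illustrative assumption — the token estimator, the payload field names, and the `llm` callable — since the toolkit's internals are not public:

```python
# Hypothetical sketch of "one invocation per group per file". All names and
# the crude token estimator are illustrative, not Zalando's actual toolkit.

BUDGET_MIN, BUDGET_MAX = 40_000, 50_000  # the empirical sweet-spot band

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text/code.
    return len(text) // 4

def build_group_prompt(group: dict[str, dict]) -> str:
    # Combined payload: interface + mapping + examples for every component
    # in the group.
    sections = [
        f"## {name}\n{c['interface']}\n{c['mapping']}\n{c['examples']}"
        for name, c in group.items()
    ]
    return "\n\n".join(sections)

def transform_file(file_src: str, groups: dict[str, dict], llm) -> str:
    out = file_src
    for group_name, group in groups.items():
        prompt = build_group_prompt(group)
        n = estimate_tokens(prompt)
        if n > BUDGET_MAX:
            raise ValueError(
                f"group '{group_name}' prompt is ~{n} tokens, past the band")
        out = llm(prompt, out)  # one LLM invocation per group, per file
    return out
```

Files that only use components from one group still pass through every group's invocation; in practice a pre-filter on which components a file imports would skip irrelevant groups.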

Why 40–50K instead of the model's hard limit

GPT-4o's context window is much larger than 50K tokens, and OpenAI's documentation does not name a specific accuracy cliff. The band is empirical — Zalando measured accuracy against a validation set and found it declined beyond this zone for their workload. The hard limit is a truncation threshold; the sweet-spot band is an accuracy threshold and varies by task complexity, model, and prompt shape.

The accuracy-vs-context-length curve

The exact curve is not formally disclosed, but the qualitative shape Zalando describes is:

accuracy
  │    ░░░░░░░░░░░░░░
  │  ░░              ░░░
  │ ░                   ░░░░
  │░                        ░░░░░░░
  └──┬─────────┬──────────────────────> context tokens
     ~20K   ~50K    hard limit (~128K+)
    warmup  sweet    gradual decline
            spot

Three regions:

  1. Warmup (below ~20K). Prompt may under-specify the task; accuracy suffers because the model has too little structure.
  2. Sweet spot (~40–50K for Zalando). Enough structure, few enough tokens that attention isn't diluted.
  3. Decline (past ~50K). Accuracy degrades continuously up to the hard limit, for reasons grouped under context rot: attention dilution, position-encoding degradation, needle-in-haystack recall failures, and prompt-injection-style interference between sections.
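One way to locate the band empirically, in line with the validation-set measurement Zalando describes, is a coarse sweep over prompt sizes. The `run_validation` callable and the size grid below are hypothetical stand-ins for running the real model and scoring its transforms:

```python
# Hypothetical sweep to trace the accuracy-vs-context-length curve for one
# workload. `run_validation(size)` stands in for running the model on a
# validation set at that prompt size and scoring the outputs.

def locate_sweet_spot(run_validation,
                      sizes=(10_000, 20_000, 40_000, 50_000, 80_000, 120_000)):
    results = {size: run_validation(size) for size in sizes}
    best = max(results, key=results.get)
    return best, results

# A mock curve with the qualitative shape of the diagram above:
MOCK_CURVE = {10_000: 0.62, 20_000: 0.81, 40_000: 0.93,
              50_000: 0.91, 80_000: 0.74, 120_000: 0.58}
```

With `MOCK_CURVE.get` as the validation callable, the sweep picks 40K; on a real workload the size grid and the scoring function are the parts to tune.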

Design heuristics for grouping

  1. Group by domain cohesion, not by alphabetical order or size. Zalando's example groupings ("form, core") suggest semantic clusters: components likely to appear in the same file land in the same group. This also means the relevant interface + mapping + examples for a given file are concentrated in one group prompt rather than spread across groups.
  2. Size groups to fit the budget, not the library. A library of 30 components might split into 10 groups of 3 (Zalando's rough breakdown) if per-component prompt payload is ~15K tokens; or into 3 groups of 10 if per-component payload is ~5K tokens.
  3. Empirically locate the cliff per workload. 40–50K for Zalando is not portable; it's a function of GPT-4o in 2024, prompt shape, and component complexity. Treat as a tunable parameter, not a universal constant.
  4. Run one group at a time per file. Files that use components from multiple groups will be visited by multiple toolkit invocations; the partial transforms compose cleanly when the transformations are orthogonal (operating on different components). Zalando does not describe cross-group conflicts.
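Heuristics 1 and 2 together amount to a simple packing problem: cluster by domain tag first, then split each cluster so combined payloads stay under the budget. A sketch, with component names, tags, and token counts invented for illustration:

```python
# Hypothetical greedy grouping: cluster components by a domain tag
# (heuristic 1), then split each cluster so the combined payload stays under
# the token budget (heuristic 2). Inputs are illustrative.
from collections import defaultdict

def group_components(components, budget=50_000):
    """components: iterable of (name, domain_tag, payload_tokens) triples.
    Returns a list of (domain_tag, [component names]) groups."""
    by_domain = defaultdict(list)
    for name, tag, tokens in components:
        by_domain[tag].append((name, tokens))

    groups = []
    for tag, members in by_domain.items():
        current, used = [], 0
        for name, tokens in members:
            if current and used + tokens > budget:
                groups.append((tag, current))  # flush the full group
                current, used = [], 0
            current.append(name)
            used += tokens
        if current:
            groups.append((tag, current))
    return groups
```

With ~15K tokens per component and a 50K budget, this yields groups of three per domain, matching the "10 groups of 3" breakdown above.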

Related concepts

  • vs context window as token budget: the parent concept is about fitting inside the hard limit. This concept is about staying inside the accuracy sweet-spot band, well below the hard limit.
  • vs concepts/context-engineering: context engineering is the agent-altitude parent discipline (allocating budget across tool descriptions, history, outputs). This concept is the single-shot-migration specialisation.
  • vs concepts/context-rot: context rot is the failure mode this concept avoids. Grouping is the operational response to the rot curve.
