LLM model drift¶
Definition¶
LLM model drift is the failure mode where a deployed language model's behaviour changes over time — producing different (and often worse) outputs for the same prompt after weeks or months, even without an announced model update. The verbatim framing from Peter Corless (Redpanda, 2026-01-13):
"we've seen that models can actually degrade in performance with usage over time, even over the span of a few months." (Source: sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls)
The post cites arXiv:2307.09009 for the empirical observation. Corless also surfaces a second instance of generation-over-generation drift: GPT-5.1 produced slightly worse results than GPT-5.0 on some evaluations — capability can regress as models are updated.
The API-contract analogy¶
The load-bearing framing in the post is that LLMs break a basic expectation held for every prior serving system:
"Unlike an API, if you fire it up and it works in January, it should still work in June, and provide the same, correct answer each time. With LLMs, each answer is a special snowflake, and those snowflakes can melt over time."
Two separate failure modes are named in the same framing:
- Per-call non-determinism — "each answer is a special snowflake" — the stochastic-decoding behaviour that makes the same prompt produce different outputs on different calls even within the same model version.
- Behavioural drift over time — "those snowflakes can melt over time" — the same prompt-same-model pair starting to produce measurably different or worse outputs as weeks/months pass.
These two are structurally distinct from classic failure modes:
- Distinct from LLM hallucination — which is about factual wrongness in a single output. Drift is about behavioural change over time or versions, even for prompts that don't involve factual claims.
- Distinct from API-surface breaking changes — the API contract (inputs, outputs, return shape) can be unchanged while the model behind it drifts.
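The distinction between the two failure modes can be made operational. A minimal sketch, assuming a hypothetical `call_model(prompt)` client function (not from the Corless post): sample the same prompt many times to capture per-call non-determinism as an output distribution, then compare distributions recorded weeks apart to measure cross-time drift.

```python
import collections

def output_distribution(call_model, prompt, n=20):
    """Sample the same prompt n times and count distinct outputs.

    `call_model` is a hypothetical stand-in for any LLM client call.
    Stochastic decoding usually yields more than one distinct output
    even within a single model version (per-call non-determinism).
    """
    return collections.Counter(call_model(prompt) for _ in range(n))

def drift_score(dist_then, dist_now):
    """Total-variation distance between two output distributions.

    Comparing distributions sampled weeks apart separates cross-time
    drift from ordinary sampling noise: per-call non-determinism keeps
    the distance small, behavioural drift pushes it toward 1.0.
    """
    total_then = sum(dist_then.values())
    total_now = sum(dist_now.values())
    keys = set(dist_then) | set(dist_now)
    return 0.5 * sum(
        abs(dist_then[k] / total_then - dist_now[k] / total_now)
        for k in keys
    )
```

Under this framing, "special snowflakes" show up as entropy within one distribution; "snowflakes melting" shows up as distance between the January and June distributions for the same prompt.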
Why this is a systems-infrastructure problem¶
Corless's framing — "AI models need to be periodically recalibrated" — positions drift as an ongoing operational burden unique to LLM-in-production deployments. Concrete consequences for serving infrastructure:
- Regression tests are non-deterministic. Golden-output tests don't survive non-determinism; evaluation has to be statistical (pass-rate over a suite) not deterministic.
- Performance contracts are stochastic. Promising a specific capability level to a downstream consumer is harder than promising an API's latency SLO.
- Version-pinning doesn't fully fix drift. The Corless post notes drift "with usage over time" — not only across announced version changes.
- Recalibration is a continuous cost. Evaluation, eval-set maintenance, and re-tuning cadence become part of the production-LLM TCO.
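The first consequence above (statistical rather than golden-output evaluation) can be sketched as a pass-rate regression gate. Everything here is illustrative: `run_case` is a hypothetical callable wrapping one LLM attempt plus a grader, and the tolerance value is an arbitrary placeholder, not anything the post specifies.

```python
def pass_rate(run_case, cases, trials=5):
    """Pass rate over an eval suite.

    Each case is attempted several times because a single call proves
    nothing under stochastic decoding; `run_case(case)` is a
    hypothetical callable returning True/False for one attempt.
    """
    attempts = [run_case(case) for case in cases for _ in range(trials)]
    return sum(1 for ok in attempts if ok) / len(attempts)

def check_no_regression(current_rate, baseline_rate, tolerance=0.05):
    """Statistical contract: the current pass rate may wobble within
    `tolerance` of the recorded baseline, but a larger drop flags
    possible drift and fails the gate."""
    return current_rate >= baseline_rate - tolerance
```

Run on a schedule against a pinned eval suite, this turns "recalibration is needed" into a concrete trigger: the gate failing is the signal that some re-tuning step is due.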
The Corless post defers specific streaming-infrastructure remediation to Parts 3-4 of the series ("AI observability and evaluation" and "Real-time streaming & AI"). Part 1 names drift as a capability gap, not a solution.
Caveats¶
- Empirical claim is single-sourced to arXiv 2307.09009. The 2307.09009 paper (Chen, Zaharia, Zou — "How Is ChatGPT's Behavior Changing over Time?", Stanford/UC Berkeley, 2023) documented drift empirically; subsequent commentary has debated measurement methodology and the degree to which the reported regressions reflect model changes vs evaluation-harness artifacts. Treat the magnitude as contested.
- GPT-5.1 < GPT-5.0 claim cites a YouTube video. Corless links "some recent evaluations" to a single video URL; the wiki treats the GPT-5.1 regression as reported but not replicated by a rigorous benchmark.
- Drift mechanisms are not unpacked. Possible causes — RLHF reward-hacking, safety-tuning tradeoffs, rate-limiting that changes effective context use, serving-stack optimisations that change token sampling — are not distinguished in the post.
- "Special snowflake" framing conflates two things. The per-call non-determinism of decoding is a separate phenomenon from cross-time drift; the post uses one metaphor for both.
- Recalibration is named, not defined. The post asserts recalibration is needed without specifying what it is — continued pretraining, fine-tuning refresh, decoding-policy tuning, prompt-engineering rework, or eval-set retuning are all candidates.
- No quantitative drift numbers on this page. Corless doesn't cite delta-percentage figures; the 2307.09009 paper has them but they aren't reproduced here.
Seen in¶
- 2026-01-13 Redpanda — The convergence of AI and data streaming, Part 1: The coming brick walls (Peter Corless) (sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls) — canonical: model drift as the "snowflakes melt over time" failure mode orthogonal to hallucination; GPT-5.1 < GPT-5.0 cited as a generation-over-generation regression example; operational recalibration named as the ongoing remediation.
Related¶
- concepts/llm-hallucination — the factual-wrongness failure mode; drift is orthogonal.
- concepts/frontier-model-batch-training-boundary — the structural reason drift can't be fixed by simply streaming new data into the live model: training is batch.
- concepts/llm-training-data-exhaustion — companion brick wall from the same post.
- systems/transformer — the architecture primitive under drifting frontier models.
- companies/redpanda — the company whose blog series canonicalises this framing.