CONCEPT Cited by 1 source
LLM training-data exhaustion¶
Definition¶
LLM training-data exhaustion is the claim, formalised by Epoch AI (2024), that frontier language models are approaching the upper bound of the publicly available, human-generated training corpus on the open web. The scaling curve for usable public data has a top, and frontier models are within striking distance of it. Verbatim from Corless's 2026-01-13 Redpanda post:
"frontier AI models are running out of public data to train against. This means that AI is facing a brick wall problem. That there's a top to the S-curve, at least in terms of public training data." (Source: sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls)
This is the first of three brick walls Corless names for frontier AI; the other two are training-cost growth under the law of diminishing returns, and the batch-training boundary that limits how real-time data can reach the model.
The scale paradox — petabytes vs zettabytes¶
The training-data S-curve is measured in petabytes (10^15). The world's data production is measured in zettabytes (10^21) — six orders of magnitude larger. Verbatim numerics from the post:
- "just one mobile network, AT&T, sent more than one exabyte (10^18) of data in 2025."
- "all data generated and stored around the world is measured in zettabytes (10^21) … By the end of 2025, we generated an estimated 180 zettabytes in just one year and stored more than 200 zettabytes of total data globally."
- CAGR of 78% from 2016 to 2025 (derived by Corless from the figures above).
- Projected Yottabyte Era (10^24): 2028-2030.
- Projected Ronnabyte Era (10^27): ~2040s (with expected S-curve roll-off).
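As a sanity check on the post's arithmetic (the 180 ZB and 78% CAGR figures are industry estimates quoted by Corless, not measurements), both the six-order-of-magnitude gap and the Yottabyte-Era projection can be reproduced in a few lines:

```python
import math

# Sanity-check the post's scale arithmetic. Inputs are the post's
# quoted industry estimates, not independently verified figures.
PB = 10**15  # petabyte: rough scale of frontier text-training corpora
ZB = 10**21  # zettabyte: scale of annual global data generation

# Petabyte corpora vs zettabyte global data: a 10^6 gap.
orders_of_magnitude = math.log10(ZB / PB)
print(orders_of_magnitude)  # 6.0

# At 78% CAGR, how long until annual generation (180 ZB in 2025)
# crosses 1 YB = 1000 ZB?
years = math.log(1000 / 180) / math.log(1.78)
print(round(2025 + years))  # 2028, matching the post's 2028-2030 window
```

The same formula run backwards (180 / 1.78^9 ≈ 1 ZB for 2016) is presumably how Corless derived the 78% figure from the endpoint estimates.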
The post's structural framing: frontier LLMs today consume only the tiny sliver of public-and-ethically-available data out of this global pool. The sliver is capped; the pool keeps growing; the delta is overwhelmingly private — IIoT, banking, cybersecurity, mobile devices, doorbell cameras, etc.
Two structural consequences¶
(1) The copyright-and-consent front. Public data isn't automatically legally safe to train on. The post points at a wave of lawsuits: "multiple parties, from individual creators and coders to small publishers, large companies and consortia pushing back and suing, everywhere, all over the place on the unauthorized use of their copyrighted material. There are even AI lawsuit case trackers you can follow like a new sports league." Even when data is ethically sourced and legally scraped, "that does not prevent it from being used with nefarious intent."
(2) The pull toward private-data reservoirs. Verbatim: "This is where the AI systems want to go next: from publicly scrapable data to the vast oceans of private data reservoirs. This is where the war of the frontier AIs is being fought." Examples named: corporate IIoT, banking systems, cybersecurity telemetry, "your phone. Your kid's phone. Your camera doorbell." The structural tension: private data is orders of magnitude larger than public data, but access requires consent, contracts, and (per the post's Part 1 framing) streaming-style integration because much of it is generated continuously in real time — not as a historical corpus.
Why this is a brick wall¶
The training-data S-curve bounds the pre-training corpus that drives frontier model capability:
- Bigger models need more data. Chinchilla-scale compute-data trade-offs require corpus growth commensurate with parameter growth.
- Public-corpus growth is bounded by the top of the S-curve. Epoch AI's 2024 analysis places current frontier models near that top.
- Synthetic data has its own failure modes. Model collapse under recursive self-training is a known failure mode the post implicitly references via "hallucinations" and "slop" (post wording: "It doesn't fix human factors issues of poisoning, slop, or misuse.").
- The delta lives in private hands. Accessing private data at scale is a separate problem class — legal, organisational, infrastructural.
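The Chinchilla trade-off in the first bullet can be made concrete with the widely cited ~20-tokens-per-parameter rule of thumb from Hoffmann et al. (2022) (an approximation, not a figure from the post):

```python
def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training tokens for a given
    parameter count, using the ~20:1 Chinchilla rule of thumb."""
    return params * tokens_per_param

# Token demand scales linearly with model size, so each jump in
# parameter count eats a correspondingly larger slice of the
# (bounded) public corpus.
for params in (70e9, 400e9, 1e12):
    print(f"{params:.0e} params -> {chinchilla_optimal_tokens(params):.0e} tokens")
```

The 20:1 ratio is itself compute-budget-dependent; the point here is only the linear coupling between parameter growth and corpus demand that makes a capped corpus a binding constraint.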
Caveats¶
- Epoch AI projection is interpretive. The 2024 Epoch AI analysis that Corless overlays is a projection based on estimates of scrapable web-text and usable tokens at quality thresholds. Projections can shift as data-quality thresholds change, as multimodal corpora (video, audio) enter the picture, and as non-English corpora scale.
- Petabyte-scale ceiling is text-centric. Video-generation and multimodal foundation models consume corpora substantially larger than text-only LLMs. The brick wall is sharpest for text pre-training; multimodal ceilings are looser.
- Synthetic data is a live research area. The post implies synthetic data doesn't fix the problem ("doesn't fix fundamental issues … poisoning, slop") but the empirical state of synthetic-data-at-scale is evolving; the brick-wall framing is a 2024-2025 snapshot.
- Private-data access is not a direct training-data unlock. The post asserts the industry wants private data, not that private data is readily usable at training-corpus scale — consent, privacy, DLP (data-loss prevention), and streaming-ingest architecture all stand between private-data-in-existence and private-data-in-a-training-corpus.
- Global-data-volume numbers are industry estimates. The 180 ZB/year and >200 ZB total figures come from Cybersecurity Ventures / industry-analyst sources; CAGR 78% is derived by Corless from those estimates.
- This doesn't address whether more data yields better models. The companion claim in the post — "a bigger model doesn't always mean better results" + GPT-5.1 measurably worse than GPT-5.0 + embedding-dimension diminishing returns — cuts the other way: even if data were unbounded, capability growth isn't guaranteed.
- Vendor framing. Corless is a Redpanda developer advocate arguing (across a four-part series) that streaming infrastructure is the unlock for private-data training. The brick-wall framing is real industry knowledge; the streaming-unlock is forward-looking vendor positioning.
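The model-collapse failure mode flagged under the synthetic-data caveat can be illustrated with a toy recursive-fitting loop (a minimal sketch of the effect, not anything from the post): fit a Gaussian to samples drawn from the previous generation's fit, and the spread collapses as tail mass is lost each round.

```python
import random
import statistics

def refit(mean: float, stdev: float, n: int, rng: random.Random):
    """One 'generation': sample n points from the current model
    N(mean, stdev), then refit the model to those samples alone --
    the analogue of training on your own synthetic output."""
    samples = [rng.gauss(mean, stdev) for _ in range(n)]
    return statistics.fmean(samples), statistics.pstdev(samples)

rng = random.Random(0)
mean, stdev = 0.0, 1.0
for generation in range(200):
    mean, stdev = refit(mean, stdev, n=10, rng=rng)

# The fitted spread collapses toward zero: each refit slightly
# underestimates the true spread, and the losses compound.
print(f"stdev after 200 generations: {stdev:.2e}")
```

This is the distributional intuition behind "slop": recursive self-training narrows the model toward its own modes, which no amount of additional synthetic volume repairs.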
Seen in¶
- 2026-01-13 Redpanda — The convergence of AI and data streaming, Part 1 (Peter Corless) (sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls) — canonical: the brick-wall framing, the petabyte-vs-zettabyte scale paradox, the public-to-private-data pivot.
Related¶
- concepts/s-curve-limits — the broader framing of diminishing returns that this concept is a specific instance of.
- concepts/frontier-model-batch-training-boundary — the structural reason private data is hard to ingest: training is offline batch, private data is live streams.
- concepts/llm-hallucination — the failure mode the post says more data doesn't fix.
- systems/transformer — the architecture primitive whose scaling curve this bounds.
- companies/redpanda — the company whose blog series canonicalises this framing.