
REDPANDA 2026-01-13


Redpanda — The convergence of AI and data streaming, Part 1: The coming brick walls

Summary

First instalment of a four-part industry-commentary series by Peter Corless (Redpanda), distilled from his talk at the AI-by-the-Bay conference in Oakland. The post is Tier-3 vendor-voice thought leadership — no production numbers, no architecture diagrams of a shipping system, heavy on industry narrative and a running "d20 test" gag for image-generation models. It passes the Tier-3 filter on vocabulary grounds: the post canonicalises an industry-wide framing that every frontier LLM today is fundamentally a batch-trained system (GPT-4/5, Gemini, Grok, Claude) and argues that the next wave of AI capability will require grafting real-time streaming data into a substrate architected for offline batch.

Corless names three "brick walls" frontier AI is running into:

  1. Ethically sourced public training data is running out — Epoch AI (2024) projects frontier models near the top of the public-data S-curve measured in petabytes, while the world generates ~180 zettabytes/year and stores >200 ZB, most of it private.
  2. Training cost grows ~260% per year, with frontier-model training projected to exceed $1B by 2027 per Epoch AI, bounded by the Law of Diminishing Returns.
  3. Real-time training capability is limited — RAG and MCP expose real-time data at inference time, but pre-training and RLHF remain offline batch pipelines.

The post also canonicalises two named architectural shapes of frontier LLMs — Mixture of Experts (MoE) with concrete real-world parameter counts (GPT-4 = 8 × 220B per George Hotz's 2023 disclosure; Gemini MoE since 1.5; Grok MoE since Grok-1) versus Dense Transformer (Anthropic's Claude remains a single model) — plus the observation that embedding dimensionality hits diminishing returns past ~1,536 dimensions (citing Supabase's pgvector post). It also names LLM model drift over time: GPT-5.1 measurably worse than GPT-5.0 on some evals, and "models can actually degrade in performance with usage over time, even over the span of a few months" (arXiv 2307.09009).
Frames the "data scientists vs data engineers" organisational silo with a reference to Jesse Anderson's Data Teams. Parts 2-4 are promised to cover adaptive LLM strategies, AI observability/evaluation, and real-time streaming + AI, respectively; Part 1 is the problem statement, not the solution.

Key takeaways

  1. Every frontier LLM today is batch-trained. Verbatim: "regardless of their dense or MoE architectures, they're still all batch trained." Corless lands on this as the structural premise for the entire four-part series: to push past the coming brick walls, the industry will need to find ways to use real-time streaming data in training / retraining / fine-tuning, not just at inference time. Canonicalised on the wiki as concepts/frontier-model-batch-training-boundary — the structural property of the current LLM generation that separates the training side (offline batch) from the serving side (can reason on real-time data via RAG or MCP) and why closing the gap is the stated trajectory of data-streaming infrastructure.

  2. Frontier LLMs are almost all Mixture of Experts by 2025; Claude is the notable dense-transformer exception. Verbatim disclosures with canonical dates:

  3. GPT-4: "it was leaked by George Hotz (geohotz) in 2023 that OpenAI's GPT-4 was actually not a single 1.76 trillion parameter model, but 8 × 220 billion parameter models running in parallel. We can presume something similar is true of GPT-5 and 5.1."
  4. Google Gemini: "We also know Google Gemini has been an MoE since 1.5" (cited Google blog post from Feb 2024).
  5. xAI Grok: "Grok has been an MoE since Grok-1" (cited Cameron Wolfe Substack).
  6. Anthropic Claude: "However, Anthropic Claude remains a single model, known as a Dense Transformer." Extends the wiki's prior MoE canonicalisation — Pinterest's per-task MMoE in recsys — with the per-token LLM MoE deployment shape and real-world parameter counts. The Claude-as-Dense-Transformer exception gets its own wiki page: concepts/dense-transformer.

  7. GPT-1 to GPT-5 is five orders of magnitude of parameter growth in eight years. Verbatim: "GPT-1 'only' had 117 million parameters. It's now estimated GPT 5/5.1 could grow anywhere upwards of 50 trillion parameters and 400,000 token context windows. That's five orders of magnitude larger in eight years." Applied to Transformer as the architecture reference — the scaling-curve numerics that drive the three brick walls.

  8. Brick wall #1 — ethically-sourced public training data is running out. Verbatim framing per Epoch AI (2024): "frontier AI models are running out of public data to train against. This means that AI is facing a brick wall problem. That there's a top to the S-curve, at least in terms of public training data." Scale contrast verbatim: "it's talking about data on orders of magnitude of petabytes (10^15). Yet just one mobile network, AT&T, sent more than one exabyte (10^18) of data in 2025. Now, all data generated and stored around the world is measured in zettabytes (10^21)… By the end of 2025, we generated an estimated 180 zettabytes in just one year and stored more than 200 zettabytes of total data globally." CAGR 78% 2016→2025; projected Yottabyte Era 2028-2030. The vast majority is in private hands (IIoT, banking, cybersecurity, phones, doorbell cameras). Canonicalised as concepts/llm-training-data-exhaustion — the public-data-S-curve + private-data-reservoir framing that drives both the next wave of copyright lawsuits and the industry's pull toward streaming-exposed private data.

  9. Brick wall #2 — training cost grows ~260% per year, projected >$1B by 2027. Verbatim: "Epoch AI predicts that training frontier models will cost over a billion dollars by 2027" + "a consideration of the Y-axis shows the cost to train AI models is actually growing around 260% annually. While this isn't a hard brick wall, it'll be governed by the Law of Diminishing Returns." Reinforces the S-curve-limits framing — the post's recurring meta-claim that every growth axis in AI (data volume, parameter count, training cost, capability uplift) is heading into diminishing returns. "it might be simply infeasible economically, as well as producing a computationally negligible return on investment." Cites Nature April 2025 projection — "Data centres will use twice as much energy by 2030 — driven by AI" — as the energy-consumption corollary.

  10. Brick wall #3 — real-time training/retraining is the missing capability. Verbatim: "While they can increasingly access and reason upon data presented in real time, such as scouring social media video and the latest posts and newsfeeds, or accessing a database in a RAG or MCP architecture, this is at inference time. Their extensive pre-training and much of their fine-tuning, such as Reinforced Learning from Human Feedback (RLHF), is still inherently offline, batch-mode oriented." Canonicalised as concepts/rlhf-offline-batch — RLHF as the named fine-tuning pipeline that currently sits on the batch side of the training-serving boundary. Also names RLHF's own failure modes verbatim: "RLHF also has numerous limitations; some are tractable, others are fundamental and inherent to it, including misalignment and safety" (cites arXiv 2307.15217).

  11. Bigger model ≠ better results: diminishing returns in embedding dimensionality and model generations. Verbatim: "They realized that going up to or even beyond 1,536 vector embedding dimensions can have diminishing returns. GPT-5.1 actually produced slightly worse results than GPT-5.0 on some recent evaluations. Plus, we've seen that models can actually degrade in performance with usage over time, even over the span of a few months." Cites Supabase's pgvector blog post + arXiv 2307.09009 (LLM drift). Two separate canonicalisations: concepts/embedding-dimension-diminishing-returns (the vector-dimensionality ceiling) + concepts/llm-model-drift (the behavioural-drift-over-time failure mode, the model-API contract analogue verbatim: "Unlike an API, if you fire it up and it works in January, it should still work in June, and provide the same, correct answer each time. With LLMs, each answer is a special snowflake, and those snowflakes can melt over time."). Both extend the wiki's pre-existing LLM hallucination canonical coverage with failure modes orthogonal to factual-wrongness.

  12. The data-science vs data-engineering organisational silo is the pre-requisite to cross. Verbatim: "AI and data streaming are all too often worlds apart. The data scientists working on AI and the platform, and data engineers working on data streaming technologies, typically live on separate ends of the business campus. The world of AI is based on batch data training and often unstructured documents. The world of data streaming is tied more to structured data and databases." Cites Jesse Anderson's Data Teams. Frames the socio-technical half of the convergence thesis: the architectural convergence follows the organisational one. Part 4 of the series is promised as the streaming-specific payoff; parts 2-3 cover adaptive LLM strategies and AI observability/evaluation respectively.
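
Takeaways 2-6 describe the MoE shape at a high level; a toy sketch can make the per-token routing concrete. This is an illustrative top-k router over toy feed-forward "experts", not any lab's disclosed architecture: every size and weight below is made up (the 8 experts only echo the leaked GPT-4 count).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2  # toy sizes; 8 echoes the leaked GPT-4 expert count

# Each "expert" is a toy feed-forward weight matrix; the router is a linear gate.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(token):
    """Route one token vector to its top-k experts and weight-sum their outputs."""
    gate = softmax(router.T @ token)             # (n_experts,) routing probabilities
    chosen = np.argsort(gate)[-top_k:]           # indices of the top-k experts
    weights = gate[chosen] / gate[chosen].sum()  # renormalise over the chosen experts
    return sum(w * (experts[i].T @ token) for i, w in zip(chosen, weights))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

A dense transformer, by contrast, pushes every token through one shared feed-forward block; MoE grows the total parameter count while spending only top_k experts' worth of compute per token.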

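The embedding-dimensionality claim in takeaway 11 is easy to demo in miniature: with a toy corpus, nearest-neighbour retrieval saturates well before the full dimensionality is used. The data here is purely synthetic; this illustrates the diminishing-returns shape, not Supabase's pgvector methodology.

```python
import numpy as np

rng = np.random.default_rng(1)
n_docs, full_dim = 200, 1536  # 1536 matches the dimensionality the post cites

docs = rng.standard_normal((n_docs, full_dim))
query = docs[0] + 0.1 * rng.standard_normal(full_dim)  # query is a noisy copy of doc 0

def top1(q, d, dim):
    """Cosine-similarity nearest neighbour using only the first `dim` dimensions."""
    q, d = q[:dim], d[:, :dim]
    sims = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))

hits = {dim: top1(query, docs, dim) for dim in (16, 64, 256, 1536)}
print(hits)  # doc 0 is recovered long before all 1,536 dimensions are in play
```
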
Operational / architectural numbers

  • GPT-1 → GPT-5 parameter growth: 117M → ~50T, five orders of magnitude in 8 years (2017-2025).
  • GPT-5 context window (estimate): 400,000 tokens.
  • GPT-4 architecture (leaked, 2023): 8 × 220B parameter experts = 1.76T nominal total, running as MoE (George Hotz disclosure).
  • Gemini MoE since: version 1.5 (Feb 2024).
  • Grok MoE since: Grok-1.
  • Anthropic Claude: Dense Transformer, still single model.
  • Embedding dimensionality ceiling: diminishing returns past 1,536 dimensions (Supabase pgvector).
  • Training cost growth: ~260% per year, projected >$1B by 2027 (Epoch AI).
  • Data-centre energy consumption: projected 2× by 2030, driven by AI (Nature, April 2025).
  • Global data volume 2025: ~180 ZB/year generated, >200 ZB stored, CAGR 78% 2016-2025.
  • Yottabyte Era (projection): 2028-2030.
  • Ronnabyte Era (projection): ~2040s (post notes expected S-curve).
  • AT&T mobile network (2025): >1 exabyte of data sent in 2025 — single-carrier figure dwarfing the petabyte-scale LLM-training-data ceiling.
  • Global AI spend 2025: $1.5 trillion.
  • Models that passed the d20 test: Gemini 3.0 Thinking (once, not consistently); Nano Banana Pro (failed a rerun). All others — ChatGPT 5.x, Midjourney, Meta AI, Grok, Claude, Google Veo — "fail" per Corless's running blog thread.
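
A few of the figures above can be cross-checked with quick arithmetic (the ~$100M baseline and the 2.6x-per-year reading of "260%" are assumptions for illustration, not figures from the post):

```python
import math

# "Five orders of magnitude": GPT-1 at 117M parameters vs the ~50T estimate.
orders = math.log10(50e12 / 117e6)
print(round(orders, 2))  # ~5.63, consistent with "five orders of magnitude"

# 78% CAGR over 2016-2025 (9 compounding years) back-solves to roughly
# 1 ZB generated in 2016 if 2025 generated ~180 ZB.
print(round(180 / 1.78**9, 2))

# Reading "grows ~260% per year" as a 2.6x annual multiplier (one plausible
# reading), a hypothetical ~$100M training run crosses $1B in about 2.4 years.
years_to_1b = math.log(1e9 / 100e6) / math.log(2.6)
print(round(years_to_1b, 1))
```
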

Systems, concepts, and patterns extracted

  • concepts/frontier-model-batch-training-boundary: the offline-batch training side vs the real-time-capable serving side of current LLMs.
  • concepts/dense-transformer: the Claude-as-single-model exception to the MoE trend.
  • concepts/llm-training-data-exhaustion: the public-data S-curve plus private-data-reservoir framing.
  • concepts/rlhf-offline-batch: RLHF as the named fine-tuning pipeline on the batch side of the boundary.
  • concepts/embedding-dimension-diminishing-returns: the ~1,536-dimension vector ceiling.
  • concepts/llm-model-drift: behavioural degradation over time, orthogonal to factual-wrongness.

Caveats

  • Tier-3 industry-commentary voice. Peter Corless is a Redpanda developer advocate; the post is a published blog version of a conference talk. No production numbers from a shipping Redpanda system; the post explicitly defers streaming-specific payoff to Part 4.
  • Hearsay primary sources. Key numerics are cited via second-hand reporting: GPT-4 = 8 × 220B is the George Hotz leak ("leaked"), not an OpenAI disclosure; Gemini / Grok MoE-since dates are cited from blog posts; the GPT-5 = ~50T parameter count is "estimated". Treat all frontier-model numerics as directional, not authoritative.
  • No Redpanda-specific architectural content in Part 1. The post is framing; the "batch training is the brick wall, streaming is the unlock" argument is made at rhetorical altitude without a concrete streaming-into-training architecture.
  • d20 test is a running gag, not a rigorous eval. The post treats Corless's image-generation prompt as an "evaluation opener" to introduce the topic; no claim of methodological rigor. The d20 thread is mainly external LinkedIn-post links.
  • Epoch AI projections are interpretive. The "top of the S-curve" + "$1B by 2027" + "260% annual growth" numbers come from Epoch AI's external analysis; Corless overlays the current-ChatGPT point on their chart. The underlying projections are model-dependent and the post doesn't interrogate the assumptions.
  • Embedding-dimension ceiling is single-sourced. The "1,536 dimensions" claim cites Supabase's pgvector optimisation post — a vendor post about vector-database performance rather than a foundational paper.
  • Private-data ethics argument is stated, not analysed. The transition from "public data exhaustion" to "private data reservoirs" is narrated, not structurally analysed — the question of how private data would be ingested into training at scale is deferred.
  • No comparison to DeepSeek / Mistral / Qwen MoE. The MoE landscape disclosure covers GPT-4, Gemini, Grok, Claude only. Mistral's Mixtral (per-token sparse MoE), DeepSeek's MoE, Qwen MoE are all absent from the post.
  • RLHF is named but not walked through. Corless cites arXiv 2307.15217 for RLHF limitations; the post doesn't walk through RLHF's actual pipeline shape (reward model + PPO / DPO / GRPO, etc.) or the specific "misalignment and safety" modes it gestures at.
  • Series promises vs post delivery. The post promises "how, increasingly, real-time data enrichment and streaming are needed to take the AI industry to new levels of capabilities" — that substance is explicitly pushed to Parts 2-4. Part 1 is the problem statement only.
