
CONCEPT

The log is the truth, the database is a cache

Definition

The truth is the log; the database is a cache of a subset of the log. This framing, popularised by Martin Kleppmann's 2014 talk "Turning the database inside out" and stated almost verbatim in Pat Helland's CIDR 2015 paper "Immutability Changes Everything", inverts the usual database-first mental model: the authoritative source of record for an organisation's data is a replayable, immutable, append-only log of events; every database, search index, cache, feature store, or warehouse downstream is a materialised view of some subset of that log.

Canonical citation on the wiki

Alex Gallego (Redpanda founder) cites this framing as the explicit premise for starting Redpanda (Source: Gallego 2025-04-03):

"The systems community implemented a re-playable log of events with microbatching. Think of it as microservices, consuming and producing to stable APIs like RabbitMQ, Apache Kafka®, Redpanda. When people were done with it, it felt like turning a database inside out, where most businesses looked like control plane databases or simply views of the log."

"The truth is the log. The database is a cache of a subset of the log."

"I started Redpanda with the premise that batch was a historical artefact due to a lack of mental tools and, in part, a lack of intuitive industrial streaming implementations that offer a new way of reasoning about the world."

Consequences of the inversion

When the log is authoritative and the database is derived:

  1. Schema evolution is log-first. Schema changes become events in the log, not schema migrations against a master store; downstream materialisations pick up the new shape when they're ready.
  2. Point-in-time reconstruction is free. Any view can be rebuilt by replaying from offset 0; backfills and new consumers are architecturally symmetric.
  3. Systems become views. ClickHouse, Snowflake, Elasticsearch, Redis, Postgres read replicas — all are "cached projections" of some topic-level subset.
  4. Operational boundary flips. The log is the durability substrate; downstream stores are optimisation / query-shape concerns. This is why streaming brokers historically wanted to be the system of record, not a buffer in front of one.
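Points 2 and 3 can be sketched in a few lines. This is a minimal illustration, not any broker's API: an append-only log is authoritative, and a key-value "database" is just a projection rebuilt by replaying from offset 0 (all class and field names are invented for the example).

```python
# Log-as-truth sketch: the append-only event log is authoritative;
# the dict returned by materialise() is the "database", a cache
# rebuilt by replaying the log from offset 0.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Event:
    key: str
    value: int

@dataclass
class Log:
    events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> int:
        """Append an event and return its offset."""
        self.events.append(event)
        return len(self.events) - 1

def materialise(log: Log, upto: int | None = None) -> dict[str, int]:
    """Rebuild the 'database' view by replaying from offset 0.
    Passing `upto` gives point-in-time reconstruction: only the
    first `upto` events are replayed."""
    view: dict[str, int] = {}
    for event in log.events[:upto]:
        view[event.key] = event.value  # last-write-wins projection
    return view

log = Log()
log.append(Event("balance:alice", 100))
log.append(Event("balance:bob", 50))
log.append(Event("balance:alice", 75))

print(materialise(log))          # {'balance:alice': 75, 'balance:bob': 50}
print(materialise(log, upto=2))  # {'balance:alice': 100, 'balance:bob': 50}
```

Note the symmetry the list describes: a brand-new consumer and a backfill are the same operation, a replay from offset 0 into a fresh view.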

Two-layer materialisation: log + lakehouse

The 2020s version of the inversion adds a second axis: not just databases-as-cache-of-log, but Apache Iceberg tables as the cold-storage projection of the log, with the hot streaming topic as the real-time tail. Iceberg topics, where a single logical entity is simultaneously a Kafka topic and an Iceberg table, are the structural realisation of this: you get log-as-truth plus lakehouse-as-query-engine with no external ETL (patterns/streaming-broker-as-lakehouse-bronze-sink).
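The two-layer read path can be simulated in plain Python. This is a hedged sketch of the shape of the pattern, not of any real Iceberg-topics implementation: a cold "table" snapshot serves the backfill, and the consumer switches to the hot tail at the snapshot's high-water offset (the `Topic` class and its methods are invented for the example).

```python
# Two-layer materialisation sketch: one logical topic, two read paths.
# Offsets [0, snapshot_end) stand in for the Iceberg (cold) projection;
# everything after snapshot_end is the hot streaming tail.
class Topic:
    def __init__(self) -> None:
        self.log: list[str] = []   # the authoritative append-only log
        self.snapshot_end = 0      # high-water offset of the cold table

    def produce(self, record: str) -> None:
        self.log.append(record)

    def commit_snapshot(self) -> None:
        """Project all records so far into the cold table."""
        self.snapshot_end = len(self.log)

    def cold_scan(self) -> list[str]:
        """Query-engine path: read the table snapshot (backfill)."""
        return self.log[: self.snapshot_end]

    def hot_tail(self, from_offset: int) -> list[str]:
        """Streaming path: read records at and after `from_offset`."""
        return self.log[from_offset:]

topic = Topic()
for r in ["a", "b", "c"]:
    topic.produce(r)
topic.commit_snapshot()    # "a", "b", "c" now visible to the lakehouse
topic.produce("d")         # so far lands only in the hot tail

backfill = topic.cold_scan()               # ['a', 'b', 'c']
tail = topic.hot_tail(topic.snapshot_end)  # ['d']
assert backfill + tail == topic.log        # no external ETL, no gap, no overlap
```

The design point is the handoff: because both layers are projections of the same log, a consumer can read the cold snapshot and then resume from a known offset without a separate reconciliation job.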

Relationship to continuous computation

If the log is truth, then "batch" is just continuous computation observed through a temporal window. Gallego's next move, canonicalised as concepts/continuous-computation-convergence, is that modern query engines (Databricks, Snowflake, BigQuery) absorb the batch-vs-streaming split once they can read the same Iceberg table: use the lakehouse for backfill and the low-latency stream for tailing.
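The claim that batch is a windowed view of continuous computation reduces to a small equivalence: a batch job that replays a window of the log at once and a streaming fold that updates per event produce the same answer. A purely illustrative sketch with made-up data:

```python
# Batch vs streaming over the same log window: same events, same result.
events = [("t1", 10), ("t2", 5), ("t3", 7)]  # (timestamp, amount)

# Batch: replay the whole window in one pass.
batch_total = sum(amount for _, amount in events)

# Streaming: incremental fold, one event at a time.
stream_total = 0
for _, amount in events:
    stream_total += amount

assert batch_total == stream_total == 22
```

Once both modes read the same log (or the same Iceberg projection of it), choosing between them becomes a latency and cost decision rather than an architectural one.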
