SYSTEM Cited by 1 source

R2 Data Catalog (Cloudflare managed Iceberg)

R2 Data Catalog is Cloudflare's managed Apache Iceberg service — the cold/warm storage tier of Town Lake. Tables live as Parquet files on R2 under Iceberg's open table format; the catalog is operated by Cloudflare so internal teams don't run their own Iceberg metastores.

What Iceberg buys Town Lake

From the launch post:

  • Schema evolution — add/remove/rename columns without rewriting data files.
  • Time travel — query historical snapshots.
  • Partition evolution — change partitioning strategy without full-table rewrites.
  • Recompaction as data ages: "per-minute usage from last week becomes hourly, hourly from last quarter becomes daily, etc." The recency-tiered re-aggregation is the storage-cost-vs-fidelity dial.
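
The recompaction bullet can be sketched as a small re-aggregation pass. This is an illustrative sketch only, not Town Lake's implementation: the thresholds (7 days, 90 days), the (timestamp, usage) row shape, and the function names are assumptions chosen to mirror the quoted "per-minute to hourly to daily" tiers.

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical age thresholds loosely following the quoted tiers.
HOURLY_AFTER = timedelta(days=7)   # older than a week: hourly buckets
DAILY_AFTER = timedelta(days=90)   # older than a quarter: daily buckets


def bucket_start(ts, now):
    """Truncate ts to the granularity that its age calls for."""
    age = now - ts
    if age >= DAILY_AFTER:
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if age >= HOURLY_AFTER:
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(second=0, microsecond=0)


def recompact(rows, now):
    """Sum usage per bucket for an iterable of (timestamp, usage) rows."""
    totals = defaultdict(int)
    for ts, usage in rows:
        totals[bucket_start(ts, now)] += usage
    return sorted(totals.items())
```

Recent rows keep per-minute resolution, while older rows collapse into fewer, coarser buckets. That is the sense in which storage cost falls with recency while the data stays queryable.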

The framing is explicit: "R2 Data Catalog, our managed Apache Iceberg service, is where the cold and warm data lives... The storage cost decreases as recency does, while the data stays queryable. Parquet files in R2 are much cheaper compared to keeping the same data in an OLAP database."

Position in Town Lake

R2 Data Catalog is the storage tier that Town Lake's other components read from and write to:

  • Trino queries Iceberg tables on R2 alongside Postgres + ClickHouse + BigQuery in a single plan.
  • Ingestion writes Parquet files into Iceberg tables — "an orchestrator runs as a long-lived Kubernetes deployment, reads pipeline configs, and spawns short-lived worker jobs to extract from Postgres or ClickHouse, transform to Parquet, and load into R2 as Iceberg tables." Each pipeline runs as either full-replace or incremental-append.
  • Transformer runs ELT DAGs that produce Iceberg tables (with definitions stored in R2 alongside the data).
  • Future direction: "R2 SQL, Cloudflare's serverless, distributed, analytics query engine, is getting more and more robust by the day. As its feature set expands, we plan to move many parts of Town Lake's workflow over to it." R2 SQL is the named successor query layer for some of Trino's responsibilities.
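
The two pipeline modes named in the ingestion bullet, full-replace and incremental-append, differ only in how a freshly extracted batch is applied to the target table. The sketch below models that difference against an in-memory stand-in. The `Table` class, the `load` function, and the `updated_at` high-water-mark cursor are all hypothetical; the launch post does not describe how incremental batches are delimited.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """In-memory stand-in for an Iceberg table (hypothetical, illustration only)."""
    rows: list = field(default_factory=list)


def load(table, batch, mode, cursor_key="updated_at"):
    """Apply one extracted batch to the table using the given load mode."""
    if mode == "full-replace":
        # Each run overwrites the table with the complete extract.
        table.rows = list(batch)
    elif mode == "incremental-append":
        # Assumed strategy: append only rows past the current high-water mark.
        high_water = max((r[cursor_key] for r in table.rows), default=None)
        table.rows.extend(
            r for r in batch if high_water is None or r[cursor_key] > high_water
        )
    else:
        raise ValueError(f"unknown mode: {mode}")
    return table
```

Full-replace is simpler and self-healing but rewrites everything on each run; incremental-append writes less but depends on a reliable cursor column in the source.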

Relation to broader R2 substrate

R2 Data Catalog is a managed-service shape on top of R2, not a separate storage product. It joins R2's growing list of substrate roles documented in this wiki:

  • Tier 0 workspace for Project Think agents.
  • Managed-storage backend for AI Search instances.
  • Pack-file snapshot store for Artifacts repos.
  • Email-attachment store for Agentic Inbox.
  • Filesystem-mount target via sandbox.mountBucket() in Sandbox SDK.
  • Managed-Iceberg-catalog tier for Town Lake (this entry).

The pattern — "single primitive (R2) wrapped in domain-specific managed services" — is the platform-coherence shape.

Caveats

The launch post discloses what Iceberg gives Town Lake and the recompaction shape, but does not disclose:

  • R2 Data Catalog as a customer-facing product (whether it's available outside Cloudflare's internal use).
  • Cost numbers for the lakehouse vs the previous external-vendor state.
  • Iceberg metadata-size pressure (the parts-mutex contention that ClickHouse surfaces in Ready-Analytics has no obvious analogue here, but Iceberg metadata growth at scale is a known operational concern that the post does not address).
  • Compaction throughput / latency / cost numbers.
