
CONCEPT Cited by 1 source

Iceberg file-based catalog

An Iceberg file-based catalog is a catalog-integration shape in which an Apache Iceberg-writing producer publishes its table's current snapshot pointer as a metadata file directly in object storage, and Iceberg-aware readers open the table by pointing at that metadata file's object key, bypassing the REST catalog protocol entirely.

Source: sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc.

The shape

Where REST catalog sync stores the table's current-snapshot pointer in a managed HTTP service (Snowflake Open Catalog, Databricks Unity, AWS Glue) that readers authenticate against, a file-based catalog stores the pointer as an ordinary JSON file in the same object store that holds the data. A reader with read access to the bucket can discover and query the table by referencing the latest vN.metadata.json file directly.
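As a minimal sketch of what "referencing the latest vN.metadata.json" involves on the reader side, the helper below picks the highest-numbered metadata file out of a list of object keys. The function name and the sample keys are hypothetical, and it assumes the keys have already been listed with whatever object-store client the reader uses.

```python
import re
from typing import Iterable, Optional

# Matches metadata names of the form ".../v12.metadata.json" (the vN pattern above).
_METADATA_RE = re.compile(r"(?:^|/)v(\d+)\.metadata\.json$")


def latest_metadata_key(keys: Iterable[str]) -> Optional[str]:
    """Return the key with the highest vN version, or None if no key matches."""
    best_version = -1
    best_key = None
    for key in keys:
        match = _METADATA_RE.search(key)
        if match is None:
            continue  # not a vN.metadata.json file (data files, manifests, ...)
        version = int(match.group(1))  # compare numerically so v10 beats v9
        if version > best_version:
            best_version, best_key = version, key
    return best_key


# Hypothetical listing of one table's prefix.
sample_keys = [
    "tables/orders/metadata/v1.metadata.json",
    "tables/orders/metadata/v10.metadata.json",
    "tables/orders/metadata/v9.metadata.json",
    "tables/orders/data/part-0001.parquet",
]
print(latest_metadata_key(sample_keys))
```

The version is compared as an integer rather than as a string, since a lexicographic sort would rank v9 above v10.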

Canonical reader-side integration is Google BigQuery's CREATE EXTERNAL TABLE primitive (verbatim from the source):

CREATE EXTERNAL TABLE YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME
WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'
OPTIONS (
  format = 'ICEBERG',
  metadata_file_paths = ['gs://your-bucket-name/path/to/your/iceberg/table/metadata/vX.metadata.json']
);
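Because the statement above hard-codes a single metadata file, re-pointing the table means issuing it again with a new path. The sketch below renders that statement from Python. The function name is hypothetical, the option names are copied from the source's statement rather than checked against current BigQuery documentation, and OR REPLACE is added here as an assumption so the statement can be re-run against an existing table.

```python
def render_external_table_ddl(table: str, connection_id: str, metadata_uri: str) -> str:
    """Render a BigQuery external-table statement pinned to one metadata file."""
    if not metadata_uri.endswith(".metadata.json"):
        raise ValueError("expected a path ending in .metadata.json")
    # Values are placed inside single-quoted SQL literals, so reject embedded quotes.
    for value in (connection_id, metadata_uri):
        if "'" in value:
            raise ValueError("single quotes are not allowed in identifiers or paths")
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE {table}\n"
        f"WITH CONNECTION '{connection_id}'\n"
        "OPTIONS (\n"
        "  format = 'ICEBERG',\n"
        f"  metadata_file_paths = ['{metadata_uri}']\n"
        ");"
    )


# Placeholder names, in the same style as the statement above.
ddl = render_external_table_ddl(
    "YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME",
    "YOUR_FULL_CONNECTION_ID",
    "gs://your-bucket-name/path/to/your/iceberg/table/metadata/v42.metadata.json",
)
print(ddl)
```

The rendered text would then be submitted through whichever BigQuery client or console the operator already uses; that step is left out here.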

The pattern also applies to Amazon Athena external Iceberg tables, Snowflake external-stage Iceberg reads, and standalone Spark / Trino / DuckDB sessions with iceberg.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog or equivalent "point at a metadata JSON" configuration.
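For Spark specifically, the Apache Iceberg documentation expresses a Hadoop-style catalog as `spark.sql.catalog.<name>` properties. The sketch below builds that property set as a plain dictionary. The function name, catalog name, and warehouse URI are hypothetical, and whether a given producer's bucket layout matches what HadoopCatalog expects under the warehouse path is an assumption to verify, not something the source states.

```python
from typing import Dict


def hadoop_catalog_conf(catalog_name: str, warehouse_uri: str) -> Dict[str, str]:
    """Spark properties for an Iceberg catalog backed by files in object storage."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Registers the catalog name with Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # "hadoop" selects the file-based HadoopCatalog instead of a REST or Hive catalog.
        f"{prefix}.type": "hadoop",
        # Root path under which the catalog looks for table directories.
        f"{prefix}.warehouse": warehouse_uri,
    }


conf = hadoop_catalog_conf("lake", "gs://your-bucket-name/path/to/warehouse")
for key, value in conf.items():
    print(f"{key}={value}")
```

Each pair can be passed to a Spark session builder or to `spark-submit --conf`; no catalog service URL or token appears anywhere in the configuration, which is the point of the file-based shape.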

Relationship to the broker-owned object-store catalog fallback

Redpanda 25.1 GA canonicalised a "built-in object-store-based catalog" as a fallback when no REST catalog is configured — "suitable for ad hoc access by data engineers when no REST catalog is available." The 2025-05-13 BYOC tutorial reframes this as a primary integration option for engines (BigQuery) that read Iceberg directly from a metadata pointer rather than speak the REST-catalog protocol:

"Direct integration with popular REST catalogs like Snowflake Open Catalog, or with Iceberg clients like Google BigQuery via a file-based catalog."

Whether the object-store fallback and the file-based catalog are the same broker-owned mechanism or distinct shapes is not clarified by the BYOC post. The wiki treats them as aspects of the same underlying property: Iceberg metadata lives in object storage as a JSON file, and any reader with bucket access can walk it.

Trade-offs vs REST catalog

| Axis | File-based catalog | REST catalog |
|---|---|---|
| Snapshot discovery | Reader sees a static pointer; must re-point on new snapshot | Reader re-queries catalog; always sees latest |
| Auth | Object-store IAM (bucket-level) | OIDC token against catalog service |
| ACL granularity | Object-key-scoped | Table-scoped |
| Cross-engine consistency | Each engine maintains its own metadata-file pointer | Single source of truth |
| Catalog availability | No separate service to fail | Catalog is a write-path dependency |
| Auto-registration | Reader must know the metadata path | Auto-register works |

When file-based catalog wins

  • Reader doesn't speak REST catalog protocol. BigQuery, Athena, Spark HadoopCatalog configs — pointing at a metadata JSON is native; paying for a REST catalog just to query the table is overhead.
  • Single-reader workload. When only one engine reads the table, the cross-engine-consistency property of a REST catalog isn't load-bearing.
  • Customer owns the bucket and wants direct-read access without a middleman. The BYOC-data-ownership framing: customer-owned bucket + customer-owned query engine = no need for a Redpanda-operated catalog endpoint in the query path.
  • Catalog availability is a write-path liability the operator wants to avoid. REST catalogs couple producer availability to catalog availability; the file-based shape decouples them (at the cost of losing cross-engine consistency).

When REST catalog wins

  • Multi-engine / multi-writer workload where snapshot consistency across readers matters.
  • Organisations with table-level ACL requirements — object-scoped IAM can't express "Alice can read the Orders table but not the PII-Orders table".
  • Discovery: REST catalogs list tables; file-based catalogs don't — readers have to know the metadata-path convention.
  • Schema federation across cloud accounts / clouds — Unity / Polaris / Glue are designed for this; file-based catalogs aren't.

Costs / caveats

  • No auto-refresh on new snapshots. BigQuery external tables have to be re-created (for example with CREATE OR REPLACE EXTERNAL TABLE), or have their table definition updated, to see a newer vN.metadata.json. Verbatim from the source: "update the external table definition in BigQuery if the location of the latest metadata file changes or you want to query a newer snapshot of the table data." This makes the file-based shape semi-static; it's not a live-streaming reader.
  • No table-level ACL. Object-key-scoped IAM is the policy surface. Fine-grained per-column, per-row, or time-travel policies aren't expressible.
  • No cross-writer serialisation. Multiple producers writing concurrently to the same Iceberg table without a REST-catalog-mediated commit protocol can overwrite each other's metadata pointers; safe concurrent multi-writer access requires the REST-catalog path with optimistic-concurrency commits.
  • Schema-evolution visibility across engines. Each reader holds its own metadata-file pointer; a producer's schema change must be explicitly re-pointed for each reader.
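The cross-writer caveat above can be illustrated with a toy in-memory model. It is not Iceberg's or Redpanda's commit code: `MetadataStore` and both write methods are hypothetical stand-ins, with `blind_write` playing the role of an unconditional object PUT and `cas_write` the role of a catalog-mediated optimistic-concurrency commit.

```python
from typing import Dict, List, Tuple


class MetadataStore:
    """Toy model: version number -> snapshot list recorded in that metadata file."""

    def __init__(self) -> None:
        self.files: Dict[int, List[str]] = {1: []}
        self.current = 1  # the "current snapshot pointer"

    def read(self) -> Tuple[int, List[str]]:
        return self.current, list(self.files[self.current])

    def blind_write(self, base_version: int, snapshots: List[str]) -> None:
        """Unconditional write of v(base+1), even if another writer got there first."""
        new_version = base_version + 1
        self.files[new_version] = snapshots
        self.current = max(self.current, new_version)

    def cas_write(self, base_version: int, snapshots: List[str]) -> bool:
        """Commit only if the pointer still equals the version the writer read."""
        if self.current != base_version:
            return False
        self.files[base_version + 1] = snapshots
        self.current = base_version + 1
        return True


def run_blind() -> List[str]:
    store = MetadataStore()
    base_a, snaps_a = store.read()
    base_b, snaps_b = store.read()  # both writers start from v1
    store.blind_write(base_a, snaps_a + ["A"])
    store.blind_write(base_b, snaps_b + ["B"])  # silently replaces A's v2
    return store.read()[1]


def run_cas() -> List[str]:
    store = MetadataStore()
    base_a, snaps_a = store.read()
    base_b, snaps_b = store.read()
    store.cas_write(base_a, snaps_a + ["A"])
    if not store.cas_write(base_b, snaps_b + ["B"]):
        base_b, snaps_b = store.read()  # conflict detected: re-read and retry
        store.cas_write(base_b, snaps_b + ["B"])
    return store.read()[1]


print(run_blind())  # ['B']  (writer A's snapshot is lost)
print(run_cas())    # ['A', 'B']
```

In the blind case both writers produce the same next version and the second overwrites the first; with the compare-and-swap check the second writer sees the conflict, rebases on the new state, and both snapshots survive.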

Seen in
