CONCEPT Cited by 1 source
Iceberg file-based catalog¶
An Iceberg file-based catalog is a catalog-integration shape in which an Apache Iceberg-writing producer publishes its table's current snapshot pointer as a metadata file directly in object storage, and Iceberg-aware readers open the table by pointing at that metadata file's object key, bypassing the REST catalog protocol entirely.
Source: sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc.
The shape¶
Where REST catalog sync stores
the table's current-snapshot pointer in a managed HTTP service
(Snowflake Open Catalog, Databricks Unity, AWS Glue) that readers
authenticate against, a file-based catalog stores the pointer as an
ordinary JSON file in the same object store that holds the data. A
reader with read access to the bucket can discover and query the
table by referencing the latest vN.metadata.json file directly.
The canonical reader-side integration is
Google BigQuery's CREATE EXTERNAL TABLE statement (verbatim from
the source):
CREATE EXTERNAL TABLE YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME
WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'
OPTIONS (
format = 'ICEBERG',
metadata_file_paths = ['gs://your-bucket-name/path/to/your/iceberg/table/metadata/vX.metadata.json']
);
The pattern also applies to Amazon Athena external Iceberg
tables, Snowflake external-stage Iceberg reads, and standalone
Spark / Trino / DuckDB sessions with
iceberg.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog or
equivalent "point at a metadata JSON" configuration.
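To make the "reference the latest vN.metadata.json" step concrete, the following is a minimal Python sketch, not Redpanda's or BigQuery's tooling. It assumes metadata files follow the vN.metadata.json naming described above and that the caller already has a listing of object keys. The function names, the bucket, the table name, and the connection ID are hypothetical placeholders. It picks the key with the highest N (numerically, so v10 sorts after v9) and renders the statement quoted above for it.

```python
import re

# Assumption: metadata files are named vN.metadata.json, as described on this page.
_METADATA_RE = re.compile(r"(?:^|/)v(\d+)\.metadata\.json$")


def latest_metadata_key(keys):
    """Return the key with the highest N among vN.metadata.json keys, or None."""
    best_key, best_version = None, -1
    for key in keys:
        match = _METADATA_RE.search(key)
        if match and int(match.group(1)) > best_version:
            best_key, best_version = key, int(match.group(1))
    return best_key


def external_table_ddl(table, connection_id, metadata_uri):
    """Render the BigQuery statement quoted above for one metadata file.

    Values are interpolated as-is; this sketch does no quoting or validation.
    """
    return (
        f"CREATE EXTERNAL TABLE {table}\n"
        f"WITH CONNECTION '{connection_id}'\n"
        "OPTIONS (\n"
        "format = 'ICEBERG',\n"
        f"metadata_file_paths = ['{metadata_uri}']\n"
        ");"
    )


# Hypothetical bucket listing; a real one would come from the object store's API.
keys = [
    "tables/orders/metadata/v1.metadata.json",
    "tables/orders/metadata/v10.metadata.json",
    "tables/orders/metadata/v9.metadata.json",
    "tables/orders/data/part-0.parquet",
]
latest = latest_metadata_key(keys)
ddl = external_table_ddl(
    "my_project.my_dataset.orders",   # hypothetical table name
    "my_connection_id",               # hypothetical connection ID
    f"gs://example-bucket/{latest}",  # hypothetical bucket
)
print(ddl)
```

Comparing the version as an integer rather than sorting keys as strings is the one design choice that matters here: a lexicographic sort would rank v9 above v10.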
Relationship to the broker-owned object-store catalog fallback¶
Redpanda 25.1 GA canonicalised a "built-in object-store-based catalog" as a fallback when no REST catalog is configured — "suitable for ad hoc access by data engineers when no REST catalog is available." The 2025-05-13 BYOC tutorial reframes this as a primary integration option for engines (BigQuery) that read Iceberg directly from a metadata pointer rather than speak the REST-catalog protocol:
"Direct integration with popular REST catalogs like Snowflake Open Catalog, or with Iceberg clients like Google BigQuery via a file-based catalog."
Whether the object-store fallback and the file-based catalog are the same broker-owned mechanism or distinct shapes is not clarified by the BYOC post. The wiki treats them as aspects of the same underlying property: Iceberg metadata lives in object storage as a JSON file, and any reader with bucket access can walk it.
Trade-offs vs REST catalog¶
| Axis | File-based catalog | REST catalog |
|---|---|---|
| Snapshot discovery | Reader sees a static pointer; must re-point on new snapshot | Reader re-queries catalog; always sees latest |
| Auth | Object-store IAM (bucket-level) | OIDC token against catalog service |
| ACL granularity | Object-key-scoped | Table-scoped |
| Cross-engine consistency | Each engine maintains its own metadata-file pointer | Single source of truth |
| Catalog availability | No separate service to fail | Catalog is a write-path dependency |
| Auto-registration | Reader must know the metadata path | Auto-register works |
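The "Snapshot discovery" row implies that something on the reader side has to notice when its pointer has fallen behind. A minimal Python sketch of that check follows. It assumes the vN.metadata.json naming used on this page and a caller-supplied listing of object keys; the function names and paths are hypothetical.

```python
import re

# Assumption: metadata files are named vN.metadata.json, as described on this page.
_VERSION_RE = re.compile(r"(?:^|/)v(\d+)\.metadata\.json$")


def metadata_version(path):
    """Extract N from a path ending in vN.metadata.json, or return None."""
    match = _VERSION_RE.search(path)
    return int(match.group(1)) if match else None


def needs_repoint(registered_path, listed_keys):
    """True when the listing holds a newer vN.metadata.json than the reader's pointer."""
    registered = metadata_version(registered_path)
    if registered is None:
        raise ValueError(f"not a vN.metadata.json path: {registered_path}")
    versions = [v for v in map(metadata_version, listed_keys) if v is not None]
    return bool(versions) and max(versions) > registered


# Hypothetical reader state and bucket listing.
registered = "gs://example-bucket/tables/orders/metadata/v3.metadata.json"
listing = [
    "tables/orders/metadata/v2.metadata.json",
    "tables/orders/metadata/v3.metadata.json",
    "tables/orders/metadata/v4.metadata.json",
]
print(needs_repoint(registered, listing))
```

With a REST catalog this check is unnecessary, because the reader asks the catalog for the current pointer on each query.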
When file-based catalog wins¶
- Reader doesn't speak REST catalog protocol. BigQuery, Athena, Spark HadoopCatalog configs: pointing at a metadata JSON is native; paying for a REST catalog just to query the table is overhead.
- Single-reader workload. When only one engine reads the table, the cross-engine-consistency property of a REST catalog isn't load-bearing.
- Customer owns the bucket and wants direct-read access without a middleman. The BYOC-data-ownership framing: customer-owned bucket + customer-owned query engine = no need for a Redpanda-operated catalog endpoint in the query path.
- Catalog availability is a write-path liability the operator wants to avoid. REST catalogs couple producer availability to catalog availability; the file-based shape decouples them (at the cost of losing cross-engine consistency).
When REST catalog wins¶
- Multi-engine / multi-writer workload where snapshot consistency across readers matters.
- Organisations with table-level ACL requirements — object-scoped IAM can't express "Alice can read the Orders table but not the PII-Orders table".
- Discovery: REST catalogs list tables; file-based catalogs don't — readers have to know the metadata-path convention.
- Schema federation across cloud accounts / clouds — Unity / Polaris / Glue are designed for this; file-based catalogs aren't.
Costs / caveats¶
- No auto-refresh on new snapshots. BigQuery external tables have to be re-created (for example with CREATE OR REPLACE EXTERNAL TABLE) or have their table definition updated to see a newer vN.metadata.json. Verbatim from the source: "update the external table definition in BigQuery if the location of the latest metadata file changes or you want to query a newer snapshot of the table data." This makes the file-based shape semi-static; it's not a live-streaming reader.
- No table-level ACL. Object-key-scoped IAM is the policy surface. Fine-grained per-column, per-row, or time-travel policies aren't expressible.
- No cross-writer serialisation. Multiple producers writing concurrently to the same Iceberg table without a REST-catalog-mediated commit protocol can overwrite each other's metadata pointers; safe concurrent multi-writer access requires the REST-catalog path with optimistic-concurrency commits.
- Schema-evolution visibility across engines. Each reader holds its own metadata-file pointer; a producer's schema change must be explicitly re-pointed for each reader.
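The "No cross-writer serialisation" caveat can be illustrated with a toy in-memory model. This is a sketch of the general lost-update problem and of a compare-and-swap commit, not of Redpanda's or any catalog's actual implementation. The PointerStore class, its methods, and the file names are all hypothetical, and the model is single-threaded.

```python
class PointerStore:
    """In-memory stand-in for wherever the current-metadata pointer lives."""

    def __init__(self, pointer):
        self.pointer = pointer

    def blind_write(self, new_pointer):
        # Uncoordinated shape: whoever writes last wins.
        self.pointer = new_pointer

    def compare_and_swap(self, expected, new_pointer):
        # Optimistic-concurrency commit: succeeds only if nobody committed first.
        if self.pointer != expected:
            return False
        self.pointer = new_pointer
        return True


# Two writers both read v1 as their base before committing.
store = PointerStore("v1.metadata.json")
base_a = base_b = store.pointer

# Uncoordinated: writer B silently replaces writer A's commit.
store.blind_write("v2-from-A.metadata.json")
store.blind_write("v2-from-B.metadata.json")
lost_update = store.pointer  # A's snapshot is no longer reachable from the pointer

# Coordinated: B's commit is rejected because its base is stale.
store = PointerStore("v1.metadata.json")
a_ok = store.compare_and_swap(base_a, "v2-from-A.metadata.json")
b_ok = store.compare_and_swap(base_b, "v2-from-B.metadata.json")
print(lost_update, a_ok, b_ok, store.pointer)
```

In the coordinated case writer B would re-read the pointer, rebase its changes on A's commit, and retry; that retry loop is omitted here.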
Seen in¶
- sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc: canonical wiki disclosure. Redpanda 25.1 BYOC-beta Iceberg Topics walkthrough uses the file-based catalog via BigQuery CREATE EXTERNAL TABLE on a GCS-hosted Iceberg metadata JSON. Framed as an alternative to REST-catalog sync for Iceberg clients like BigQuery that read directly from a metadata pointer.
Related¶
- concepts/iceberg-catalog-rest-sync — the sibling catalog shape (REST catalog protocol) that a file-based catalog replaces or complements.
- concepts/iceberg-topic · systems/redpanda-iceberg-topics — the broker-native producer that can target either catalog shape.
- systems/apache-iceberg — the table format.
- systems/google-bigquery: the canonical file-based-catalog reader the source demoes.
- systems/google-cloud-storage — the object store hosting the metadata JSON in the demo.
- concepts/byoc-data-ownership-for-iceberg — the BYOC context in which file-based catalog is most often preferred.
- patterns/external-table-over-iceberg-metadata-pointer: the pattern that consumes a file-based catalog from the query-engine side.
- concepts/open-table-format — broader architectural context.