

External table over Iceberg metadata pointer

External table over Iceberg metadata pointer is the query-engine-side pattern where a query engine (BigQuery, Athena, Snowflake external stage, Trino with HadoopCatalog, DuckDB) registers an Apache Iceberg table by pointing at a specific metadata JSON file in object storage — using a CREATE EXTERNAL TABLE or equivalent DDL — rather than authenticating against a REST catalog service.

Source: sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc.

The shape

Three moving parts:

  1. A producer writes Parquet data files + Iceberg manifest + Iceberg metadata JSON to an object-storage bucket. Canonical producer in the source: Redpanda Iceberg Topics with the file-based catalog option enabled. Any Iceberg-writing producer works (Spark, Flink, Iceberg Java SDK, Connect sinks).
  2. A metadata pointer — the vN.metadata.json object key at a known bucket path.
  3. A query engine registers the table via CREATE EXTERNAL TABLE ... format = 'ICEBERG' metadata_file_paths = [...] (BigQuery) or equivalent DDL (Athena, Snowflake, Trino).

Canonical BigQuery DDL verbatim from the source:

CREATE EXTERNAL TABLE YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME
WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'
OPTIONS (
  format = 'ICEBERG',
  metadata_file_paths = ['gs://your-bucket-name/path/to/your/iceberg/table/metadata/vX.metadata.json']
);

The query engine opens the metadata JSON, walks its manifest tree to find the data files referenced by the current snapshot, and serves queries against those Parquet files.
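That walk can be sketched in a few lines. This is a hypothetical illustration, not any engine's actual implementation; the field names (current-snapshot-id, snapshots, manifest-list) follow the Iceberg table spec's metadata JSON layout:

```python
import json

def current_manifest_list(metadata_json: str) -> str:
    """Given the contents of an Iceberg vN.metadata.json file, return the
    manifest-list path for the current snapshot."""
    meta = json.loads(metadata_json)
    snap_id = meta["current-snapshot-id"]
    # Find the snapshot record the pointer refers to; its manifest list
    # enumerates the manifest files, which in turn list the Parquet data files.
    snapshot = next(s for s in meta["snapshots"] if s["snapshot-id"] == snap_id)
    return snapshot["manifest-list"]
```

A real engine would then fetch the manifest-list (an Avro file), read each manifest it names, and plan scans over the referenced Parquet data files.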

Canonical instances

  • BigQuery CREATE EXTERNAL TABLE ... format = 'ICEBERG' — the source's demo. BigQuery opens the metadata JSON on GCS; query results reflect the snapshot referenced by that JSON at DDL time.
  • Amazon Athena external Iceberg tables — AWS's equivalent pattern for S3-hosted Iceberg tables queried by Athena.
  • Snowflake external stages over Iceberg — Snowflake can read Iceberg tables in this shape as an alternative to registering them through REST catalog sync via Snowflake Open Catalog.
  • Spark / Trino with HadoopCatalog — configure the Iceberg catalog as iceberg.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog and point at a root path; the engine walks the metadata/ directory convention to find the current snapshot. Pre-dates the BigQuery / Athena shape and is the original Hadoop-era pattern.
  • DuckDB ICEBERG_SCAN / iceberg_scan — ad-hoc read from a metadata-file path.

Why this pattern wins

  • Reader doesn't need REST catalog client. Query engines that support a metadata-pointer DDL don't need to authenticate to a separate catalog service; object-store IAM is sufficient.
  • Customer bucket ownership. Especially strong in BYOC-data-ownership contexts where the customer already owns the bucket and wants the shortest possible data path from producer to query.
  • Catalog-service decoupling. No REST-catalog availability dependency in the query path; the only runtime dependency is object-storage read access.
  • Simple to get started. One-line DDL vs catalog-service-setup + OIDC + client-library-install. Fastest path from "there's an Iceberg table here" to "I can SELECT against it".

Why this pattern loses

  • Snapshot staleness. The external table references a specific metadata JSON version — new producer snapshots don't automatically become visible. Readers see a static snapshot pointer until the external table is re-created or the metadata path is updated. Verbatim from the source: "update the external table definition in BigQuery if the location of the latest metadata file changes or you want to query a newer snapshot of the table data."
  • No cross-reader consistency. Each reader holds its own metadata-file pointer. Reader A and Reader B may see different snapshots of the same logical table.
  • No table-level ACL. Object-key-scoped IAM is the policy surface; fine-grained per-column / per-row / time-travel policies aren't expressible.
  • No concurrent-writer serialisation. Without a REST catalog mediating commits, multiple writers racing to update the metadata pointer can step on each other.
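The last point is the core hazard: with a blind PUT, whichever writer uploads last silently discards the other's commit. A toy in-memory sketch (not from the source) of the compare-and-swap that a catalog, or an object-store precondition such as GCS ifGenerationMatch, provides:

```python
class PointerStore:
    """Toy stand-in for whatever holds the current metadata pointer.
    Real systems would use an object-store precondition or a catalog commit."""
    def __init__(self, pointer: str):
        self.pointer = pointer
        self.generation = 1

    def compare_and_swap(self, expected_generation: int, new_pointer: str) -> bool:
        # The commit succeeds only if no other writer advanced the pointer first.
        if self.generation != expected_generation:
            return False  # the loser must rebase on the new snapshot and retry
        self.pointer = new_pointer
        self.generation += 1
        return True

store = PointerStore("v1.metadata.json")
gen = store.generation
assert store.compare_and_swap(gen, "v2.metadata.json")       # writer A wins
assert not store.compare_and_swap(gen, "v2b.metadata.json")  # writer B must retry
```

Without the generation check, writer B's upload would simply replace writer A's, and A's snapshot would vanish from the table history.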

When to use REST catalog instead

  • Multi-engine workload where snapshot consistency matters.
  • Multi-writer workload (Redpanda Iceberg topic + Spark batch job + Flink job → same table) requiring optimistic-concurrency commit coordination.
  • Table-level ACL / RBAC requirements (Unity / Polaris).
  • Auto-discovery of new tables across engines without client-side DDL.

Automation shape

Because the external table is static but the producer's snapshot isn't, this pattern usually needs an automation adjunct to keep the external-table pointer fresh:

  • Scheduled refresh. Cron / Airflow job that runs DROP + CREATE EXTERNAL TABLE on a periodic cadence (every N minutes).
  • Event-driven refresh. Object-storage event notification (GCS Pub/Sub / S3 EventBridge) on new vN.metadata.json writes, triggering a catalog refresh.
  • BigQuery ALTER EXTERNAL TABLE — GCP surface for in-place metadata-pointer update without drop/recreate.
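The event-driven shape can be sketched as follows. All names here (project, dataset, table, connection, bucket paths) are placeholder assumptions, and the handler only renders the DDL; submitting it through the BigQuery client is stubbed as a comment. The statement mirrors the source's CREATE EXTERNAL TABLE, using CREATE OR REPLACE to avoid an explicit drop:

```python
def build_refresh_ddl(project: str, dataset: str, table: str,
                      connection_id: str, metadata_uri: str) -> str:
    """Render a CREATE OR REPLACE EXTERNAL TABLE statement that re-points
    the external table at a newly written vN.metadata.json object."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        f"WITH CONNECTION `{connection_id}`\n"
        "OPTIONS (\n"
        "  format = 'ICEBERG',\n"
        f"  metadata_file_paths = ['{metadata_uri}']\n"
        ");"
    )

def on_gcs_notification(event: dict):
    """Hypothetical Pub/Sub handler for GCS object-finalize events: refresh
    only when a new metadata pointer lands, ignoring data/manifest files."""
    name = event["name"]  # e.g. "path/to/table/metadata/v7.metadata.json"
    if not name.endswith(".metadata.json"):
        return None
    uri = f"gs://{event['bucket']}/{name}"
    ddl = build_refresh_ddl("my-project", "my_dataset", "my_table",
                            "my-project.us.my-connection", uri)
    # In a real function: google.cloud.bigquery.Client().query(ddl).result()
    return ddl
```

The filter on `.metadata.json` matters: Parquet data files and Avro manifests arrive in the same bucket, and refreshing on every object write would thrash the external table definition.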

The source defers to the BigQuery external-tables doc on metadata update for the update mechanism.

Relation to streaming-to-lakehouse shape

This pattern is the read-side companion to streaming-broker-as-lakehouse-Bronze-sink when the query engine is catalog-protocol-agnostic. The producer writes Iceberg (via Iceberg Topics or equivalent); the reader registers via metadata pointer; the net architecture is broker → object store → external-table DDL → query engine with no catalog service in between.

When the reader is catalog-native (Snowflake via Open Catalog, Databricks via Unity Catalog), the broker-native catalog registration pattern replaces this one — the reader sees the table appear automatically without any DDL.

Costs / caveats

  • Metadata pointer discovery is an operator problem. The producer writes v1.metadata.json → v2.metadata.json → ... in sequence; the reader must know which one is current. In REST catalogs, this is automatic; here it's a convention the operator maintains.
  • No time-travel UX. Time-travel queries against an external table that's pinned to a specific snapshot are limited — you can re-create the external table at a different snapshot, but you can't express "SELECT AS OF TIMESTAMP 'yesterday'" naturally.
  • Re-registration breaks query history / permissions. Every DROP + CREATE re-issues IDs, invalidates cached plans, potentially resets grants.
  • Schema-evolution visibility. When the producer evolves the schema (add column), the external table must be re-created to see the new column; readers hitting the stale pointer continue to see the old schema shape.
  • Cost-per-query implications on some engines. BigQuery charges for scanning the data files referenced by the metadata pointer; SELECT * without partition pruning can scan the full Iceberg table.
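The first caveat, pointer discovery, is mechanical if the producer follows the vN.metadata.json naming convention shown earlier. A sketch that picks the highest version from an object listing, comparing numerically rather than lexicographically so v10 beats v9:

```python
import re

_VERSION = re.compile(r"v(\d+)\.metadata\.json$")

def latest_metadata_key(keys):
    """From a listing of a table's metadata/ prefix, return the object key
    with the highest vN version number; non-matching keys are ignored."""
    versioned = [(int(m.group(1)), k) for k in keys
                 if (m := _VERSION.search(k))]
    if not versioned:
        raise ValueError("no vN.metadata.json objects found")
    return max(versioned)[1]
```

A refresh job can run this over the listing of the metadata/ prefix and feed the winning key straight into the external-table DDL; the regex and the helper itself are illustrative, not part of any engine's API.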

Seen in
