PATTERN Cited by 1 source
External table over Iceberg metadata pointer¶
External table over Iceberg metadata pointer is the query-engine-side pattern where a query engine (BigQuery, Athena, Snowflake external stage, Trino with `HadoopCatalog`, DuckDB) registers an Apache Iceberg table by pointing at a specific metadata JSON file in object storage — using a `CREATE EXTERNAL TABLE` or equivalent DDL — rather than authenticating against a REST catalog service.
Source: sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc.
The shape¶
Three moving parts:
- A producer writes Parquet data files + Iceberg manifest + Iceberg metadata JSON to an object-storage bucket. Canonical producer in the source: Redpanda Iceberg Topics with the file-based catalog option enabled. Any Iceberg-writing producer works (Spark, Flink, Iceberg Java SDK, Connect sinks).
- A metadata pointer — the `vN.metadata.json` object key at a known bucket path.
- A query engine registers the table via `CREATE EXTERNAL TABLE ... format = 'ICEBERG' metadata_file_paths = [...]` (BigQuery) or equivalent DDL (Athena, Snowflake, Trino).
Canonical BigQuery DDL verbatim from the source:
```sql
CREATE EXTERNAL TABLE YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME
WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'
OPTIONS (
  format = 'ICEBERG',
  metadata_file_paths = ['gs://your-bucket-name/path/to/your/iceberg/table/metadata/vX.metadata.json']
);
```
The query engine opens the metadata JSON, walks its manifest tree to find the data files referenced by the current snapshot, and serves queries against those Parquet files.
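That walk can be sketched in a few lines of stdlib Python. The field names (`current-snapshot-id`, `snapshots`, `manifest-list`) come from the Iceberg table-metadata spec; the bucket paths and snapshot IDs below are made up for illustration:

```python
import json

# Trimmed v2 table-metadata JSON. Field names follow the Iceberg spec;
# paths and snapshot IDs are illustrative.
metadata = json.loads("""
{
  "format-version": 2,
  "location": "gs://your-bucket-name/path/to/your/iceberg/table",
  "current-snapshot-id": 3055729675574597004,
  "snapshots": [
    {"snapshot-id": 3051729675574597000,
     "manifest-list": "gs://your-bucket-name/path/to/your/iceberg/table/metadata/snap-3051.avro"},
    {"snapshot-id": 3055729675574597004,
     "manifest-list": "gs://your-bucket-name/path/to/your/iceberg/table/metadata/snap-3055.avro"}
  ]
}
""")

def current_manifest_list(md: dict) -> str:
    """Resolve the manifest-list path of the snapshot the pointer pins."""
    snap_id = md["current-snapshot-id"]
    snap = next(s for s in md["snapshots"] if s["snapshot-id"] == snap_id)
    return snap["manifest-list"]

print(current_manifest_list(metadata))
```

From the manifest list the engine then reads manifests and, from those, the Parquet data-file paths — everything downstream of the pointer is ordinary object-store reads.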
Canonical instances¶
- BigQuery `CREATE EXTERNAL TABLE ... format = 'ICEBERG'` — the source's demo. BigQuery opens the metadata JSON on GCS; query results reflect the snapshot referenced by that JSON at DDL time.
- Amazon Athena external Iceberg tables — AWS's equivalent pattern for S3-hosted Iceberg tables queried by Athena.
- Snowflake external stages over Iceberg — Snowflake can read Iceberg tables in this shape as an alternative to registering them through REST catalog sync via Snowflake Open Catalog.
- Spark / Trino with `HadoopCatalog` — configure the Iceberg catalog as `iceberg.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog` and point at a root path; the engine walks the `metadata/` directory convention to find the current snapshot. Pre-dates the BigQuery / Athena shape and is the original Hadoop-era pattern.
- DuckDB `ICEBERG_SCAN` / `iceberg_scan` — ad-hoc read from a metadata-file path.
Why this pattern wins¶
- Reader doesn't need REST catalog client. Query engines that support a metadata-pointer DDL don't need to authenticate to a separate catalog service; object-store IAM is sufficient.
- Customer bucket ownership. Especially strong in BYOC-data-ownership contexts where the customer already owns the bucket and wants the shortest possible data path from producer to query.
- Catalog-service decoupling. No REST-catalog availability dependency in the query path; the only runtime dependency is object-storage read access.
- Simple to get started. One-line DDL vs catalog-service setup + OIDC + client-library install. Fastest path from "there's an Iceberg table here" to "I can `SELECT` against it".
Why this pattern loses¶
- Snapshot staleness. The external table references a specific metadata JSON version — new producer snapshots don't automatically become visible. Readers see a static snapshot pointer until the external table is re-created or the metadata path is updated. Verbatim from the source: "update the external table definition in BigQuery if the location of the latest metadata file changes or you want to query a newer snapshot of the table data."
- No cross-reader consistency. Each reader holds its own metadata-file pointer. Reader A and Reader B may see different snapshots of the same logical table.
- No table-level ACL. Object-key-scoped IAM is the policy surface; fine-grained per-column / per-row / time-travel policies aren't expressible.
- No concurrent-writer serialisation. Without a REST catalog mediating commits, multiple writers racing to update the metadata pointer can step on each other.
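The last bullet is easy to demonstrate. With nothing mediating commits, two writers that both observe `v3` as current will both emit `v4`, and one commit silently disappears. A toy sketch with a dict standing in for the bucket (no real object-store API; a REST catalog would reject the second commit via compare-and-swap on the current pointer):

```python
# A dict stands in for the bucket; keys are object names, values metadata JSON.
bucket = {"metadata/v3.metadata.json": '{"snapshot": "A"}'}

def latest_version(b: dict) -> int:
    """Highest N among vN.metadata.json keys."""
    return max(int(k.split("/v")[1].split(".")[0]) for k in b)

# Two writers race: both read v3 as current, so both target v4.
seen_by_writer_1 = latest_version(bucket)  # 3
seen_by_writer_2 = latest_version(bucket)  # 3
bucket[f"metadata/v{seen_by_writer_1 + 1}.metadata.json"] = '{"snapshot": "B"}'
bucket[f"metadata/v{seen_by_writer_2 + 1}.metadata.json"] = '{"snapshot": "C"}'

# Writer 1's commit is gone: v4 now holds snapshot C, and nothing noticed.
print(bucket["metadata/v4.metadata.json"])  # {"snapshot": "C"}
```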
When to use REST catalog instead¶
- Multi-engine workload where snapshot consistency matters.
- Multi-writer workload (Redpanda Iceberg topic + Spark batch job + Flink job → same table) requiring optimistic-concurrency commit coordination.
- Table-level ACL / RBAC requirements (Unity / Polaris).
- Auto-discovery of new tables across engines without client-side DDL.
Automation shape¶
Because the external table is static but the producer's snapshot isn't, this pattern usually needs an automation adjunct to keep the external-table pointer fresh:
- Scheduled refresh. Cron / Airflow job that runs `DROP` + `CREATE EXTERNAL TABLE` at a periodic cadence (every N minutes).
- Event-driven refresh. Object-storage event notification (GCS Pub/Sub / S3 EventBridge) on new `vN.metadata.json` writes, triggering a catalog refresh.
- BigQuery `ALTER EXTERNAL TABLE` — GCP surface for in-place metadata-pointer update without drop/recreate.
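Whichever trigger fires, the refresh job reduces to two steps: pick the highest `vN.metadata.json` under the table's `metadata/` prefix, then re-issue the DDL against it. A hedged stdlib sketch — the key listing is stubbed (a real job would list the GCS prefix), and the table / connection names are the placeholders from the source's DDL. Note the numeric sort: a lexical sort would put `v2` after `v10`:

```python
import re

def newest_metadata_key(keys: list) -> str:
    """Pick the vN.metadata.json key with the highest N (numeric, not lexical)."""
    pat = re.compile(r"v(\d+)\.metadata\.json$")
    versioned = [(int(m.group(1)), k) for k in keys if (m := pat.search(k))]
    return max(versioned)[1]

def refresh_ddl(bucket: str, key: str) -> str:
    """Render the re-registration DDL for the new pointer."""
    return (
        "CREATE OR REPLACE EXTERNAL TABLE "
        "YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME\n"
        "WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'\n"
        "OPTIONS (\n"
        "  format = 'ICEBERG',\n"
        f"  metadata_file_paths = ['gs://{bucket}/{key}']\n"
        ");"
    )

# Stub: in a real job these come from listing the metadata/ prefix in GCS.
keys = [
    "path/to/table/metadata/v1.metadata.json",
    "path/to/table/metadata/v2.metadata.json",
    "path/to/table/metadata/v10.metadata.json",
]
print(refresh_ddl("your-bucket-name", newest_metadata_key(keys)))
```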
The source defers to the BigQuery external-tables doc on metadata update for the update mechanism.
Relation to streaming-to-lakehouse shape¶
This pattern is the read-side companion to streaming-broker-as-lakehouse-Bronze-sink when the query engine is catalog-protocol-agnostic. The producer writes Iceberg (via Iceberg Topics or equivalent); the reader registers via metadata pointer; the net architecture is broker → object store → external-table DDL → query engine with no catalog service in between.
When the reader is catalog-native (Snowflake via Open Catalog, Databricks via Unity Catalog), the broker-native catalog registration pattern replaces this one — the reader sees the table appear automatically without any DDL.
Costs / caveats¶
- Metadata pointer discovery is an operator problem. The producer writes `v1.metadata.json` → `v2.metadata.json` → ... in sequence; the reader must know which one is current. In REST catalogs, this is automatic; here it's a convention the operator maintains.
- No time-travel UX. Time-travel queries against an external table that's pinned to a specific snapshot are limited — you can re-create the external table at a different snapshot, but you can't express "`SELECT AS OF TIMESTAMP 'yesterday'`" naturally.
- Re-registration breaks query history / permissions. Every `DROP` + `CREATE` re-issues IDs, invalidates cached plans, potentially resets grants.
- Schema-evolution visibility. When the producer evolves the schema (add column), the external table must be re-created to see the new column; readers hitting the stale pointer continue to see the old schema shape.
- Cost-per-query implications on some engines. BigQuery charges for scanning the data files referenced by the metadata pointer; `SELECT *` without partition pruning can scan the full Iceberg table.
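The schema-evolution caveat follows directly from pinning: the schema an engine serves is the one embedded in the pinned metadata JSON, not whatever the producer last wrote. A minimal diff of two metadata versions (spec field names `schemas` / `current-schema-id`; column names and IDs here are illustrative):

```python
import json

def field_names(md: dict) -> set:
    """Column names of the current schema inside one metadata JSON."""
    schema = next(s for s in md["schemas"]
                  if s["schema-id"] == md["current-schema-id"])
    return {f["name"] for f in schema["fields"]}

# v1: the schema the external table was registered against.
v1 = json.loads(
    '{"current-schema-id": 0, "schemas": [{"schema-id": 0, "fields": '
    '[{"id": 1, "name": "key"}, {"id": 2, "name": "value"}]}]}'
)
# v2: the producer added a column afterwards.
v2 = json.loads(
    '{"current-schema-id": 1, "schemas": [{"schema-id": 1, "fields": '
    '[{"id": 1, "name": "key"}, {"id": 2, "name": "value"}, '
    '{"id": 3, "name": "added_at"}]}]}'
)

# Columns a reader pinned to v1 never sees:
print(sorted(field_names(v2) - field_names(v1)))  # ['added_at']
```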
Seen in¶
- sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc
— canonical wiki disclosure. Redpanda 2025-05-13 BYOC-beta
tutorial uses BigQuery
`CREATE EXTERNAL TABLE` over a GCS-hosted Iceberg metadata JSON as the reader-side integration for a Redpanda Iceberg topic configured with file-based catalog.
Related¶
- concepts/iceberg-file-based-catalog — the producer-side catalog shape this pattern reads from.
- concepts/iceberg-catalog-rest-sync — the catalog-service alternative this pattern replaces when readers are catalog-protocol-agnostic.
- concepts/iceberg-topic — the canonical streaming-to-Iceberg primitive that produces into this shape.
- systems/apache-iceberg — the table format.
- systems/redpanda-iceberg-topics — the producer in the canonical wiki source.
- systems/google-bigquery · systems/google-cloud-storage — the reader and object store in the worked example.
- patterns/streaming-broker-as-lakehouse-bronze-sink — the producer-side pattern this completes.
- patterns/broker-native-iceberg-catalog-registration — the REST-catalog sibling reader-side pattern.
- concepts/open-table-format — broader architectural context.
- concepts/byoc-data-ownership-for-iceberg — the trust-boundary context in which this pattern is typically preferred.