

External table over Iceberg metadata pointer

External table over Iceberg metadata pointer is the query-engine-side pattern where a query engine (BigQuery, Athena, Snowflake external stage, Trino with HadoopCatalog, DuckDB) registers an Apache Iceberg table by pointing at a specific metadata JSON file in object storage — using a CREATE EXTERNAL TABLE or equivalent DDL — rather than authenticating against a REST catalog service.

Source: sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc.

The shape

Three moving parts:

  1. A producer writes Parquet data files + Iceberg manifest + Iceberg metadata JSON to an object-storage bucket. Canonical producer in the source: Redpanda Iceberg Topics with the file-based catalog option enabled. Any Iceberg-writing producer works (Spark, Flink, Iceberg Java SDK, Connect sinks).
  2. A metadata pointer — the vN.metadata.json object key at a known bucket path.
  3. A query engine registers the table via CREATE EXTERNAL TABLE ... format = 'ICEBERG' metadata_file_paths = [...] (BigQuery) or equivalent DDL (Athena, Snowflake, Trino).

Canonical BigQuery DDL verbatim from the source:

CREATE EXTERNAL TABLE YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME
WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'
OPTIONS (
  format = 'ICEBERG',
  metadata_file_paths = ['gs://your-bucket-name/path/to/your/iceberg/table/metadata/vX.metadata.json']
);

The query engine opens the metadata JSON, walks its manifest tree to find the data files referenced by the current snapshot, and serves queries against those Parquet files.
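That walk can be sketched in a few lines. This is a hypothetical illustration, not any engine's actual implementation; the field names (current-snapshot-id, snapshots, manifest-list) follow the Iceberg table spec's metadata JSON layout:

```python
import json

def current_manifest_list(metadata_json: str) -> str:
    """Given the contents of an Iceberg vN.metadata.json file, return the
    manifest-list path for the current snapshot."""
    meta = json.loads(metadata_json)
    snap_id = meta["current-snapshot-id"]
    # Find the snapshot record the pointer refers to; its manifest list
    # enumerates the manifest files, which in turn list the Parquet data files.
    snapshot = next(s for s in meta["snapshots"] if s["snapshot-id"] == snap_id)
    return snapshot["manifest-list"]
```

A real engine would then fetch the manifest-list (an Avro file), read each manifest it names, and plan scans over the referenced Parquet data files.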

Canonical instances

  • BigQuery CREATE EXTERNAL TABLE ... format = 'ICEBERG' — the source's demo. BigQuery opens the metadata JSON on GCS; query results reflect the snapshot referenced by that JSON at DDL time.
  • Amazon Athena external Iceberg tables — AWS's equivalent pattern for S3-hosted Iceberg tables queried by Athena.
  • Snowflake external stages over Iceberg — Snowflake can read Iceberg tables in this shape as an alternative to registering them through REST catalog sync via Snowflake Open Catalog.
  • Spark / Trino with HadoopCatalog — configure the Iceberg catalog as iceberg.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog and point at a root path; the engine walks the metadata/ directory convention to find the current snapshot. Pre-dates the BigQuery / Athena shape and is the original Hadoop-era pattern.
  • DuckDB ICEBERG_SCAN / iceberg_scan — ad-hoc read from a metadata-file path.

Why this pattern wins

  • Reader doesn't need REST catalog client. Query engines that support a metadata-pointer DDL don't need to authenticate to a separate catalog service; object-store IAM is sufficient.
  • Customer bucket ownership. Especially strong in BYOC-data-ownership contexts where the customer already owns the bucket and wants the shortest possible data path from producer to query.
  • Catalog-service decoupling. No REST-catalog availability dependency in the query path; the only runtime dependency is object-storage read access.
  • Simple to get started. One-line DDL vs catalog-service-setup + OIDC + client-library-install. Fastest path from "there's an Iceberg table here" to "I can SELECT against it".

Why this pattern loses

  • Snapshot staleness. The external table references a specific metadata JSON version — new producer snapshots don't automatically become visible. Readers see a static snapshot pointer until the external table is re-created or the metadata path is updated. Verbatim from the source: "update the external table definition in BigQuery if the location of the latest metadata file changes or you want to query a newer snapshot of the table data."
  • No cross-reader consistency. Each reader holds its own metadata-file pointer. Reader A and Reader B may see different snapshots of the same logical table.
  • No table-level ACL. Object-key-scoped IAM is the policy surface; fine-grained per-column / per-row / time-travel policies aren't expressible.
  • No concurrent-writer serialisation. Without a REST catalog mediating commits, multiple writers racing to update the metadata pointer can step on each other.
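The last point is the core hazard: with a blind PUT, whichever writer uploads last silently discards the other's commit. A toy in-memory sketch (not from the source) of the compare-and-swap that a catalog, or an object-store precondition such as GCS ifGenerationMatch, provides:

```python
class PointerStore:
    """Toy stand-in for whatever holds the current metadata pointer.
    Real systems would use an object-store precondition or a catalog commit."""
    def __init__(self, pointer: str):
        self.pointer = pointer
        self.generation = 1

    def compare_and_swap(self, expected_generation: int, new_pointer: str) -> bool:
        # The commit succeeds only if no other writer advanced the pointer first.
        if self.generation != expected_generation:
            return False  # the loser must rebase on the new snapshot and retry
        self.pointer = new_pointer
        self.generation += 1
        return True

store = PointerStore("v1.metadata.json")
gen = store.generation
assert store.compare_and_swap(gen, "v2.metadata.json")       # writer A wins
assert not store.compare_and_swap(gen, "v2b.metadata.json")  # writer B must retry
```

Without the generation check, writer B's upload would simply replace writer A's, and A's snapshot would vanish from the table history.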

When to use REST catalog instead

  • Multi-engine workload where snapshot consistency matters.
  • Multi-writer workload (Redpanda Iceberg topic + Spark batch job + Flink job → same table) requiring optimistic-concurrency commit coordination.
  • Table-level ACL / RBAC requirements (Unity / Polaris).
  • Auto-discovery of new tables across engines without client-side DDL.

Automation shape

Because the external table is static but the producer's snapshot isn't, this pattern usually needs an automation adjunct to keep the external-table pointer fresh:

  • Scheduled refresh. Cron / Airflow job that runs DROP + CREATE EXTERNAL TABLE on a periodic cadence (every N minutes).
  • Event-driven refresh. Object-storage event notification (GCS Pub/Sub / S3 EventBridge) on new vN.metadata.json writes, triggering a catalog refresh.
  • BigQuery ALTER EXTERNAL TABLE — GCP surface for in-place metadata-pointer update without drop/recreate.
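The event-driven shape can be sketched as follows. All names here (project, dataset, table, connection, bucket paths) are placeholder assumptions, and the handler only renders the DDL; submitting it through the BigQuery client is stubbed as a comment. The statement mirrors the source's CREATE EXTERNAL TABLE, using CREATE OR REPLACE to avoid an explicit drop:

```python
def build_refresh_ddl(project: str, dataset: str, table: str,
                      connection_id: str, metadata_uri: str) -> str:
    """Render a CREATE OR REPLACE EXTERNAL TABLE statement that re-points
    the external table at a newly written vN.metadata.json object."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        f"WITH CONNECTION `{connection_id}`\n"
        "OPTIONS (\n"
        "  format = 'ICEBERG',\n"
        f"  metadata_file_paths = ['{metadata_uri}']\n"
        ");"
    )

def on_gcs_notification(event: dict):
    """Hypothetical Pub/Sub handler for GCS object-finalize events: refresh
    only when a new metadata pointer lands, ignoring data/manifest files."""
    name = event["name"]  # e.g. "path/to/table/metadata/v7.metadata.json"
    if not name.endswith(".metadata.json"):
        return None
    uri = f"gs://{event['bucket']}/{name}"
    ddl = build_refresh_ddl("my-project", "my_dataset", "my_table",
                            "my-project.us.my-connection", uri)
    # In a real function: google.cloud.bigquery.Client().query(ddl).result()
    return ddl
```

The filter on `.metadata.json` matters: Parquet data files and Avro manifests arrive in the same bucket, and refreshing on every object write would thrash the external table definition.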

The source defers to the BigQuery external-tables doc on metadata update for the update mechanism.

Relation to streaming-to-lakehouse shape

This pattern is the read-side companion to streaming-broker-as-lakehouse-Bronze-sink when the query engine is catalog-protocol-agnostic. The producer writes Iceberg (via Iceberg Topics or equivalent); the reader registers via metadata pointer; the net architecture is broker → object store → external-table DDL → query engine with no catalog service in between.

When the reader is catalog-native (Snowflake via Open Catalog, Databricks via Unity Catalog), the broker-native catalog registration pattern replaces this one — the reader sees the table appear automatically without any DDL.

Costs / caveats

  • Metadata pointer discovery is an operator problem. The producer writes v1.metadata.json → v2.metadata.json → ... in sequence; the reader must know which one is current. In REST catalogs, this is automatic; here it's a convention the operator maintains.
  • No time-travel UX. Time-travel queries against an external table that's pinned to a specific snapshot are limited — you can re-create the external table at a different snapshot, but you can't express "SELECT AS OF TIMESTAMP 'yesterday'" naturally.
  • Re-registration breaks query history / permissions. Every DROP + CREATE re-issues IDs, invalidates cached plans, potentially resets grants.
  • Schema-evolution visibility. When the producer evolves the schema (add column), the external table must be re-created to see the new column; readers hitting the stale pointer continue to see the old schema shape.
  • Cost-per-query implications on some engines. BigQuery charges for scanning the data files referenced by the metadata pointer; SELECT * without partition pruning can scan the full Iceberg table.
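The first caveat, pointer discovery, is mechanical if the producer follows the vN.metadata.json naming convention shown earlier. A sketch that picks the highest version from an object listing, comparing numerically rather than lexicographically so v10 beats v9:

```python
import re

_VERSION = re.compile(r"v(\d+)\.metadata\.json$")

def latest_metadata_key(keys):
    """From a listing of a table's metadata/ prefix, return the object key
    with the highest vN version number; non-matching keys are ignored."""
    versioned = [(int(m.group(1)), k) for k in keys
                 if (m := _VERSION.search(k))]
    if not versioned:
        raise ValueError("no vN.metadata.json objects found")
    return max(versioned)[1]
```

A refresh job can run this over the listing of the metadata/ prefix and feed the winning key straight into the external-table DDL; the regex and the helper itself are illustrative, not part of any engine's API.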

Seen in
