CONCEPT Cited by 1 source

VARIANT type¶

VARIANT is a column-level data type that stores semi-structured (JSON-shaped) values with their internal structure preserved and queryable, without requiring the column's schema to be known at write time. It is the open-table-format (OTF) analogue of the VARIANT types already offered by Snowflake, Databricks SQL, and Apache Spark, but standardised at the table-format spec level so that values written by one engine can be read identically by another.

A VARIANT column can hold:

A JSON object ({"foo": 1, "bar": "x", "nested": {"a": [1, 2, 3]}}).
A JSON array.
A JSON scalar (number, string, boolean, null).
Any mixture row-to-row — the type is genuinely heterogeneous.

The engine can query into VARIANT values with path expressions (event:user.id, payload['items'][0]['sku']) and project / cast values out of the structure.
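
As a rough illustration of what path access over heterogeneous values means, the sketch below evaluates dotted and bracketed paths against parsed JSON values in plain Python. The helpers `parse_path` and `get_path` are hypothetical names invented for this example, and the path grammar is a simplification rather than any engine's actual syntax; the column prefix (the `event:` part) is omitted because the path is applied directly to the value.

```python
import json
import re
from typing import Any

# Hypothetical, simplified path grammar: .name, name, [0], ['key'].
# This is an illustration only, not the grammar of any particular engine.
_TOKEN = re.compile(r"\.?([A-Za-z_]\w*)|\[(\d+)\]|\['([^']*)'\]")

def parse_path(path: str) -> list:
    """Split a path into object keys (str) and array indexes (int)."""
    steps = []
    pos = 0
    while pos < len(path):
        m = _TOKEN.match(path, pos)
        if m is None:
            raise ValueError(f"cannot parse path at offset {pos}: {path!r}")
        name, index, quoted = m.groups()
        if index is not None:
            steps.append(int(index))
        else:
            steps.append(name if name is not None else quoted)
        pos = m.end()
    return steps

def get_path(value: Any, path: str) -> Any:
    """Walk `value` along `path`; return None when a step does not apply."""
    for step in parse_path(path):
        if isinstance(step, int) and isinstance(value, list) and step < len(value):
            value = value[step]
        elif isinstance(step, str) and isinstance(value, dict) and step in value:
            value = value[step]
        else:
            return None
    return value

# One "column" holding an object, an array, a scalar, and a null.
rows = [json.loads(s) for s in (
    '{"user": {"id": 7}, "items": [{"sku": "A-1"}]}',
    '[1, 2, 3]',
    '"just a string"',
    'null',
)]

user_ids = [get_path(r, "user.id") for r in rows]
first_skus = [get_path(r, "['items'][0]['sku']") for r in rows]
```

Rows whose shape does not match the path yield `None` here instead of an error. That is a design choice of this sketch, not something the source states about VARIANT semantics.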

Why it matters¶

VARIANT closes a structural gap in lakehouse table formats: how to ingest semi-structured event streams without flattening or stringifying them at write time.

Before VARIANT, the typical patterns were:

Flatten at write time. Pre-define a wide schema with one column per JSON field, parse the JSON at ingest, and discard fields not in the schema. Loses fidelity; requires schema-evolution every time a new field appears upstream.
Store as STRING and re-parse at query time. Lossless, but the engine cannot push down predicates or column projections into the JSON at query plan time — every read has to parse the full string. Slow at scale.
Store as nested STRUCT with a fixed schema. Same loss-of-fidelity problem as flattening; doesn't handle heterogeneous payloads.
Store as binary BLOB with custom serialisation. Fully opaque to the engine; cannot index, query, or analyse without a custom UDF round-trip.
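
The first two patterns above can be contrasted in a few lines of plain Python. This is a sketch under assumed inputs: the event payloads, the `FIXED_SCHEMA` tuple, and both function names are hypothetical and stand in for whatever an ingest pipeline would actually use.

```python
import json

# Hypothetical upstream events; the second adds a field the schema never saw.
events = [
    '{"user_id": 1, "action": "click"}',
    '{"user_id": 2, "action": "view", "referrer": "ads"}',
]

FIXED_SCHEMA = ("user_id", "action")  # assumed columns chosen at design time

def flatten(raw: str) -> dict:
    """Flatten at write time: parse once, keep only the declared columns."""
    doc = json.loads(raw)
    return {col: doc.get(col) for col in FIXED_SCHEMA}

flattened = [flatten(e) for e in events]

# Fields present upstream but absent from the stored rows.
lost_fields = [set(json.loads(e)) - set(row) for e, row in zip(events, flattened)]

def count_action_from_strings(stored: list, action: str) -> tuple:
    """Store as STRING: nothing is lost, but each query re-parses each row."""
    hits = 0
    parses = 0
    for raw in stored:
        doc = json.loads(raw)
        parses += 1
        if doc.get("action") == action:
            hits += 1
    return hits, parses

hits, parses = count_action_from_strings(events, "view")
```

Flattening silently drops `referrer`, while the string approach keeps it but pays one full parse per row per query, even for a predicate that touches a single field.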

VARIANT replaces all four with a first-class engine-aware type that:

Preserves the full JSON structure losslessly.
Allows path-based queries that the engine can plan and optimise.
Supports schema-on-read — fields can be cast / projected into typed columns lazily.
Doesn't require schema evolution every time upstream payloads change shape.
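
The schema-on-read property listed above can be sketched in plain Python: payloads are stored unchanged, and typed columns are derived only when a query asks for them. The `try_cast` helper, the `projection` mapping, and the sample payloads are assumptions made for this example; they do not reproduce any engine's API.

```python
import json
from typing import Any, Callable, Optional

def try_cast(value: Any, caster: Callable[[Any], Any]) -> Optional[Any]:
    """Cast a value, returning None instead of raising when it does not fit."""
    if value is None:
        return None
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None

# Stored payloads with drifting shapes: string vs numeric price, a bad qty,
# an extra field, and a missing field.
payloads = [json.loads(s) for s in (
    '{"price": "19.99", "qty": 2}',
    '{"price": 5, "qty": "three", "coupon": "SPRING"}',
    '{"qty": 1}',
)]

# The read-time projection decides the types; the stored payloads stay as-is.
projection = {"price": float, "qty": int}
typed_rows = [
    {col: try_cast(p.get(col), caster) for col, caster in projection.items()}
    for p in payloads
]
```

The unprojected `coupon` field remains in the stored payload, so a later query can start reading it without any change to what was written.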

Compare concepts/json-column-as-schema-escape-hatch for the architectural pattern VARIANT formalises.

Iceberg v3 disclosure¶

Iceberg v3 adds VARIANT to the spec. From the 2026-05-28 announcement:

"VARIANT provides a standard representation for semi-structured data."

Cross-format point: "These features also work seamlessly across both Delta and Iceberg tables, enabling interoperability without rewriting data." (Source: sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance)

The architectural significance is standardisation across formats and engines. Until now, the VARIANT types in Snowflake, Spark, and Databricks SQL have each been specific to one engine or format; v3 brings the type into the open Iceberg spec, so a VARIANT column written by an Iceberg writer is portable across engines that read v3.

Caveats¶

On-disk encoding deferred to spec. The announcing source does not document the binary encoding of VARIANT values; refer to the Iceberg v3 spec and Databricks docs for details.
Engine-side support varies. v3 VARIANT is GA on Databricks; other Iceberg engines need v3-aware readers and a VARIANT-aware query planner. The source does not disclose a compatibility matrix.
No quantitative numbers on storage overhead, query-path overhead, or compaction interaction.
Indexing strategy undisclosed. Whether VARIANT columns can be indexed (analogous to PostgreSQL's GIN-on-JSONB or Snowflake's micro-partition statistics on VARIANT fields) is not addressed in the announcing source.

Seen in¶

sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance — GA announcement of Iceberg v3 VARIANT on Databricks. Named alongside deletion vectors and row tracking as one of three v3 primitives; cross-format compatibility with Delta disclosed; no mechanism depth.

Related¶

systems/iceberg-v3 — v3 milestone introducing VARIANT to the Iceberg spec.
systems/apache-iceberg — parent table format.
systems/delta-lake — sibling format; cross-compatible VARIANT.
systems/snowflake — pre-existing VARIANT-shaped type that v3 standardises across engines.
concepts/schema-evolution — the cost VARIANT removes from semi-structured-data ingestion.
concepts/json-column-as-schema-escape-hatch — the architectural pattern VARIANT formalises in OTFs.
concepts/open-table-format — umbrella concept.
