Skip to content

CONCEPT Cited by 1 source

VARIANT type

VARIANT is a column-level data type that stores semi-structured (JSON-shaped) values with their internal structure preserved and queryable, without requiring the column's schema to be known at write time. It is the OTF analogue of the long-established VARIANT types in Snowflake, Databricks SQL, and Apache Spark — but standardised at the table-format spec level so that values written by one engine can be read identically by another.

A VARIANT column can hold:

  • A JSON object ({"foo": 1, "bar": "x", "nested": {"a": [1, 2, 3]}}).
  • A JSON array.
  • A JSON scalar (number, string, boolean, null).
  • Any mixture row-to-row — the type is genuinely heterogeneous.

The engine can query into VARIANT values with path expressions (event:user.id, payload['items'][0]['sku']) and project / cast values out of the structure.

Why it matters

VARIANT closes a structural gap in lakehouse table formats: how to ingest semi-structured event streams without flattening or stringifying them at write time.

Before VARIANT, the typical patterns were:

  1. Flatten at write time. Pre-define a wide schema with one column per JSON field, parse the JSON at ingest, and discard fields not in the schema. Loses fidelity; requires schema-evolution every time a new field appears upstream.
  2. Store as STRING and re-parse at query time. Lossless, but the engine cannot push down predicates or column projections into the JSON at query plan time — every read has to parse the full string. Slow at scale.
  3. Store as nested STRUCT with a fixed schema. Same loss-of-fidelity problem as flattening; doesn't handle heterogeneous payloads.
  4. Store as binary BLOB with custom serialisation. Fully opaque to the engine; cannot index, query, or analyse without a custom UDF round-trip.

VARIANT replaces all four with a first-class engine-aware type that:

  • Preserves the full JSON structure losslessly.
  • Allows path-based queries that the engine can plan and optimise.
  • Supports schema-on-read — fields can be cast / projected into typed columns lazily.
  • Doesn't require schema evolution every time upstream payloads change shape.

Compare concepts/json-column-as-schema-escape-hatch for the architectural pattern VARIANT formalises.

Iceberg v3 disclosure

Iceberg v3 adds VARIANT to the spec. From the 2026-05-28 announcement:

"VARIANT provides a standard representation for semi-structured data."

Cross-format point: "These features also work seamlessly across both Delta and Iceberg tables, enabling interoperability without rewriting data." (Source: sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance)

The architectural significance is standardisation across formats and engines. Snowflake's VARIANT, Spark's VARIANT, and Databricks SQL's VARIANT have been format-internal until now; v3 brings the type to the open Iceberg spec, so a VARIANT column written by an Iceberg writer is portable across engines that read v3.

Caveats

  • On-disk encoding deferred to spec. The announcing source does not document the binary encoding of VARIANT values; refer to the Iceberg v3 spec and Databricks docs for details.
  • Engine-side support varies. v3 VARIANT is GA on Databricks; other Iceberg engines need v3-aware readers and a VARIANT-aware query planner. Compatibility-matrix not disclosed.
  • No quantitative numbers on storage overhead, query-path overhead, or compaction interaction.
  • Indexing strategy undisclosed. Whether VARIANT columns can be indexed (analogous to PostgreSQL's GIN-on-JSONB or Snowflake's micro-partition statistics on VARIANT fields) is not addressed in the announcing source.

Seen in

Last updated · 542 distilled / 1,571 read