Skip to content

PATTERN Cited by 1 source

REPLACE USING and REPLACE ON for selective overwrite

Problem

Selective overwrites — "replace the rows matching this predicate with these new rows, atomically" — were historically tied to partitioned table layouts. Databricks' Dynamic Partition Overwrite (legacy) is the canonical instance:

SET spark.sql.sources.partitionOverwriteMode = dynamic;
INSERT OVERWRITE TABLE events
SELECT * FROM staging WHERE date = '2025-01-15'
-- Replaces only the date='2025-01-15' partition in events
-- Requires partitioning by date
-- Requires the Spark config flag

Three structural limitations: - Tied to partitioned tables. Overwrite granularity = partition granularity = whatever column the table is partitioned on. - Requires a Spark config. "requires a Spark config" — opt-in global session state; easy to mis-set; not portable across compute surfaces. - Can't run on all compute surfaces. SQL warehouses, Serverless compute, and other non-classic-Spark execution paths may not support the legacy mechanism uniformly.

The 2026-06-01 Databricks "Debunking 8 data layout myths" post addresses this directly as Myth #8:

Myth: "Being able to selectively overwrite data is only available through Dynamic Partition Overwrites."

Solution

Use REPLACE USING and REPLACE ON: layout-agnostic, compute- agnostic SQL primitives that selectively overwrite rows on any column predicate. The 2026-06-01 source:

"Selective overwrites work on Liquid tables natively. Databricks supports REPLACE USING and REPLACE ON, two SQL syntaxes for selectively overwriting data on any data layout: Liquid Clustered, partitioned, or plain unclustered tables. Unlike Dynamic Partition Overwrite which requires a Spark config, REPLACE USING and REPLACE ON can be used on any compute: classic clusters, SQL warehouses, and Serverless. The operation is atomic and matches on any column you choose."

Structural pieces

Piece What it does
REPLACE USING Selectively overwrites rows matched by a USING (key) clause; semantically a MERGE-flavoured upsert with replacement semantics.
REPLACE ON Selectively overwrites rows matched by an ON predicate clause; matches on any column expression.
Layout-agnostic Works on partitioned, Liquid-clustered, or unclustered tables.
Compute-agnostic Available on classic clusters, SQL warehouses, and Serverless. No Spark config required.
Atomic Operation is committed via the table format's transaction log; partial-overwrite states aren't visible to readers.
Any-column matching Matching predicate can use any column, not just the partition column.

In practice (illustrative — exact SQL syntax not disclosed in source)

-- Replace rows for a specific date
REPLACE INTO events ON date = '2025-01-15'
SELECT * FROM staging WHERE date = '2025-01-15';

-- Or via key matching
REPLACE INTO events USING (event_id)
SELECT * FROM staging WHERE date = '2025-01-15';

-- Both work on any layout: partitioned, clustered, unclustered
-- Both work on any compute: classic, SQL warehouse, Serverless
-- Both atomic: readers see old or new state, not partial

Why this matters

Decouples overwrite semantic from layout

Pre-REPLACE, the question "can I selectively overwrite?" was answered by "are you partitioned on the right column?". This forced architects to make partitioning decisions based on overwrite ergonomics rather than workload performance — yet another reason teams over-partitioned their tables.

With REPLACE USING / ON, overwrite semantic is available on any layout. The partition vs cluster decision becomes purely about filter / join performance; overwrite ergonomics no longer push the choice toward partitioning.

Removes the Spark-config opt-in

spark.sql.sources.partitionOverwriteMode = dynamic is a session-level config flag. Forgetting to set it produces "replace the entire table" semantics instead of "replace the matching partition" — a silently-destructive failure mode. REPLACE USING / ON is explicit in the SQL: the scope of the overwrite is in the query, not in a session config.

Portable across compute surfaces

Modern Lakehouse workloads run on multiple compute substrates: classic Spark clusters for ETL, SQL warehouses for BI, Serverless for ad-hoc analysis. Dynamic Partition Overwrite was tied to the classic Spark execution model; REPLACE USING / ON works identically across all three.

Composition with patterns/clustering-keys-as-engine-input

REPLACE USING / ON composes with clustering-as-engine-input: the overwrite operation works on the table's layout regardless of what the engine has chosen. Architects can change clustering keys without breaking selective-overwrite pipelines; the SQL surface is stable across layout evolution.

This is one of several layout-agnostic operational primitives that make Liquid Clustering's "layout as implementation detail" thesis operationally viable:

Primitive Layout-dependence pre-Liquid Layout-independence post-Liquid
Filter pruning Required matching partition column concepts/file-level-data-skipping on any column with stats
Selective overwrite Required partition alignment REPLACE USING / ON on any column
Metadata-only DELETE Required partition alignment concepts/metadata-only-operation on file-aligned predicates
Co-clustered join Required matching partition columns concepts/co-clustered-join on clustering keys

Sibling overwrite patterns on the wiki

  • MERGE INTO — the broader upsert primitive; supports complex match conditions and multi-action behaviour (UPDATE / DELETE / INSERT in one statement).
  • DELETE + INSERT — the manual decomposition; loses atomicity unless wrapped in an explicit transaction.
  • OVERWRITE + filter"DELETE WHERE … then INSERT" expressed as INSERT OVERWRITE … WHERE; legacy pattern; depends on whether the table format supports atomic insert-overwrite-with-filter.
  • Dynamic Partition Overwrite — the partition-tied mechanism REPLACE USING / ON supersedes.

REPLACE USING / ON sits between MERGE INTO (general) and Dynamic Partition Overwrite (partition-specific) — it has the layout- agnosticism of MERGE INTO with the simpler "replace these rows" semantic of Dynamic Partition Overwrite.

Seen in

  • sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfoFirst wiki canonicalisation as a named pattern. The Myth #8 debunking is the source's positioning: selective overwrite is no longer a partition-only capability. Names the cross-compute portability (classic, SQL warehouses, Serverless) and atomic property explicitly. Reserved for future ingests: exact SQL syntax (the post links to documentation but doesn't reproduce it inline), conflict semantics under concurrent REPLACE USING / ON operations, performance characteristics on large match sets, interaction with row-level concurrency.
Last updated · 542 distilled / 1,571 read