
PATTERN Cited by 1 source

Snapshot sync mode for batch rebuild

Pattern. When a managed sync pipeline offers both snapshot (full replace) and triggered (incremental upsert) modes, choose snapshot mode whenever the per-cycle delta exceeds ~10% of the upstream table. Snapshot rewrites every row, but at high delta proportions its bulk-copy path is up to 10× faster than per-row incremental upserts.

The pattern is canonicalised from the 2026-05-20 Databricks marketing-campaigns post, applied to Lakebase Synced Tables:

"When more than 10% of the data is updated, we recommend snapshot mode, which delivers 10x better performance than triggered mode."

When to apply

The decision variable is the delta proportion per sync cycle, not the cadence:

| Workload shape | Per-cycle delta | Best mode |
| --- | --- | --- |
| Nightly customer-segment recompute (replaces most rows) | >10% | Snapshot |
| Ad-hoc dictionary update (a few new categories) | <10% | Triggered |
| Real-time event stream (small rapid changes) | <10% but latency-critical | Continuous |

Cadence does not drive the choice: a nightly recompute that replaces 80% of segments still benefits from snapshot mode, even though a daily refresh is far from real-time.
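The decision table above can be sketched as a small Python function. This is a minimal illustration, not part of any Lakebase API: the names `SyncMode`, `choose_sync_mode`, and `SNAPSHOT_THRESHOLD` are hypothetical, and the 0.10 value is simply the ~10% figure from the quoted recommendation. The table does not cover a latency-critical workload with a large delta, so the sketch assumes the delta rule is checked first.

```python
from enum import Enum


class SyncMode(Enum):
    SNAPSHOT = "snapshot"
    TRIGGERED = "triggered"
    CONTINUOUS = "continuous"


# Assumption: the ~10% rule of thumb quoted above; tune per system.
SNAPSHOT_THRESHOLD = 0.10


def choose_sync_mode(delta_fraction: float, latency_critical: bool = False) -> SyncMode:
    """Pick a sync mode from the per-cycle delta proportion (0.0 to 1.0).

    Cadence is deliberately not an input: only the share of rows that
    change per cycle (and whether freshness is latency-critical) matters.
    """
    if not 0.0 <= delta_fraction <= 1.0:
        raise ValueError("delta_fraction must be between 0 and 1")
    if delta_fraction > SNAPSHOT_THRESHOLD:
        # Assumption: a large delta takes precedence over latency needs.
        return SyncMode.SNAPSHOT
    if latency_critical:
        return SyncMode.CONTINUOUS
    return SyncMode.TRIGGERED
```

For example, `choose_sync_mode(0.8)` selects snapshot for a nightly recompute that replaces 80% of rows, while `choose_sync_mode(0.02)` selects triggered for a small dictionary update.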

Why snapshot beats triggered at high deltas

This result is counterintuitive: snapshot rewrites the entire table, while triggered updates only the changed rows. Surely triggered is more efficient?

The crossover happens because:

  • Triggered pays a per-row diff/merge cost. Each changed row goes through a primary-key lookup, a conflict check, and an upsert. This cost scales linearly with delta size, with significant per-row overhead.
  • Snapshot is a bulk copy. Storage-compute-separated systems like Lakebase can stream a full table load efficiently from object storage (which the snapshot reads from) without per-row conflict resolution.

For small deltas, the fixed cost of copying the whole table dominates and triggered wins. For large deltas, the per-row diff cost dominates and snapshot wins. The ~10% threshold is where the two costs cross in the disclosed Lakebase Synced Tables implementation.
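The crossover can be illustrated with a deliberately simple linear cost model. This is a sketch under stated assumptions, not a description of how Lakebase measures cost: the function names and the cost values are hypothetical, fixed overheads are ignored, and the 10:1 ratio between per-row upsert cost and per-row bulk-copy cost is chosen only because it reproduces the ~10% threshold.

```python
def triggered_cost(total_rows: int, delta_fraction: float, per_row_upsert_cost: float) -> float:
    """Triggered mode pays the upsert cost only for the changed rows."""
    return total_rows * delta_fraction * per_row_upsert_cost


def snapshot_cost(total_rows: int, per_row_bulk_cost: float) -> float:
    """Snapshot mode pays the (cheaper) bulk-copy cost for every row."""
    return total_rows * per_row_bulk_cost


def crossover_fraction(per_row_upsert_cost: float, per_row_bulk_cost: float) -> float:
    """Delta fraction at which both modes cost the same in this model."""
    return per_row_bulk_cost / per_row_upsert_cost


# Hypothetical costs in arbitrary units: an upsert is 10x a bulk-copied row.
UPSERT_COST = 10.0
BULK_COST = 1.0
ROWS = 1_000_000

threshold = crossover_fraction(UPSERT_COST, BULK_COST)
# Ratio of triggered to snapshot cost when every row changes.
speedup_at_full_rewrite = triggered_cost(ROWS, 1.0, UPSERT_COST) / snapshot_cost(ROWS, BULK_COST)
```

Under these assumed costs the crossover lands at a delta fraction of 0.1, and a full rewrite is 10 times cheaper in snapshot mode, which is consistent with the "up to 10×" figure. A system with a different ratio of per-row overhead to bulk-copy efficiency would cross at a different point.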

Canonical use case: marketing-campaign customer segments

From the 2026-05-20 post:

"in our case, and very often, customer segments are recomputed nightly in batch, replacing a significant portion of the dataset. When more than 10% of the data is updated, we recommend snapshot mode."

Customer segmentation pipelines typically:

  • Ingest user behaviour data over a 24-hour window.
  • Recompute every active segment from scratch (because criteria may overlap, segments aren't naturally incremental).
  • Replace the segment-membership table with the new computation output.

The result: a large share of rows, potentially nearly all of them, changes every night. In triggered mode that means per-row upserts on millions of rows, each paying the full per-row overhead. Snapshot mode bulk-copies the new table in one pass.
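To apply the 10% rule, a team first needs to know its own delta proportion. The sketch below measures it by comparing two in-memory versions of a segment-membership table keyed by primary key. The function name, the sample data, and the choice of denominator (the union of keys seen in either version) are assumptions made for illustration; on tables with millions of rows the same comparison would more plausibly be done in the warehouse than in Python.

```python
_MISSING = object()  # sentinel for a key absent from one version


def delta_fraction(previous: dict, current: dict) -> float:
    """Share of rows that were inserted, updated, or deleted between versions.

    Assumption: the denominator is the union of primary keys in both
    versions, which keeps the result between 0.0 and 1.0.
    """
    keys = previous.keys() | current.keys()
    if not keys:
        return 0.0
    changed = sum(
        1 for key in keys
        if previous.get(key, _MISSING) != current.get(key, _MISSING)
    )
    return changed / len(keys)


# Hypothetical customer_id -> segment mappings for two consecutive nights.
last_night = {1: "bronze", 2: "silver", 3: "gold", 4: "gold"}
tonight = {1: "silver", 2: "silver", 3: "bronze", 5: "gold"}

fraction = delta_fraction(last_night, tonight)  # 4 of 5 keys changed
```

Here customers 1 and 3 changed segment, customer 4 was removed, and customer 5 was added, so four of the five keys differ and the delta is well above the 10% threshold.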

Operational implications

  • Customer makes the call. The post frames this as a recommendation, not an automatic decision. The customer is expected to understand their workload's delta proportion and pick the mode accordingly.
  • Mode is a per-table configuration. Different synced tables can use different modes, so the decision is made separately for each table's pipeline.
  • Cadence is a separate decision. Snapshot mode can be scheduled (nightly batch) or on-demand. The 10% rule determines mode; the freshness requirement determines cadence.
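The separation of mode and cadence can be sketched as a per-table configuration record. This is a hypothetical structure for illustration only and does not reflect the actual Lakebase Synced Tables configuration surface: `SyncedTableConfig`, its field names, the table names, and the cron string are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

VALID_MODES = {"snapshot", "triggered", "continuous"}


@dataclass(frozen=True)
class SyncedTableConfig:
    """Hypothetical per-table sync settings: mode and cadence are independent."""
    table: str
    mode: str                       # chosen from the delta proportion
    schedule: Optional[str] = None  # chosen from freshness needs; None = on-demand

    def __post_init__(self) -> None:
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown sync mode: {self.mode!r}")


# Hypothetical tables: each one gets its own mode and its own cadence.
configs = [
    SyncedTableConfig("customer_segments", mode="snapshot", schedule="0 2 * * *"),
    SyncedTableConfig("category_dictionary", mode="triggered"),
]
```

Keeping `mode` and `schedule` as separate fields mirrors the point above: changing how often a table syncs does not by itself change which mode suits it.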

Generalisation beyond Lakebase

The pattern generalises wherever a sync pipeline offers a similar bulk-rebuild vs incremental tradeoff:

  • Database replication tools with both full-load and incremental-load modes.
  • Materialized view refresh strategies (full vs incremental).
  • Search index rebuilds vs incremental updates.

The crossover threshold (here ~10%) is implementation-specific — it depends on the per-row overhead and the bulk-copy efficiency of the specific system. Lakebase's 10% number is the disclosed threshold for Synced Tables specifically; other systems will have different crossover points.

Trade-offs and constraints

| Pro | Con |
| --- | --- |
| 10× faster than triggered at high deltas | Discards triggered's "only update what changed" advantage at low deltas |
| Bulk copy is operationally simpler (no per-row conflict resolution) | Replaces the entire table: temporary disk/network spike during copy |
| Storage-compute separation makes bulk-copy efficient | Not suitable for real-time freshness (latency floor is the snapshot cadence) |

Anti-patterns this avoids

  • Always-incremental. Triggered mode is appealing because "only update what changed" sounds intuitively cheaper. For workloads that recompute most of the table, this intuition can be wrong by up to 10×.
  • Manual sync code. Without managed snapshot mode, the customer would write and maintain a Lakehouse → OLTP refresh pipeline, with all the operational tax that implies.

