PATTERN Cited by 2 sources

CDF incremental replacing full rescan

The CDF incremental replacing full rescan pattern uses Delta Lake's Change Data Feed (CDF) to replace full-table rescans and overwrites of multi-terabyte upstream tables with change-driven incremental reads of "only records that have actually changed since the last run". In the Octopus Energy MHHS rebuild, this single move cut processed rows from 25 billion to 300 million per run (a 98.8% reduction), improved freshness from weekly to daily, and was named the single highest-leverage optimisation of the project. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

Shape

naive shape (rejected):
upstream-source-of-truth-table  ──FULL OVERWRITE──► downstream stream(s)
  multi-terabyte                  every run             billions of rows / run

CDF shape (adopted):
upstream-source-of-truth-table ─CDF: changed rows─► downstream stream(s)
  multi-terabyte                 since last run        millions of rows / run
                                                       (~99% reduction typical)

The pattern applies whenever:

  • An upstream table is multi-terabyte / billions of rows.
  • Each run produces far fewer changed rows than the table size.
  • The downstream pipelines could compute the right output from "only the changed rows since last run" if they had access to that change set.

When to apply

Apply when all are true:

  • The upstream table is on Delta Lake (or another OTF that exposes a change feed — Iceberg/Hudi equivalents work similarly).
  • The upstream change rate is << the upstream table size (i.e., most rows don't change between runs).
  • Downstream consumers can be re-expressed as functions of "changes since last watermark" rather than full-table rescans.
  • You don't currently have a watermark / CDC mechanism, or you have a brittle one (high-watermark column, doesn't capture deletes, fragile under out-of-order updates).

The Octopus case satisfied all four. The legacy pipeline was re-running the entire multi-terabyte source-of-truth layer on every run.

Steps

  1. Enable CDF on the source-of-truth Delta table. This makes the per-version change events queryable as a structured stream of insert/update/delete rows.
  2. Track the last-processed version per downstream stream. Each stream maintains its own watermark (Delta version number), allowing different streams to consume the same change feed at different cadences.
  3. Re-express downstream logic as "apply changes since version N to version M" rather than "rescan the source-of-truth table". This may require:
       • Idempotent merge logic (to handle re-deliveries from CDF re-reads).
       • Change-type-aware code paths (insert / update / delete each have different semantics for the downstream output).
  4. Bound the row count visible to downstream stages. The whole point is to make per-run row count bounded by change rate, not table size.
  5. Compose with grain-aligned stream split: each stream consumes the change feed at its own grain, with its own watermark.
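
The first three steps above can be sketched against the PySpark API. This is a minimal sketch, not the Octopus implementation: the function names and the table name used in the usage note are hypothetical, `spark` is assumed to be an existing SparkSession with Delta Lake configured, and the latest table version is assumed to be looked up by the caller (for example from the table history).

```python
from typing import Any, Optional, Tuple


def enable_cdf(spark: Any, table: str) -> None:
    """Step 1: enable the change data feed on an existing Delta table."""
    spark.sql(
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def next_version_range(last_processed: int, latest: int) -> Optional[Tuple[int, int]]:
    """Step 2: turn a stream's watermark into the next inclusive version range.

    Returns None when the stream is already caught up, so the caller can skip
    the read instead of asking for a version that does not exist yet.
    """
    start = last_processed + 1
    return (start, latest) if start <= latest else None


def read_changes(spark: Any, table: str, start: int, end: int) -> Any:
    """Step 3: batch-read the row-level changes committed in versions start..end.

    The returned DataFrame has the table's columns plus _change_type,
    _commit_version and _commit_timestamp.
    """
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", start)
        .option("endingVersion", end)
        .table(table)
    )
```

A stream would call `next_version_range` with its stored watermark, merge the result of `read_changes` (for a hypothetical table such as `source_of_truth.readings`) into its own output, and only then persist `end` as its new watermark, so that a failed run retries from the same version.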

Why this beats hand-rolled watermarks

The naive alternative — store a high-watermark column, filter rows above it — has well-known failure modes:

Failure mode            | What goes wrong
Doesn't capture deletes | High-watermark filtering misses deletions; downstream output drifts
Out-of-order updates    | A late-arriving update with an older timestamp is invisible above the watermark
Per-table maintenance   | Every table needs its own watermark column, schema, and care
Schema-fragile          | Watermark columns become load-bearing; schema changes have to preserve them

CDF avoids each of these because the change set is read from Delta's own commit history, which captures every row-level event including deletes, in commit order. The change feed is a property of the table, not a thing the downstream pipeline maintains.
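
The first two failure modes can be reproduced in a few lines of plain Python. This is a toy model under stated assumptions, not Delta itself: tables are plain dicts and lists, the `key`, `value`, and `updated_at` fields and both function names are hypothetical, and only the `_change_type` and `_commit_version` names mirror the metadata columns that the change feed exposes.

```python
from typing import Any, Dict, List


def high_watermark_pull(source_rows: List[Dict[str, Any]], watermark: int) -> List[Dict[str, Any]]:
    """Hand-rolled approach: pick rows whose updated_at is above the watermark."""
    return [r for r in source_rows if r["updated_at"] > watermark]


def apply_change_feed(target: Dict[str, int], changes: List[Dict[str, Any]]) -> Dict[str, int]:
    """Replay change rows in commit order, branching on the change type."""
    for c in sorted(changes, key=lambda c: c["_commit_version"]):
        kind = c["_change_type"]
        if kind in ("insert", "update_postimage"):
            target[c["key"]] = c["value"]   # keyed write: safe to replay
        elif kind == "delete":
            target.pop(c["key"], None)      # deleting twice is a no-op
        # update_preimage rows carry the old value and need no action here
    return target


# Downstream copy as of the last run (watermark: updated_at = 100).
downstream = {"a": 1, "b": 2, "c": 3}

# Since then, upstream deleted "b" and received a late update to "a" whose
# event timestamp (90) is older than the watermark.
source_now = [
    {"key": "a", "value": 10, "updated_at": 90},
    {"key": "c", "value": 3, "updated_at": 50},
]
feed = [
    {"key": "b", "value": 2, "_change_type": "delete", "_commit_version": 5},
    {"key": "a", "value": 1, "_change_type": "update_preimage", "_commit_version": 6},
    {"key": "a", "value": 10, "_change_type": "update_postimage", "_commit_version": 6},
]

missed = high_watermark_pull(source_now, watermark=100)  # sees neither change
replayed = apply_change_feed(dict(downstream), feed)     # sees both
```

In this scenario the watermark filter returns nothing, so the downstream copy keeps the deleted row and the stale value, while the replay yields the correct state. Because every write is keyed, replaying the same batch a second time leaves the result unchanged.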

Operational signature (Octopus disclosure)

Metric               | Before (full overwrite) | After (CDF)               | Change
Rows processed / run | 25 billion              | 300 million               | 98.8% reduction
Data freshness       | Weekly                  | Daily                     | 7× improvement
Pipeline grain       | Monolithic full-table   | Change-driven incremental |
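
The reduction figure in the table above follows directly from the two row counts. A small check, where the function name is hypothetical and the inputs are the disclosed numbers:

```python
def reduction_fraction(table_rows: int, changed_rows: int) -> float:
    """Share of the full-rescan row volume that the incremental read avoids."""
    return (table_rows - changed_rows) / table_rows


rows_before = 25_000_000_000  # rows processed per run with full overwrite
rows_after = 300_000_000      # rows processed per run with CDF

reduction = reduction_fraction(rows_before, rows_after)
print(f"{reduction:.1%}")  # 98.8%
```

The same function returns 0 for a table that changes completely on every run, which is the case where the pattern brings no benefit.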

The headline $1M / yr annualised cost-avoidance figure excludes the upstream-incremental savings, so the full efficiency gain is larger than the headline number. In the source's words: "Note: the $1M in annualised savings figures cited below exclude the additional savings from this move to incremental processing on upstream tables. The full efficiency gain is larger."

Trade-offs

  • Idempotency requirement. Downstream merge / upsert logic must be idempotent because CDF can deliver duplicate change events in recovery scenarios.
  • Change-rate dependency. The pattern's value is proportional to (table size − change set size) / table size. Tables that change fully every run see no benefit.
  • Re-bootstrap cost. Adding CDF to an existing pipeline means a one-time full-rescan to establish the watermark; subsequent runs benefit.
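  • Retention horizon. Change events are not kept forever: CDF data follows the table's retention policy and is removed by VACUUM, so the readable history is bounded by the configured retention settings (by default, 7 days for removed data files and 30 days for the transaction log). A consumer that falls behind that window has to re-bootstrap.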
  • Doesn't help inside a stream. This pattern reduces the upstream read volume; it doesn't change the per-stream optimisation profile (broadcast joins, liquid clustering, AQE) applied within each stream.

Related

  • patterns/grain-aligned-stream-split: the pattern this optimisation runs under. CDF on the shared multi-grain source-of-truth layer is what prevents the substrate from re-introducing the monolithic volume tax.
  • patterns/job-of-jobs-orchestration: different streams consume the change feed at different cadences (HH settlement may read every 30 min; monthly may read once a day); the orchestrator schedules each consumption appropriately.
  • CDF as Bronze→Silver promotion: the other canonical CDF shape on the wiki, from the Claroty source. That use case applies CDF inside a medallion promotion pipeline (append-only Bronze + mapping registry → Silver canonical schema). This pattern is the multi-terabyte upstream-substrate variant: same primitive, different layer.

Seen in
