PATTERN Cited by 2 sources
CDF incremental replacing full rescan¶
The CDF incremental replacing full rescan pattern uses Delta Lake's Change Data Feed to replace full-table rescans / overwrites of multi-terabyte upstream tables with change-driven incremental reads — "only records that have actually changed since the last run". In the Octopus Energy MHHS rebuild this single move dropped processed rows from 25 billion → 300 million per run (98.8% reduction) and improved freshness from weekly → daily, and was named the single highest-leverage optimisation of the project. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)
Shape¶
naive shape (rejected):
upstream-source-of-truth-table ──FULL OVERWRITE──► downstream stream(s)
multi-terabyte every run billions of rows / run
CDF shape (adopted):
upstream-source-of-truth-table ─CDF: changed rows─► downstream stream(s)
multi-terabyte since last run millions of rows / run
(~99% reduction typical)
The pattern applies whenever:
- An upstream table is multi-terabyte / billions of rows.
- Each run produces far fewer changed rows than the table size.
- The downstream pipelines could compute the right output from "only the changed rows since last run" if they had access to that change set.
When to apply¶
Apply when all are true:
- The upstream table is on Delta Lake (or another OTF that exposes a change feed — Iceberg/Hudi equivalents work similarly).
- The upstream change rate is << the upstream table size (i.e., most rows don't change between runs).
- Downstream consumers can be re-expressed as functions of "changes since last watermark" rather than full-table rescans.
- You don't currently have a watermark / CDC mechanism, or you have a brittle one (high-watermark column, doesn't capture deletes, fragile under out-of-order updates).
The Octopus case satisfied all four. The legacy pipeline was re-running the entire multi-terabyte source-of-truth layer on every run.
Steps¶
- Enable CDF on the source-of-truth Delta table. This makes the per-version change events queryable as a structured stream of insert/update/delete rows.
- Track the last-processed version per downstream stream. Each stream maintains its own watermark (Delta version number), allowing different streams to consume the same change feed at different cadences.
- Re-express downstream logic as "apply changes since version N to version M" rather than "rescan the source-of-truth table". This may require:
- Idempotent merge logic (handle re-deliveries from CDF re-reads).
- Change-type-aware code paths (insert / update / delete each have different semantics for the downstream output).
- Bound the row count visible to downstream stages. The whole point is to make per-run row count bounded by change rate, not table size.
- Composes with grain-aligned stream split — each stream consumes the change feed at its own grain, with its own watermark.
Why this beats hand-rolled watermarks¶
The naive alternative — store a high-watermark column, filter rows above it — has well-known failure modes:
| Failure mode | What goes wrong |
|---|---|
| Doesn't capture deletes | High-watermark filtering misses deletions; downstream output drifts |
| Out-of-order updates | A late-arriving update with an older timestamp is invisible above the watermark |
| Per-table maintenance | Every table needs its own watermark column, schema, and care |
| Schema-fragile | Watermark columns become load-bearing; schema changes have to preserve them |
CDF avoids each of these because the change set is read from Delta's own commit history, which captures every row-level event including deletes, in commit order. The change feed is a property of the table, not a thing the downstream pipeline maintains.
Operational signature (Octopus disclosure)¶
| Metric | Before (full overwrite) | After (CDF) | Change |
|---|---|---|---|
| Rows processed / run | 25 billion | 300 million | 98.8% reduction |
| Data freshness | Weekly | Daily | 7× improvement |
| Pipeline grain | Monolithic full-table | Change-driven incremental | — |
The headline $1M / yr annualised cost-avoidance figure excludes the upstream-incremental savings. The full efficiency gain is materially larger. "Note: the $1M in annualised savings figures cited below exclude the additional savings from this move to incremental processing on upstream tables. The full efficiency gain is larger."
Trade-offs¶
- Idempotency requirement. Downstream merge / upsert logic must be idempotent because CDF can produce duplicate change events on recovery scenarios.
- Change-rate dependency. The pattern's value is proportional to (table size − change set size) / table size. Tables that change fully every run see no benefit.
- Re-bootstrap cost. Adding CDF to an existing pipeline means a one-time full-rescan to establish the watermark; subsequent runs benefit.
- Retention horizon. CDF retains change events only for a configured retention window (typically 30 days by default). A consumer that misses the window has to re-bootstrap.
- Doesn't help inside a stream. This pattern reduces the upstream read volume; it doesn't change the per-stream optimisation profile (broadcast joins, liquid clustering, AQE) applied within each stream.
Composition with related patterns¶
- patterns/grain-aligned-stream-split — the pattern this optimisation runs under. CDF on the shared multi-grain source-of-truth layer is what prevents the substrate from re-introducing the monolithic volume tax.
- patterns/job-of-jobs-orchestration — different streams consume the change feed at different cadences (HH settlement may read every 30 min; monthly may read once a day); the orchestrator schedules each consumption appropriately.
- CDF as Bronze→Silver promotion — the other canonical CDF shape on the wiki, from the Claroty source. That use case applies CDF inside a medallion promotion pipeline (append-only Bronze + mapping registry → Silver canonical schema). This pattern is the multi-terabyte upstream-substrate variant: same primitive, different layer.
Seen in¶
- sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction — canonical disclosure. Octopus Energy's multi-terabyte source-of-truth consumption layer; 25 B → 300 M rows / run; the single highest-leverage optimisation of the rebuild. Pattern applied to the shared substrate that underpins the three-stream pipeline.
- sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog — the "by enabling Change Data Feed (CDF), teams can process trace data incrementally, either in batch or streaming, without repeatedly scanning entire tables. This makes it possible to operationalize observability" face. Same pattern applied to UC OTel Trace Tables; the substrate is observability data instead of margin data, the architectural shape is the same.
Related¶
- Patterns: patterns/grain-aligned-stream-split · patterns/job-of-jobs-orchestration · patterns/broadcast-join-for-small-reference-tables
- Concepts: concepts/delta-change-data-feed · concepts/grain-misalignment · concepts/data-pipeline-grain
- Systems: systems/delta-lake · systems/octopus-margin-data-pipeline
- Companies: companies/octopus-energy · companies/databricks