CONCEPT Cited by 1 source
Checksum-validated data migration¶
Definition¶
Checksum-validated data migration is the discipline of computing a deterministic checksum of source data before a migration step (split, copy, archive, re-shard, transform), recomputing the same checksum over the destination data after the step, and refusing to mark the migration COMPLETED unless the two checksums match. The pattern is a correctness gate on data movement: a checksum mismatch is the system's last line of defence against silent data loss / duplication / corruption introduced by the migration itself.
The pattern was canonicalised on the wiki by Netflix's TimeSeries Abstraction in the 2026-06-03 dynamic partition splitting disclosure (Source: sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads) — applied to per-partition splits validated online during the splitting pipeline, with offline Spark verification as a defence-in-depth secondary check via Data Bridge.
Mechanism (Netflix TimeSeries split instantiation)¶
Planning phase: read entire wide partition once → compute pre_split_checksum → store in wide_row metadata
Splitting phase: write split partitions to new table → compute post_split_checksum
Validation: if pre_split_checksum == post_split_checksum → flip status to COMPLETED
else → status is not marked COMPLETED and reads keep going to the original partition (the source does not spell out retry behaviour)
Verbatim: "The Planner stores a pre-split checksum of a given partition during the planning phase, while the Splitter computes and stores the post-split checksum. The split status is marked as completed only if the two checksums match."
The checksum gate is mandatory: only once a split reaches COMPLETED status do TimeSeries servers begin loading the split's partition keys into the Bloom-filter gate. Until then, reads continue to flow to the original (non-split) partition.
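The plan / split / validate flow above can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's implementation: the source does not name the checksum algorithm, so the order-independent SHA-256 construction, the `SplitRecord` type, the `SplitStatus` values other than COMPLETED, and the modulo-based `split` function are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SplitStatus(Enum):
    IN_FLIGHT = "IN_FLIGHT"    # hypothetical name for "not yet completed"
    COMPLETED = "COMPLETED"


def partition_checksum(rows):
    """Order-independent checksum over rows (assumed algorithm, not from the source).

    Each row is hashed on its own, the digests are sorted, and the sorted
    digests are hashed again, so row order does not affect the result but a
    lost or duplicated row does.
    """
    digests = sorted(hashlib.sha256(repr(row).encode("utf-8")).digest() for row in rows)
    outer = hashlib.sha256()
    for digest in digests:
        outer.update(digest)
    return outer.hexdigest()


@dataclass
class SplitRecord:
    pre_split_checksum: str
    post_split_checksum: Optional[str] = None
    status: SplitStatus = SplitStatus.IN_FLIGHT


def plan(wide_rows):
    """Planning phase: read the wide partition once and store the pre-split checksum."""
    return SplitRecord(pre_split_checksum=partition_checksum(wide_rows))


def split(wide_rows, bucket_count):
    """Hypothetical splitter: bucket rows by their first field modulo bucket_count."""
    buckets = [[] for _ in range(bucket_count)]
    for row in wide_rows:
        buckets[row[0] % bucket_count].append(row)
    return buckets


def validate(record, split_partitions):
    """Validation: mark COMPLETED only if the post-split checksum matches."""
    all_rows = [row for part in split_partitions for row in part]
    record.post_split_checksum = partition_checksum(all_rows)
    if record.post_split_checksum == record.pre_split_checksum:
        record.status = SplitStatus.COMPLETED
    return record.status
```

In this sketch a dropped or duplicated row leaves the record in `IN_FLIGHT`, which mirrors the behaviour described above: readers never see a split that failed the gate.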
Why this works for this pipeline¶
Immutability is the load-bearing precondition. When the source partition is provably not receiving new writes:
- Pre-checksum is stable between read and post-checksum computation.
- A mismatch cannot be explained by concurrent writes — it must be a real correctness bug (lost row, duplicated row, corrupted data, splitter logic error).
For mutable partitions, the same pattern requires either:
- A point-in-time snapshot at the start of migration (read amplification + storage cost), OR
- A dual-write / replay mechanism that captures writes during the migration window and replays them on the target before the post-checksum is computed (operational complexity).
This is one reason the Netflix team deferred mutable-partition splits as future work: the checksum gate would not be meaningful without one of those compensations.
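A toy Python sketch of why immutability matters for the gate. The function names and the in-memory list standing in for a partition are hypothetical. The point is only that a write landing after the pre-checksum causes a mismatch even when the copy itself is correct, and that checksumming a point-in-time snapshot avoids this.

```python
import copy
import hashlib


def checksum(rows):
    """Deterministic checksum over rows, independent of row order."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()


def migrate_live(source, concurrent_write=None):
    """Checksum the live source, then copy it, then compare.

    If a write lands between the pre-checksum and the copy, the gate fails
    although the copy step has no bug.
    """
    pre = checksum(source)
    if concurrent_write is not None:
        source.append(concurrent_write)   # simulated write during migration
    target = list(source)
    return pre == checksum(target)


def migrate_from_snapshot(source, concurrent_write=None):
    """Checksum and copy a point-in-time snapshot instead of the live source."""
    snapshot = copy.deepcopy(source)
    pre = checksum(snapshot)
    if concurrent_write is not None:
        source.append(concurrent_write)   # does not affect the snapshot
    target = list(snapshot)
    return pre == checksum(target)
```

With the snapshot variant the gate passes, but the late write is not in the target; it still has to be carried over separately, which is the replay cost the list above refers to.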
Defence-in-depth: Spark via Data Bridge¶
The post discloses a secondary validation layer running offline:
"Using our existing Data Bridge pipelines to verify splits offline … Spark job to ensure that the split data is an exact match to the original data."
Data Bridge is Netflix's data-movement substrate (see Data Bridge: How Netflix Simplifies Data Movement, 2026, separately referenced but not summarised in this post). The Spark job runs after the online checksum gate has already passed and re-verifies row-by-row equality. Two independent checks at different altitudes:
| Check | Substrate | Latency | Catches |
|---|---|---|---|
| Online pre/post checksum | TimeSeries split worker | Synchronous (seconds–minutes) | Splitter logic errors, lost/duplicated rows |
| Offline Spark verify | Data Bridge / Spark job | Hours | Mismatches that a checksum collision could mask; data corruption surfacing during steady-state reads after the split |
The composition is canonical multi-layer validation: the online check gates the rollout; the offline check provides eventual auditing.
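The offline layer's "exact match" check is conceptually a multiset comparison of rows. The source does not show the Spark job, so the sketch below uses plain Python with `collections.Counter` as a stand-in; `diff_rows` and `is_exact_match` are hypothetical names, and rows are assumed to be hashable tuples.

```python
from collections import Counter


def diff_rows(original_rows, split_rows):
    """Return (missing, unexpected) as multisets of rows.

    missing:    rows present in the original but absent (or under-counted) in the split
    unexpected: rows present in the split but absent (or over-counted) in the original
    """
    original = Counter(original_rows)
    split = Counter(split_rows)
    # Counter subtraction keeps only positive counts.
    return original - split, split - original


def is_exact_match(original_rows, split_rows):
    """True when both sides hold the same rows with the same multiplicities."""
    missing, unexpected = diff_rows(original_rows, split_rows)
    return not missing and not unexpected
```

Unlike the checksum gate, a comparison of this kind reports which rows differ, which is why it suits auditing even though it is too slow to sit on the rollout path.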
Trade-offs¶
| Pro | Con |
|---|---|
| Cheap defence against silent data loss / duplication | The pre-checksum requires a full read of the source and the post-checksum a full read of the target (storage I/O cost) |
| Catches splitter logic bugs before they affect reads | Mutable sources require additional machinery (snapshot or dual-write) |
| Composable with offline secondary validation | Hash collisions theoretically possible (mitigated by offline Spark check) |
| Safe rollback: failed checksum ⇒ partition stays unsplit ⇒ original is read | Migration latency includes a full pre-read, the write, and a full post-read, so it grows with partition size |
| Status table doubles as audit log | Checksum algorithm must be consistent across pre and post (subtle bugs if libraries diverge) |
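The last row (algorithm consistency across pre and post) is easy to get wrong in practice. The sketch below is illustrative only: it assumes rows are JSON-serialisable dicts and contrasts a naive digest, which depends on key insertion order, with a canonical encoding that does not. Neither function comes from the source.

```python
import hashlib
import json


def naive_row_digest(row):
    """Digest of str(row); sensitive to dict key insertion order."""
    return hashlib.sha256(str(row).encode("utf-8")).hexdigest()


def canonical_row_digest(row):
    """Digest of a canonical JSON encoding: sorted keys, fixed separators, ASCII output."""
    encoded = json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

If the planner and the splitter build the same logical row with fields in a different order, the naive digest produces a false mismatch while the canonical one does not.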
Sibling patterns¶
- concepts/data-quality-checksum-comparison — checksum comparison applied at the data-quality / pipeline-output level (not migration). Same primitive, different consumer.
- concepts/data-integrity-checker — broader integrity-check primitive.
- patterns/shadow-mode-bytes-comparison — Netflix TimeSeries' Comparison phase compares bytes served by old and new read paths during phased rollout. This sits one altitude up: it validates the read path itself, whereas checksum-validated migration validates the data movement. Both must be green for the rollout to advance.
- Cryptographic-hash-as-content-address (CAS-style) — taken to the limit, the checksum is the address. Different domain (immutable storage), same primitive.
Caveats not disclosed in the source¶
- Checksum algorithm not specified. The post does not name the hash function, sample size, or whether the checksum covers row data only or also tombstones / timestamps / TTLs.
- Hash-collision rate not characterised. Implicit in "only if the two checksums match" is the assumption that the hash space is large enough that collisions don't matter — defence-in-depth via Spark verification is the explicit hedge against this.
- Recovery semantics on mismatch not spelled out. The post says checksum mismatch keeps the split incomplete, but does not detail whether the splitter retries from the planner, blackholes the partition, alerts an operator, or some combination.
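The collision caveat above can be made slightly more concrete under an idealised assumption that the source does not state: that the checksum behaves like a uniformly random function with a $b$-bit output. The symbols $b$ (digest width) and $N$ (number of splits whose data actually differs from the original) are illustrative, not values from the post.

```latex
\Pr[\text{one corrupted split passes the gate}] \approx 2^{-b},
\qquad
\Pr[\text{at least one of } N \text{ corrupted splits passes}] \le N \cdot 2^{-b}

% Worked example with assumed values b = 64 and N = 10^9:
N \cdot 2^{-b} = \frac{10^{9}}{2^{64}} \approx 5.4 \times 10^{-11}
```

Simple non-cryptographic checksums (for example additive or XOR-based ones) do not satisfy the uniformity assumption, so structured errors can cancel out far more often than this bound suggests. That is the gap the offline row-by-row check covers.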
Seen in¶
- sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads — Canonical wiki disclosure as the correctness gate in dynamic-partition-splitting. Pre-split checksum stored in the wide_row metadata table during the planning phase; post-split checksum computed by the Splitter and matched online; offline secondary verification via Data Bridge Spark jobs that ensure split data is an exact row-by-row match to the original. "Serving incorrect reads would be disastrous. To establish trust beyond checksums, we leveraged additional mechanisms…" The post explicitly frames the checksum as the first in a stack of correctness checks, not the only one.
Related¶
- concepts/dynamic-partition-splitting — the canonical consumer of this validation primitive.
- concepts/data-quality-checksum-comparison · concepts/data-integrity-checker — sibling validation concepts at different altitudes.
- concepts/immutable-partition — load-bearing precondition that makes pre/post checksum meaningful.
- patterns/dynamic-partition-split-async-pipeline — the pipeline this gate sits inside.
- patterns/shadow-mode-bytes-comparison — sibling read-path validation.
- systems/netflix-data-bridge — the offline-validation substrate.
- systems/apache-spark — the offline-validation engine.