CONCEPT Cited by 1 source
Data-quality checksum comparison¶
Definition¶
Data-quality checksum comparison is the canonical correctness primitive for verifying that two parallel sources of the same logical data agree: compute a row count and a checksum over each partition on each side, then compare the two (row count, checksum) pairs. If both numbers match, the partitions hold the same rows (barring a checksum collision). If either differs, the partitions disagree.
"There is no difference between the data delivered by the old system and the new system. We verify this by comparing both the row count and the checksum of the data, ensuring complete consistency between the two systems." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale
Why both row count and checksum¶
Each detects a class of error the other doesn't:
- Row count catches "this side is missing rows" and "this side has extra rows" — failures of the shape "the schemas match but the population doesn't."
- Checksum catches "this side has the same number of rows but the rows are different" — failures of the shape "the population matches but the values don't."
Either alone is insufficient: a row-count check still passes when a row is updated to wrong values, and a checksum is only trustworthy when both sides serialize rows identically. If schema or column order differs, each side may be canonical by its own ordering, and the checksums differ even though the data agrees.
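The definition and the two failure classes above can be sketched in a few lines of Python. This is a minimal illustration, not the method described in the source: the names `summarize`, `row_digest`, and `partitions_agree` are hypothetical, and the checksum (a sum of truncated per-row SHA-256 digests, modulo 2^64) is an assumption chosen because it is independent of row order. The source does not say which checksum function is used.

```python
import hashlib
from typing import Iterable, NamedTuple, Sequence

MASK64 = (1 << 64) - 1


class PartitionSummary(NamedTuple):
    row_count: int
    checksum: int


def row_digest(row: Sequence[object]) -> int:
    """Hash one row to a 64-bit integer, using the row's column order as given."""
    # Assumption: repr() joined by a separator stands in for a real
    # canonical serialization of typed column values.
    encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(encoded).digest()[:8], "big")


def summarize(rows: Iterable[Sequence[object]]) -> PartitionSummary:
    """One pass over a partition: count rows and add up per-row digests."""
    count = 0
    checksum = 0
    for row in rows:
        count += 1
        # Addition is commutative, so row order does not affect the result.
        checksum = (checksum + row_digest(row)) & MASK64
    return PartitionSummary(count, checksum)


def partitions_agree(side_a, side_b) -> bool:
    """Two partitions agree only if both row count and checksum match."""
    return summarize(side_a) == summarize(side_b)


# Hypothetical partition contents for illustration only.
old = [(1, "alice", 10), (2, "bob", 20)]
new_reordered = [(2, "bob", 20), (1, "alice", 10)]     # same rows, other order
new_missing = [(1, "alice", 10)]                        # row count differs
new_wrong_value = [(1, "alice", 10), (2, "bob", 99)]    # count equal, checksum differs
new_swapped_cols = [("alice", 1, 10), ("bob", 2, 20)]   # same data, other column order
```

With these inputs, `new_reordered` agrees with `old`, `new_missing` is caught by the row count, and `new_wrong_value` is caught only by the checksum. `new_swapped_cols` also fails the checksum although the data is logically the same, which is the column-order hazard noted above.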
Why this primitive scales¶
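The comparison cost is O(partitions), not O(rows). Each side computes its row count and checksum in one pass over its own data, and the comparison itself touches only two small numbers per partition, so no rows are shipped between systems or matched against each other. For a pipeline ingesting petabytes per day, comparing summaries over hourly partitions stays cheap, whereas a row-by-row comparison would mean reading and matching every row on both sides. This cheap-to-compute, cheap-to-store property is what makes continuous comparison between two parallel pipelines feasible at hyperscale.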
When a comparison fails, that's the operator's signal to zoom in on that specific partition with row-level analysis — e.g. via the hourly example-row-query primitive that finds the offending rows.
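The two-stage flow described here, a cheap pass over per-partition summaries followed by row-level analysis only where they disagree, might look like the following sketch. The function names, the partition-key format, and the summary values are all hypothetical; the row-level step uses a simple multiset difference as a stand-in for the example-row query, whose actual implementation the source does not describe.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple

Summary = Tuple[int, int]  # (row_count, checksum), precomputed per partition


def mismatched_partitions(old: Dict[str, Summary],
                          new: Dict[str, Summary]) -> List[str]:
    """Return keys whose summaries differ or that exist on one side only."""
    keys = sorted(set(old) | set(new))
    return [key for key in keys if old.get(key) != new.get(key)]


def example_offending_rows(old_rows: Sequence[Hashable],
                           new_rows: Sequence[Hashable],
                           limit: int = 5):
    """Row-level follow-up, meant to run only on flagged partitions."""
    old_counts = Counter(old_rows)
    new_counts = Counter(new_rows)
    # Counter subtraction keeps only rows with a positive surplus on one side.
    only_old = list((old_counts - new_counts).elements())[:limit]
    only_new = list((new_counts - old_counts).elements())[:limit]
    return only_old, only_new


# Hypothetical summaries keyed by hourly partition; values are made up.
old_summaries = {"2026-05-12T00": (2, 111), "2026-05-12T01": (3, 222)}
new_summaries = {"2026-05-12T00": (2, 111), "2026-05-12T01": (3, 999)}

flagged = mismatched_partitions(old_summaries, new_summaries)
```

Only the partitions in `flagged` need their rows read, which keeps the expensive step proportional to the number of disagreements rather than to the total data volume.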
Operational cadence (Meta example)¶
Meta's migration runs the comparison for each landed shadow-table partition against the corresponding production-table partition (Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale). Mismatches are logged to Scuba; hourly, the data-quality analysis tool reads the mismatches and runs targeted queries to find example offending rows; debugging information is logged back to Scuba.
This is the detection layer for CDC bad-data propagation; once a mismatch is detected, the partition-quality-marking mechanism handles containment.
Distinguishing from related shapes¶
- vs point-in-time consistency check: that check runs once and compares two snapshots; checksum comparison runs continuously, per partition, as data lands.
- vs bit-for-bit replication verification: that requires both sides to be byte-identical, including row ordering; checksum comparison only requires that the rows agree once each partition is put in a canonical order.
- vs sample-based comparison: sampling compares N rows from each side and can miss rows outside the sample; checksum comparison covers 100% of rows at bounded cost via the aggregate.
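The last two contrasts can be made concrete with a small sketch. Everything here is illustrative: `bytewise_digest`, `aggregate_summary`, and `sampled_equal` are hypothetical names, the order-independent checksum is an assumed construction (sum of truncated SHA-256 digests), and the sampling is a simplified positional sample rather than a keyed one.

```python
import hashlib
from typing import Sequence, Tuple

Row = Tuple[int, int]
MASK64 = (1 << 64) - 1


def _encode(row: Row) -> bytes:
    # Assumption: repr() stands in for a real canonical row serialization.
    return repr(row).encode("utf-8")


def bytewise_digest(rows: Sequence[Row]) -> str:
    """Bit-for-bit style check: hash the rows in their stored order."""
    hasher = hashlib.sha256()
    for row in rows:
        hasher.update(_encode(row))
        hasher.update(b"\n")
    return hasher.hexdigest()


def aggregate_summary(rows: Sequence[Row]) -> Tuple[int, int]:
    """Row count plus an order-independent checksum over every row."""
    total = 0
    for row in rows:
        digest = hashlib.sha256(_encode(row)).digest()[:8]
        total = (total + int.from_bytes(digest, "big")) & MASK64
    return len(rows), total


def sampled_equal(a: Sequence[Row], b: Sequence[Row], step: int) -> bool:
    """Sample-based check: compare every step-th row only."""
    return a[::step] == b[::step]


old = [(i, i * 10) for i in range(1000)]
reordered = list(reversed(old))   # same rows, different physical order
corrupted = list(old)
corrupted[501] = (501, -1)        # one bad value, outside the sample
```

Here `bytewise_digest` reports a difference for `reordered` even though the rows are the same, while `aggregate_summary` treats the two as equal. For `corrupted`, a sample taken every 100 rows sees nothing wrong, while the aggregate over all rows differs.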
Seen in¶
- sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale — Meta's data-ingestion-system migration; canonical wiki instance combined with Scuba-as-warm-store + hourly-aggregate-tool.
Related¶
- concepts/migration-job-lifecycle — the gating system this primitive feeds
- concepts/cdc-bad-data-propagation — the hazard this primitive detects
- concepts/partition-quality-marking — the containment metadata triggered by detection
- patterns/data-quality-analysis-tool-with-edge-case-logging — the operational pattern wrapping this
- patterns/shadow-then-reverse-shadow-migration — the migration shape this lives inside
- systems/meta-data-ingestion-system — canonical wiki instance
- systems/scuba-meta — the warm-store substrate
- companies/meta — company hub