CONCEPT Cited by 1 source

Data-quality checksum comparison

Definition

Data-quality checksum comparison is the canonical correctness primitive for verifying that two parallel sources of the same logical data agree: compute a row count and a checksum over each partition on each side, then compare the two (row count, checksum) pairs. If both numbers match, the partitions hold the same rows (barring a checksum collision). If either differs, the partitions disagree.
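A minimal sketch of the primitive in Python, under stated assumptions: rows are tuples of primitive values, and the checksum is the sum of per-row SHA-256 digests modulo 2^64, which makes it independent of row order. The source does not say which checksum function Meta uses, and the names `row_digest`, `partition_fingerprint` and `partitions_agree` are hypothetical.

```python
import hashlib

MASK = (1 << 64) - 1  # keep the aggregate checksum within 64 bits


def row_digest(row):
    """Hash one row to a 64-bit integer via a simple canonical text encoding."""
    encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(encoded).digest()[:8], "big")


def partition_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of rows.

    Summing per-row digests is order-independent and, unlike XOR,
    still changes when a row is duplicated.
    """
    count = 0
    checksum = 0
    for row in rows:
        count += 1
        checksum = (checksum + row_digest(row)) & MASK
    return count, checksum


def partitions_agree(old_rows, new_rows):
    """True when both the row count and the checksum match."""
    return partition_fingerprint(old_rows) == partition_fingerprint(new_rows)


old = [(1, "alice", 10.5), (2, "bob", 7.0)]
new = [(2, "bob", 7.0), (1, "alice", 10.5)]  # same rows, different order
print(partitions_agree(old, new))  # True
```

In a real pipeline the two fingerprints would typically be computed by each system's own query engine; this sketch only shows the shape of the comparison.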

"There is no difference between the data delivered by the old system and the new system. We verify this by comparing both the row count and the checksum of the data, ensuring complete consistency between the two systems." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

Why both row count and checksum

Each detects a class of error the other doesn't:

  • Row count catches "this side is missing rows" and "this side has extra rows" — failures of the shape "the schemas match but the population doesn't."
  • Checksum catches "this side has the same number of rows but the rows are different" — failures of the shape "the population matches but the values don't."

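Either alone is insufficient. A row-count check still passes when a row is updated to wrong values. A checksum check is only reliable when both sides serialize rows the same way: if the schema or column order differs, identical data can produce different checksums, so both sides must agree on one canonical form before hashing.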
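The failure classes above can be illustrated with a small Python sketch. The column list `COLUMNS`, the sample rows, and the helpers `fingerprint` and `canonicalize` are all hypothetical, and the checksum (a sum of truncated SHA-256 digests) is an assumed stand-in for whatever function a real system uses.

```python
import hashlib

COLUMNS = ("id", "name", "balance")  # assumed canonical column order


def fingerprint(rows):
    """Return (row_count, checksum) over an iterable of row tuples."""
    count, checksum = 0, 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


def canonicalize(record):
    """Project a dict record onto the agreed column order."""
    return tuple(record[column] for column in COLUMNS)


production = [(1, "alice", 100), (2, "bob", 200)]
corrupted = [(1, "alice", 100), (2, "bob", 999)]  # same population, wrong value
truncated = [(1, "alice", 100)]                   # one row missing

# Row count alone misses the corrupted value; the checksum catches it.
print(fingerprint(production)[0] == fingerprint(corrupted)[0])  # True
print(fingerprint(production)[1] == fingerprint(corrupted)[1])  # False

# Row count catches the missing row directly.
print(fingerprint(production)[0] == fingerprint(truncated)[0])  # False

# Identical data emitted with a different column order hashes differently...
old_side = [{"id": 1, "name": "alice", "balance": 100}]
new_side = [{"name": "alice", "balance": 100, "id": 1}]
raw_old = fingerprint(tuple(r.values()) for r in old_side)
raw_new = fingerprint(tuple(r.values()) for r in new_side)
print(raw_old == raw_new)  # False: a spurious mismatch

# ...until both sides are projected onto the same canonical column order.
canon_old = fingerprint(canonicalize(r) for r in old_side)
canon_new = fingerprint(canonicalize(r) for r in new_side)
print(canon_old == canon_new)  # True
```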

Why this primitive scales

The comparison costs O(partitions), not O(rows): each partition is reduced to two numbers, and only those numbers are compared. A petabyte-per-day ingestion pipeline can run row-count and checksum comparisons over its hourly partitions in seconds, whereas comparing the actual rows would take hours to days. This cheap-to-compute, cheap-to-store property is what makes continuous comparison between two parallel pipelines feasible at hyperscale.
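A sketch of why the comparison step is cheap, assuming each system has already published a (row count, checksum) pair per partition. The manifest layout, the partition keys, and the function `compare_manifests` are hypothetical; the work done is one dictionary lookup per partition, regardless of how many rows each partition holds.

```python
def compare_manifests(old, new):
    """Compare two dicts mapping partition key -> (row_count, checksum).

    Returns a list of (partition_key, reason) for every disagreement.
    """
    mismatches = []
    for key in sorted(old.keys() | new.keys()):
        old_pair, new_pair = old.get(key), new.get(key)
        if old_pair is None or new_pair is None:
            reason = "missing_partition"
        elif old_pair[0] != new_pair[0]:
            reason = "row_count"
        elif old_pair[1] != new_pair[1]:
            reason = "checksum"
        else:
            continue  # both numbers match, partition agrees
        mismatches.append((key, reason))
    return mismatches


# Illustrative values only.
old_manifest = {"hr=00": (1000, 0xA1), "hr=01": (2000, 0xB2),
                "hr=02": (1500, 0xC3), "hr=03": (1200, 0xD4)}
new_manifest = {"hr=00": (1000, 0xA1), "hr=01": (1999, 0xB9),
                "hr=02": (1500, 0xFF)}

print(compare_manifests(old_manifest, new_manifest))
# [('hr=01', 'row_count'), ('hr=02', 'checksum'), ('hr=03', 'missing_partition')]
```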

When a comparison fails, that's the operator's signal to zoom in on that specific partition with row-level analysis — e.g. via the hourly example-row-query primitive that finds the offending rows.
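Row-level analysis of a single flagged partition might look like the following sketch. It assumes the partition's rows fit in memory as hashable tuples, which a production tool would more likely replace with a targeted query; the function name `example_offending_rows` and its `limit` parameter are hypothetical.

```python
from collections import Counter
from itertools import islice


def example_offending_rows(old_rows, new_rows, limit=5):
    """Return up to `limit` example rows present on only one side.

    Counter subtraction is a multiset difference, so a duplicated row
    on one side is reported as well.
    """
    old_counts, new_counts = Counter(old_rows), Counter(new_rows)
    only_old = (old_counts - new_counts).elements()
    only_new = (new_counts - old_counts).elements()
    return {
        "only_in_old": list(islice(only_old, limit)),
        "only_in_new": list(islice(only_new, limit)),
    }


old_partition = [(1, "a"), (2, "b"), (3, "c")]
new_partition = [(1, "a"), (2, "x"), (3, "c")]
print(example_offending_rows(old_partition, new_partition))
# {'only_in_old': [(2, 'b')], 'only_in_new': [(2, 'x')]}
```

Capping the output at a few example rows keeps the debugging record small even when an entire partition is wrong.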

Operational cadence (Meta example)

Meta's migration runs the comparison for each landed shadow-table partition against the corresponding production-table partition (Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale). Mismatches are logged to Scuba; hourly, the data-quality analysis tool reads the mismatches and runs targeted queries to find example offending rows; debugging information is logged back to Scuba.

This is the detection layer for CDC bad-data propagation; once a mismatch is detected, the partition-quality-marking mechanism handles containment.

How it differs from related checks:

  • vs point-in-time consistency check: that check runs at one moment, comparing two snapshots; checksum comparison runs continuously, per partition, as data lands.
  • vs bit-for-bit replication verification: that check requires both sides to be byte-identical, including row ordering; checksum comparison works on a canonical ordering of rows within a partition.
  • vs sample-based comparison: that check samples N rows from each side and compares them; checksum comparison covers 100% of rows at bounded cost via the aggregate.
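The contrast with bit-for-bit verification can be illustrated in Python. This is a sketch under assumptions: rows are sortable tuples, sorting stands in for "canonical ordering of rows within a partition", and both function names are hypothetical.

```python
import hashlib


def byte_level_hash(rows):
    """Hash rows in the order they arrive; sensitive to row ordering."""
    hasher = hashlib.sha256()
    for row in rows:
        hasher.update(repr(row).encode("utf-8"))
    return hasher.hexdigest()


def canonical_checksum(rows):
    """Hash rows after sorting them into one canonical order."""
    hasher = hashlib.sha256()
    for row in sorted(rows):
        hasher.update(repr(row).encode("utf-8"))
    return hasher.hexdigest()


old_side = [(1, "alice"), (2, "bob")]
new_side = [(2, "bob"), (1, "alice")]  # same rows, landed in a different order

print(byte_level_hash(old_side) == byte_level_hash(new_side))        # False
print(canonical_checksum(old_side) == canonical_checksum(new_side))  # True
```

Two systems that write the same rows in different physical order fail the byte-level check but pass the canonical one, which is the behavior wanted when comparing independently implemented pipelines.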

Seen in
