Skip to content

CONCEPT Cited by 1 source

Data integrity checker

Definition

A data integrity checker is a routine (typically SQL, sometimes custom code) that compares the output of a data pipeline against an independent source of truth and reports discrepancies — not as a blocking test before a single deploy, but as a recurring monitor that a running pipeline still produces correct output day after day.

The point is to treat correctness as an observable rather than a one-time deploy-gate property. Pipelines that looked correct yesterday may be incorrect today because (a) upstream source data evolved, (b) a new product type introduced an unforeseen edge case, (c) a downstream consumer's contract changed, or (d) a code change skipped the full edge-case coverage.

Canonical shape

Pipeline output  ←──────→  Independent source of truth
       │                            │
       └────────────┬───────────────┘
                    │ SQL (or equivalent)
            Set of metrics:
            - match count
            - mismatch count
            - no-match-either-side count
            - duplicate count
            Threshold check
            (e.g. 99.99% match)
            Alert / investigate

Four metric classes (Yelp's canonical set)

The 2025-05-27 Yelp Revenue Data Pipeline post names four metrics for integrity checking a revenue pipeline against a billing-system source of truth. These generalise to any pipeline-vs-independent-truth comparison:

  • Exact match rate"Number of contracts matching invoice (billed transaction) gross revenue." Yelp's threshold: 99.99%. The primary correctness signal.
  • Partial mismatch"Number of contracts with mismatched discount invoice." A soft signal — pipeline and truth agree on existence but disagree on some field value. Often the most diagnostic; systematic mismatches point to specific code bugs.
  • Left-side orphans"Number of contract lines with no equivalent invoice." Pipeline produced output, truth didn't. Indicates the pipeline is emitting things the billing system doesn't agree exist (or doesn't bill).
  • Duplicates / staleness"Number of duplicate or past contract lines." Pipeline-internal consistency: no double counting; no expired contracts leaking through.

Both sides also get a right-side orphan count implicitly ("Number of invoice with no equivalent contract" in Yelp's sample monthly report) — truth has records the pipeline didn't produce.

Cadence tradeoff

Integrity checkers face a cadence tradeoff:

  • Monthly (or lower-frequency) — the real source of truth is often periodic (Yelp's billing system closes monthly; many financial truths are monthly or quarterly). The full check against the authoritative truth can only run at that cadence.
  • Daily (or higher-frequency) — waiting a month per feedback cycle makes fast iteration impossible. Daily checks use a proxy source of truth (e.g. the staging pipeline itself) to catch implementation bugs quickly.

Yelp runs both — see patterns/monthly-plus-daily-dual-cadence-integrity-check.

Iterative-discovery property

Integrity-checker SQL is not written once. From the 2025-05-27 Yelp post: "Determining the right SQL queries was done iteratively, accounting for different edge cases based on the variety of products offered at Yelp." Each new edge case discovered in production drives a new query clause (or a new metric). The SQL itself is a living representation of the team's understanding of what "correct" means for this pipeline.

Caveats

  • Thresholds are domain-specific. 99.99% match is appropriate for revenue where a 0.01% error of $1B revenue is $100k — tolerable. A safety-critical domain needs 100% reconciliation.
  • Bilateral mismatch requires bilateral investigation. Left orphans and right orphans have different root causes; reporting one without the other hides half the defect class.
  • Integrity checkers are not tests. Tests have a known answer; integrity checkers discover unknown answers against an independent dataset. The skill is reading the output and diagnosing, not re-running until green.

Seen in

Last updated · 476 distilled / 1,218 read