CONCEPT Cited by 1 source
Data integrity checker¶
Definition¶
A data integrity checker is a routine (typically SQL, sometimes custom code) that compares the output of a data pipeline against an independent source of truth and reports discrepancies — not as a blocking test before a single deploy, but as a recurring monitor that a running pipeline still produces correct output day after day.
The point is to treat correctness as an observable rather than a one-time deploy-gate property. Pipelines that looked correct yesterday may be incorrect today because (a) upstream source data evolved, (b) a new product type introduced an unforeseen edge case, (c) a downstream consumer's contract changed, or (d) a code change skipped the full edge-case coverage.
Canonical shape¶
Pipeline output ←──────→ Independent source of truth
│ │
└────────────┬───────────────┘
│ SQL (or equivalent)
▼
Set of metrics:
- match count
- mismatch count
- no-match-either-side count
- duplicate count
│
▼
Threshold check
(e.g. 99.99% match)
│
▼
Alert / investigate
Four metric classes (Yelp's canonical set)¶
The 2025-05-27 Yelp Revenue Data Pipeline post names four metrics for integrity checking a revenue pipeline against a billing-system source of truth. These generalise to any pipeline-vs-independent-truth comparison:
- Exact match rate — "Number of contracts matching invoice (billed transaction) gross revenue." Yelp's threshold: 99.99%. The primary correctness signal.
- Partial mismatch — "Number of contracts with mismatched discount invoice." A soft signal — pipeline and truth agree on existence but disagree on some field value. Often the most diagnostic; systematic mismatches point to specific code bugs.
- Left-side orphans — "Number of contract lines with no equivalent invoice." Pipeline produced output, truth didn't. Indicates the pipeline is emitting things the billing system doesn't agree exist (or doesn't bill).
- Duplicates / staleness — "Number of duplicate or past contract lines." Pipeline-internal consistency: no double counting; no expired contracts leaking through.
Both sides also get a right-side orphan count implicitly ("Number of invoice with no equivalent contract" in Yelp's sample monthly report) — truth has records the pipeline didn't produce.
Cadence tradeoff¶
Integrity checkers face a cadence tradeoff:
- Monthly (or lower-frequency) — the real source of truth is often periodic (Yelp's billing system closes monthly; many financial truths are monthly or quarterly). The full check against the authoritative truth can only run at that cadence.
- Daily (or higher-frequency) — waiting a month per feedback cycle makes fast iteration impossible. Daily checks use a proxy source of truth (e.g. the staging pipeline itself) to catch implementation bugs quickly.
Yelp runs both — see patterns/monthly-plus-daily-dual-cadence-integrity-check.
Iterative-discovery property¶
Integrity-checker SQL is not written once. From the 2025-05-27 Yelp post: "Determining the right SQL queries was done iteratively, accounting for different edge cases based on the variety of products offered at Yelp." Each new edge case discovered in production drives a new query clause (or a new metric). The SQL itself is a living representation of the team's understanding of what "correct" means for this pipeline.
Caveats¶
- Thresholds are domain-specific. 99.99% match is appropriate for revenue where a 0.01% error of $1B revenue is $100k — tolerable. A safety-critical domain needs 100% reconciliation.
- Bilateral mismatch requires bilateral investigation. Left orphans and right orphans have different root causes; reporting one without the other hides half the defect class.
- Integrity checkers are not tests. Tests have a known answer; integrity checkers discover unknown answers against an independent dataset. The skill is reading the output and diagnosing, not re-running until green.
Related concepts¶
- concepts/weekly-integrity-reconciliation — a lower- frequency audit loop (e.g. fork-sync automation)
- concepts/staging-pipeline — the substrate that enables high-cadence checks
Seen in¶
- sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — canonical disclosure. Four named metrics with 99.99% match threshold; monthly-check SQL against Redshift billing tables
- daily-check SQL via Redshift Spectrum on staging Glue tables; iterative SQL evolution.