CONCEPT Cited by 1 source

Data integrity checker¶

Definition¶

A data integrity checker is a routine (typically SQL, sometimes custom code) that compares the output of a data pipeline against an independent source of truth and reports discrepancies — not as a blocking test before a single deploy, but as a recurring monitor that a running pipeline still produces correct output day after day.

The point is to treat correctness as an observable rather than a one-time deploy-gate property. Pipelines that looked correct yesterday may be incorrect today because (a) upstream source data evolved, (b) a new product type introduced an unforeseen edge case, (c) a downstream consumer's contract changed, or (d) a code change skipped the full edge-case coverage.

Canonical shape¶

Pipeline output  ←──────→  Independent source of truth
       │                            │
       └────────────┬───────────────┘
                    │ SQL (or equivalent)
                    ▼
            Set of metrics:
            - match count
            - mismatch count
            - no-match-either-side count
            - duplicate count
                    │
                    ▼
            Threshold check
            (e.g. 99.99% match)
                    │
                    ▼
            Alert / investigate

Four metric classes (Yelp's canonical set)¶

The 2025-05-27 Yelp Revenue Data Pipeline post names four metrics for integrity checking a revenue pipeline against a billing-system source of truth. These generalise to any pipeline-vs-independent-truth comparison:

Exact match rate — "Number of contracts matching invoice (billed transaction) gross revenue." Yelp's threshold: 99.99%. The primary correctness signal.
Partial mismatch — "Number of contracts with mismatched discount invoice." A soft signal — pipeline and truth agree on existence but disagree on some field value. Often the most diagnostic; systematic mismatches point to specific code bugs.
Left-side orphans — "Number of contract lines with no equivalent invoice." Pipeline produced output, truth didn't. Indicates the pipeline is emitting things the billing system doesn't agree exist (or doesn't bill).
Duplicates / staleness — "Number of duplicate or past contract lines." Pipeline-internal consistency: no double counting; no expired contracts leaking through.

Both sides also get a right-side orphan count implicitly ("Number of invoice with no equivalent contract" in Yelp's sample monthly report) — truth has records the pipeline didn't produce.

Cadence tradeoff¶

Integrity checkers face a cadence tradeoff:

Monthly (or lower-frequency) — the real source of truth is often periodic (Yelp's billing system closes monthly; many financial truths are monthly or quarterly). The full check against the authoritative truth can only run at that cadence.
Daily (or higher-frequency) — waiting a month per feedback cycle makes fast iteration impossible. Daily checks use a proxy source of truth (e.g. the staging pipeline itself) to catch implementation bugs quickly.

Yelp runs both — see patterns/monthly-plus-daily-dual-cadence-integrity-check.

Iterative-discovery property¶

Integrity-checker SQL is not written once. From the 2025-05-27 Yelp post: "Determining the right SQL queries was done iteratively, accounting for different edge cases based on the variety of products offered at Yelp." Each new edge case discovered in production drives a new query clause (or a new metric). The SQL itself is a living representation of the team's understanding of what "correct" means for this pipeline.

Caveats¶

Thresholds are domain-specific. 99.99% match is appropriate for revenue where a 0.01% error of $1B revenue is $100k — tolerable. A safety-critical domain needs 100% reconciliation.
Bilateral mismatch requires bilateral investigation. Left orphans and right orphans have different root causes; reporting one without the other hides half the defect class.
Integrity checkers are not tests. Tests have a known answer; integrity checkers discover unknown answers against an independent dataset. The skill is reading the output and diagnosing, not re-running until green.

concepts/weekly-integrity-reconciliation — a lower- frequency audit loop (e.g. fork-sync automation)
concepts/staging-pipeline — the substrate that enables high-cadence checks

Seen in¶

sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — canonical disclosure. Four named metrics with 99.99% match threshold; monthly-check SQL against Redshift billing tables
daily-check SQL via Redshift Spectrum on staging Glue tables; iterative SQL evolution.