PATTERN Cited by 1 source
Monthly plus daily dual-cadence integrity check¶
Summary¶
Run two integrity-check loops at different cadences against the same pipeline output:
- A low-frequency, authoritative check against the real source of truth (e.g. monthly against a billing system that closes monthly).
- A high-frequency, proxy check against a proxy source of truth (e.g. daily against the staging pipeline's own output).
The authoritative check catches what really matters; the proxy check catches fast-iterating implementation bugs in the interim.
When to use¶
Two conditions must hold:
- The real source of truth updates at a low cadence — e.g. a billing system that closes monthly, a financial ERP that reconciles quarterly, a physical-inventory count that runs annually.
- You cannot afford the real-source cadence as your iteration cycle — waiting a month to know if a pipeline fix works makes development impossibly slow.
Structure¶
Pipeline output
│
├──→ Daily check ──→ staging-pipeline result
│ (lightweight metrics; (same-day feedback
│ bug-finding) loop)
│
└──→ Monthly check ──→ billing-system close
(full metric suite; (authoritative answer;
99.99% threshold) month-end cadence)
Why you need both¶
- Only monthly: catches real errors but feedback loop is 30 days. A bug introduced on day 1 contaminates a month of output before the next check.
- Only daily: can iterate fast but you're not comparing against the real truth, so "daily passes" doesn't guarantee "monthly will pass."
- Both: daily catches implementation regressions within 24 hours; monthly confirms the real business correctness.
Yelp's canonical implementation (2025-05-27)¶
Daily check¶
- Substrate: Staging Pipeline output on AWS Glue tables.
- Query path: Redshift Spectrum (bypasses the ~10-hour Redshift Connector latency).
- Metric set: lightweight sanity checks against pipeline- internal invariants:
---- Metrics ----
number of contracts with negative revenue: 0
number of contracts passed or expired: 28
number of contracts with unknown program: 0
number of contracts missing parent category: 66
- Action on failure: "If any discrepancies were discovered, we would swiftly determine the cause, correct it, and rerun the pipeline for the day."
Monthly check¶
- Substrate: Revenue Recognition output in Redshift tables.
- Comparison: pipeline-produced contract revenue vs billing-system billed revenue.
- Metric set: full suite, four named metric classes — contracts matching both gross + net, mismatched discount, mismatched gross, orphans in either direction.
- Threshold: 99.99% contract-invoice match rate required to declare system reliability.
- Sample output shape (from the post):
---- Metrics ----
Number of contracts reviewed: 100000
Gross Revenue: xxxx
Net Revenue: xxxx
Number of contracts matching both gross & net: 99799
Number of contracts with mismatch discount: 24
Number of contracts with mismatch gross revenue: 6
Number of contracts with no equivalent invoice: 1
Number of invoice with no equivalent contract: 10
Line match %: 99.997
Net revenue mismatch difference: 1883.12
Why Yelp picked these exact cadences¶
From the post: "The Revenue Data Pipeline is designed to send data for revenue recognition on a daily basis and the ideal data source of truth is the revenue data produced each month by Yelp's billing system. To resolve this, we decided to develop both monthly and daily checks."
Trade-offs¶
- Metric-set divergence is intentional. Daily checks don't replicate the monthly metric suite — they'd be meaningless against the proxy. Daily is pipeline-internal sanity; monthly is business truth.
- Two SQL surfaces to maintain. Both check suites evolve iteratively as new edge cases surface.
- Threshold calibration. 99.99% at Yelp's scale is tolerable financially; a smaller business might accept 99.9% at lower cost.
- Doesn't help if the proxy and truth diverge. If the staging pipeline has the same bug as production, daily checks look green even when monthly will fail. Mitigated by keeping staging + production code diff small.
Related patterns¶
- concepts/weekly-integrity-reconciliation — the generic cadence-complementarity pattern (fast path + slow audit path). This pattern is the data-pipeline specialisation.
- patterns/parallel-staging-pipeline-for-prod-verification — the substrate enabling the daily cadence.
Seen in¶
- sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — canonical instance. Yelp Revenue Data Pipeline runs both cadences; concrete metric sets + thresholds disclosed. Pattern name canonicalised here.