Yelp Staging Pipeline¶
Definition¶
Yelp Staging Pipeline is the named parallel pipeline configuration that Yelp runs alongside the production pipeline (systems/yelp-revenue-data-pipeline) as a verification sandbox. It consumes the same production data, runs the code under test (new changes before they land in production), and publishes its output to AWS Glue data-catalog tables on S3 rather than to Redshift. Verification queries can therefore run through Redshift Spectrum as soon as a pipeline run completes, instead of waiting out the ~10-hour Redshift Connector latency.
Disclosed by Yelp Engineering's 2025-05-27 Revenue Automation Series post on integration testing.
Architecture¶
                 Production data
                        │
            ┌───────────┴───────────┐
            ▼                       ▼
   Production pipeline      Staging pipeline
      (stable code)         (code under test)
            │                       │
            ▼                       ▼
     Redshift tables      AWS Glue data catalog
(via Redshift Connector)      tables on S3
            │                       │
            │ ~10 hr latency        │ immediate
            ▼                       ▼
   Business reporting       Redshift Spectrum
                           ad-hoc verification
                                    │
                                    ▼
                         Data integrity checkers
                             (daily cadence)
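The fan-out in the diagram can be summarized as two configurations that share an input and differ in code version, sink, and query path. The sketch below is illustrative only: `PipelineConfig`, its field names, and the string values are hypothetical, since the source does not publish Yelp's actual configuration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical fields; chosen only to mirror the boxes in the diagram.
    name: str
    code_version: str   # which build of the pipeline code runs
    input_source: str   # both pipelines read the same production data
    sink: str           # where output is published
    query_path: str     # how results are read back


PRODUCTION = PipelineConfig(
    name="production",
    code_version="stable",
    input_source="production-data",
    sink="redshift-via-connector",
    query_path="redshift",
)

STAGING = PipelineConfig(
    name="staging",
    code_version="code-under-test",
    input_source="production-data",
    sink="glue-catalog-tables-on-s3",
    query_path="redshift-spectrum",
)


def differing_fields(a: PipelineConfig, b: PipelineConfig) -> list[str]:
    """Return the names of the fields on which two configurations differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]
```

Calling `differing_fields(PRODUCTION, STAGING)` lists every field except `input_source`, which is the point of the design: the input is identical, while code, sink, and query path differ.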
Role¶
The staging pipeline is a verification sandbox, not a deployment stage:
- Isolation from production data. The production pipeline and its Redshift tables remain authoritative; staging writes to a separate Glue-catalog-backed dataset.
- Production-data fidelity. Unlike dev-environment testing, staging ingests real production inputs — catching edge cases that dev fixtures don't cover.
- Low-latency query path. Output published to Glue is queryable as soon as the pipeline run completes, versus ~10 hours through the Redshift Connector, which enables a same-day code-test-fix loop.
- Cheap rollback. Staging output is disposable, so reverting a change only means updating the staging pipeline's code; no production data needs repair.
The key quote: "the production pipeline and its data were left untouched until the new changes were verified before updating it."
Why Glue + Redshift Spectrum, not Redshift?¶
The motivating constraint is the ~10-hour Redshift Connector latency (see concepts/redshift-connector-latency). Publishing to Redshift via the connector would make the staging pipeline no faster at surfacing bugs than the production pipeline's verification loop.
The fix: publish the output as S3 files registered in the Glue data catalog (effectively the files plus a metastore entry) and query them directly with Redshift Spectrum. No connector is involved, so there is no 10-hour wait.
Integrity-checker coupling¶
The staging pipeline is the substrate that makes daily integrity checks possible. Per the 2025-05-27 post, the daily integrity SQL runs against staging-pipeline Glue tables via Redshift Spectrum, while the monthly integrity SQL runs against Redshift directly because that is where billing-system data lives. See patterns/monthly-plus-daily-dual-cadence-integrity-check.
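The cadence-to-engine routing described above can be sketched as a small lookup. The check names and SQL strings below are hypothetical placeholders (the post does not publish the real queries); only the mapping of daily checks to Spectrum over staging Glue tables and monthly checks to Redshift comes from the source.

```python
from dataclasses import dataclass
from enum import Enum


class Cadence(Enum):
    DAILY = "daily"
    MONTHLY = "monthly"


# Query target per cadence, following the split described in the post.
TARGET_BY_CADENCE = {
    Cadence.DAILY: "redshift-spectrum:staging-glue-tables",
    Cadence.MONTHLY: "redshift",
}


@dataclass(frozen=True)
class IntegrityCheck:
    name: str         # hypothetical check name
    cadence: Cadence
    sql: str          # hypothetical SQL text


def plan(checks: list[IntegrityCheck]) -> dict[str, list[str]]:
    """Group check names by the engine each one should run against."""
    grouped: dict[str, list[str]] = {}
    for check in checks:
        grouped.setdefault(TARGET_BY_CADENCE[check.cadence], []).append(check.name)
    return grouped


EXAMPLE_CHECKS = [
    IntegrityCheck("no_duplicate_contracts", Cadence.DAILY, "SELECT ..."),
    IntegrityCheck("totals_match_billing", Cadence.MONTHLY, "SELECT ..."),
]
```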
Test-data feedback loop¶
The staging pipeline surfaces edge cases that dev fixtures miss. When it does, Yelp's engineering process is:
- A staging run plus integrity check surfaces a discrepancy against the billing system.
- Engineers diagnose the missing edge case.
- A new synthetic data point is manually created in the dev environment to replicate the edge case.
- Future dev-environment test runs now cover it.
This is Yelp's workaround for concepts/test-data-generation-for-edge-cases until automated test-data generation is built (flagged as a future improvement).
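The last two steps of that loop amount to appending a regression fixture. The sketch below is a toy stand-in, not Yelp's code: the `Contract` type, the proration logic, and the zero-length-term edge case are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    contract_id: str
    amount_cents: int
    term_days: int
    active_days: int


def recognized_cents(c: Contract) -> int:
    """Toy stand-in for pipeline logic: prorate the amount over the active days."""
    if c.term_days == 0:
        # Hypothetical edge case assumed to have been found via staging.
        return 0
    return c.amount_cents * c.active_days // c.term_days


# Dev fixtures paired with expected output. The last entry plays the role of
# the manually created synthetic data point that replicates the edge case.
FIXTURES = [
    (Contract("c-1", 30_000, 30, 30), 30_000),
    (Contract("c-2", 30_000, 30, 10), 10_000),
    (Contract("c-edge", 30_000, 0, 0), 0),
]


def failing_fixtures() -> list[str]:
    """Return the IDs of fixtures whose output does not match the expectation."""
    return [c.contract_id for c, expected in FIXTURES if recognized_cents(c) != expected]
```

Once the synthetic point is in `FIXTURES`, every later dev-environment run exercises the edge case without needing production data.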
Caveats¶
- Not an isolated environment. Staging consumes production data directly, so read access to production data must be controlled.
- Manual comparison. The post describes "easily comparing staging and production results", but the comparison itself is SQL the engineer writes, not an automated diff.
- No SLO published. The post says data is "immediately" available after pipeline completion; concrete latency numbers aren't given.
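To make the manual-comparison caveat concrete, the sketch below shows the kind of logic a staging-versus-production comparison has to express, whether written as SQL by hand or automated. It is not something the post describes; the function name and the keyed-dictionary input shape are assumptions chosen for illustration.

```python
def diff_results(production: dict[str, int], staging: dict[str, int]) -> dict[str, object]:
    """Compare two keyed result sets and report where staging diverges."""
    shared = production.keys() & staging.keys()
    return {
        # Keys present in production but absent from staging output.
        "missing_in_staging": sorted(production.keys() - staging.keys()),
        # Keys staging produced that production did not.
        "extra_in_staging": sorted(staging.keys() - production.keys()),
        # Shared keys whose values differ, as (production, staging) pairs.
        "changed": {
            key: (production[key], staging[key])
            for key in sorted(shared)
            if production[key] != staging[key]
        },
    }
```

An empty report on all three fields would indicate that the code under test reproduces production output for the compared keys; any non-empty field points the engineer at the rows to inspect.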
Seen in¶
- sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — canonical disclosure. Architecture (parallel pipeline with Glue + Spectrum substitute for Redshift), discipline (prod untouched until verified), integrity-checker coupling (daily cadence on staging), test-data feedback loop.
Related¶
- systems/yelp-revenue-data-pipeline — the production pipeline
- systems/yelp-redshift-connector — the connector being bypassed
- systems/aws-glue — staging output catalog
- systems/amazon-redshift-spectrum — verification query path
- systems/aws-s3 — substrate
- companies/yelp
- concepts/staging-pipeline
- concepts/data-integrity-checker
- concepts/redshift-connector-latency
- patterns/parallel-staging-pipeline-for-prod-verification