
Yelp Staging Pipeline

Definition

The Yelp Staging Pipeline is a named parallel pipeline configuration that Yelp runs alongside the production pipeline (systems/yelp-revenue-data-pipeline) as a verification sandbox. It consumes the same production input data, runs the code under test (new changes before they land in production), and publishes its output to AWS Glue data-catalog tables on S3 rather than to Redshift, making verification queries runnable immediately via Redshift Spectrum instead of after the ~10-hour Redshift Connector latency.

Disclosed by Yelp Engineering's 2025-05-27 Revenue Automation Series post on integration testing.

Architecture

            Production data
       ┌───────────┴───────────┐
       ▼                       ▼
Production pipeline       Staging pipeline
(stable code)             (code under test)
       │                       │
       ▼                       ▼
Redshift tables         AWS Glue data catalog
(via Redshift Connector) tables on S3
       │                       │
       │ ~10 hr latency        │ immediate
       ▼                       ▼
Business reporting      Redshift Spectrum
                        ad-hoc verification
                       Data integrity checkers
                       (daily cadence)

Role

The staging pipeline is a verification sandbox, not a deployment stage:

  • Isolation of production outputs. The production pipeline and its Redshift tables remain authoritative; staging writes to a separate Glue-catalog-backed dataset.
  • Production-data fidelity. Unlike dev-environment testing, staging ingests real production inputs — catching edge cases that dev fixtures don't cover.
  • Low-latency query path. Writes to Glue (seconds) vs the Redshift Connector pipeline (~10 hours) — enables a same-day code-test-fix loop.
  • Free rollback. Staging output is throwaway; abandoning a change requires no production cleanup, only an update to the staging pipeline's code.

The key quote: "the production pipeline and its data were left untouched until the new changes were verified before updating it."
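The dual-pipeline publish step described above can be sketched as a routing decision, under stated assumptions: the post does not show Yelp's code, so the function and sink names here (publish, SINKS) are hypothetical. The point illustrated is that both environments consume the same input and only the output destination differs, which is what keeps production tables untouched.

```python
# Hypothetical sketch of the dual-pipeline publish step. Both pipelines
# consume the same production input; only the output sink differs.
SINKS = {
    "production": "redshift",       # via Redshift Connector, ~10 h latency
    "staging": "glue_catalog_s3",   # Parquet on S3 + Glue table, queryable at once
}

def publish(records, environment):
    """Route pipeline output to the sink for this environment.

    The real pipeline would write Parquet to S3 (staging) or load into
    Redshift (production); this toy version just tags the batch.
    """
    sink = SINKS[environment]
    return {"sink": sink, "row_count": len(records)}
```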

Why Glue + Redshift Spectrum, not Redshift?

The motivating constraint is the ~10-hour Redshift Connector latency (see redshift-connector-latency). Publishing to Redshift via the connector would make the staging pipeline no faster at surfacing bugs than the production pipeline's verification loop.

The fix: publish to Glue, which is effectively the S3 files plus a metastore entry, and query with Redshift Spectrum directly. No connector involved; no 10-hour wait.
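A toy model of why that publish path is fast: a Glue table is little more than a schema plus an S3 location recorded in a metastore, so "publishing" is a metadata write, not a bulk load. This in-memory sketch is illustrative only; real code would call boto3's Glue `create_table`, and the bucket path below is hypothetical.

```python
# Toy metastore: registering a table is an O(1) metadata write, because
# the data files already sit on S3. This is the property that makes
# staging output queryable (via Spectrum) as soon as the pipeline finishes.
class ToyCatalog:
    def __init__(self):
        self.tables = {}

    def register(self, name, s3_location, columns):
        # Metadata-only operation; no data movement happens here.
        self.tables[name] = {"location": s3_location, "columns": columns}

    def describe(self, name):
        return self.tables[name]

catalog = ToyCatalog()
catalog.register(
    "staging_revenue_daily",
    "s3://example-bucket/staging/revenue/daily/",  # hypothetical path
    ["business_id", "revenue_date", "amount"],
)
```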

Integrity-checker coupling

The staging pipeline is the substrate that makes daily integrity checks possible. The 2025-05-27 post runs its daily integrity SQL against staging-pipeline Glue tables via Redshift Spectrum; monthly integrity SQL runs against Redshift directly because that's where billing-system data lives. See patterns/monthly-plus-daily-dual-cadence-integrity-check.
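A minimal sketch of what such a daily integrity check might reduce to, assuming (the post does not specify) that it compares per-day revenue totals from the staging pipeline's Glue tables against the billing system. The tolerance and field names are assumptions, not from the post.

```python
# Hypothetical daily integrity check: flag days where the staging
# pipeline's totals (queried via Redshift Spectrum in the real system)
# diverge from the billing system's totals beyond a tolerance.
def find_discrepancies(pipeline_totals, billing_totals, tolerance=0.01):
    """Return {day: (pipeline_total, billing_total)} for days that disagree."""
    bad_days = {}
    for day, billed in billing_totals.items():
        piped = pipeline_totals.get(day, 0.0)
        if abs(piped - billed) > tolerance:
            bad_days[day] = (piped, billed)
    return bad_days

# Example: staging dropped an edge-case record on 2025-05-02.
staging = {"2025-05-01": 1000.00, "2025-05-02": 980.00}
billing = {"2025-05-01": 1000.00, "2025-05-02": 1000.00}
```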

Test-data feedback loop

The staging pipeline surfaces edge cases that dev fixtures miss. When it does, Yelp's engineering process is:

  1. Staging + integrity check surfaces the discrepancy against the billing system.
  2. Engineers diagnose the missing edge case.
  3. New synthetic data point is manually created in the dev environment to replicate the edge case.
  4. Future dev-environment test runs now cover it.

This is Yelp's workaround for concepts/test-data-generation-for-edge-cases until automated test-data generation is built (flagged as future improvement).
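Step 3 of the loop above can be sketched as a fixture builder. The post only says the synthetic data point is created manually; the helper name, field names, and the "mid-cycle plan change" edge case below are illustrative assumptions.

```python
# Hypothetical helper for hand-crafting a synthetic dev record that
# replicates an edge case surfaced by the staging pipeline.
def make_edge_case_fixture(case_id, base_record, **overrides):
    """Build a synthetic dev-environment record replicating a prod edge case."""
    record = dict(base_record)
    record.update(overrides)
    record["fixture_id"] = f"edge-case-{case_id}"
    record["synthetic"] = True  # marks it so it is never mistaken for real billing data
    return record

fixture = make_edge_case_fixture(
    "mid-cycle-plan-change",
    {"business_id": 42, "plan": "basic", "amount": 50.0},
    plan="premium",
    proration_days=12,
)
```

Once checked in, the fixture runs with every future dev-environment test, closing the loop described in step 4.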

Caveats

  • Not an isolated environment. Staging consumes production data directly — privileged access to prod reads must be managed.
  • Manual verification. The post describes "easily comparing staging and production results," but the comparison is SQL the engineer writes by hand, not an automated diff.
  • No SLO published. The post says data is "immediately" available after pipeline completion; concrete latency numbers aren't given.
