CONCEPT Cited by 1 source
Staging pipeline¶
Definition¶
A staging pipeline is a parallel data-pipeline configuration that runs alongside the production pipeline: it consumes production data but executes the code under test, writing its output to a separate, verification-friendly substrate that can be queried without disrupting the production pipeline or its downstream consumers.
It's distinct from:
- Dev-environment testing — runs against dev fixtures, not production data.
- Canary deployment — routes a fraction of production traffic to new code; the staging pipeline routes all production inputs to a parallel execution.
- Shadow traffic (as in request-handling systems) — the staging pipeline is its batch-pipeline analogue; both consume real production inputs without affecting the authoritative output.
Structural shape¶
    Production data (single source)
                  │
           ┌──────┴──────┐
           ▼             ▼
      Production       Staging
       pipeline        pipeline
       (stable)    (code under test)
           │             │
           ▼             ▼
     Auth. output   Verification output
     (prod tables)   (scratch tables)
The discipline is twofold:
- The prod pipeline is authoritative and untouched. Its output is consumed by real downstream systems and is never overwritten by staging.
- The staging pipeline is throwaway. Its output is consumed only by verification queries and integrity checks; rollback means reverting staging code, not repairing prod data.
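The twofold discipline can be sketched in a few lines. This is a minimal illustration, not Yelp's implementation; every name here (`run_pipeline`, the transforms, the table lists) is hypothetical.

```python
def run_pipeline(transform, source_rows, sink):
    """Run one pipeline: apply the transform to every production
    input row and append the results to the given sink."""
    for row in source_rows:
        sink.append(transform(row))

def stable_transform(row):       # code currently in production
    return {"id": row["id"], "total": row["amount"]}

def candidate_transform(row):    # code under test
    return {"id": row["id"], "total": round(row["amount"], 2)}

# Both pipelines fan out from the same production inputs.
production_data = [{"id": 1, "amount": 10.005}, {"id": 2, "amount": 3.5}]

prod_tables = []     # authoritative output, read by real consumers
scratch_tables = []  # throwaway output, read only by verification

# Only the staging run executes the candidate code, and it
# never writes to the authoritative tables.
run_pipeline(stable_transform, production_data, prod_tables)
run_pipeline(candidate_transform, production_data, scratch_tables)
```

The point of the separation: if `candidate_transform` turns out to be wrong, rollback is deleting `scratch_tables` and fixing the staging code; `prod_tables` was never at risk.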
Why this is valuable¶
Three problems a staging pipeline solves that the alternatives don't:
- Dev-fixture edge-case gap. Dev data lacks the diversity of production's long-tail inputs. The staging pipeline exercises the code against real production distributions. ("the different edge cases that occur in production could not be covered during dev testing" — 2025-05-27 Yelp post.)
- Direct-to-prod-testing reversibility tax. If you test in prod and find a bug, you must revert the change and repair any corrupted prod data. With staging, there's nothing to repair — just update staging.
- Parallel-cadence verification. The production output path may have its own publication latency (e.g. Yelp's Redshift Connector at ~10 hours). Staging can publish to a low-latency substrate (Glue + Spectrum) enabling fast verification loops.
The substrate-choice corollary¶
A staging pipeline's value is proportional to how quickly you can query its output. Yelp's choice of AWS Glue + Redshift Spectrum over the production path's Redshift tables is the canonical instance — moving verification latency from ~10 hours to effectively immediate unlocks the same-day code-test-fix loop that makes the staging pipeline worthwhile.
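A back-of-the-envelope sketch of the corollary: with verification latency folded into each code-test-fix loop, a ~10-hour publication delay (the Redshift Connector figure from the source) leaves zero whole iterations in a workday, while an effectively immediate substrate allows several. The workday length and the one-hour staging-run duration are assumptions for illustration.

```python
WORKDAY_HOURS = 8  # assumed

def fix_iterations_per_day(pipeline_hours, publish_hours):
    """Whole code-test-fix loops that fit in one workday, where a
    loop is one pipeline run plus the wait until output is queryable."""
    loop = pipeline_hours + publish_hours
    return WORKDAY_HOURS // loop

pipeline_hours = 1  # assumed duration of one staging run

# Production path: ~10-hour publication latency (Redshift Connector)
slow = fix_iterations_per_day(pipeline_hours, publish_hours=10)  # 0
# Staging substrate: Glue + Spectrum, effectively immediate
fast = fix_iterations_per_day(pipeline_hours, publish_hours=0)   # 8
```

The arithmetic is trivial, but it is the whole argument: below one loop per day, same-day verification is impossible no matter how good the checks are.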
Caveats¶
- Doubles pipeline compute cost. Running the full pipeline twice on production data is a real expense; teams must weigh that against the test-feedback-loop value.
- Access control complexity. Staging reads production data directly — privileged read access to prod must be granted to a pipeline that's exercising code that could be wrong.
- Not a stand-alone safety net. It must be paired with integrity checkers that actually compare staging output against production output. Without the comparison, the parallel pipeline produces unreviewed parallel data.
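The comparison the last caveat calls for can be as simple as a keyed diff of staging output against production output. A minimal sketch, with hypothetical field names; a real integrity checker would also handle tolerances, schema drift, and expected differences introduced by the code under test.

```python
def integrity_check(prod_rows, staging_rows, key="id"):
    """Return the keys where staging output disagrees with the
    authoritative production output (including missing/extra rows)."""
    prod = {r[key]: r for r in prod_rows}
    staging = {r[key]: r for r in staging_rows}
    return sorted(k for k in prod.keys() | staging.keys()
                  if prod.get(k) != staging.get(k))

prod_out    = [{"id": 1, "total": 10.0}, {"id": 2, "total": 3.5}]
staging_out = [{"id": 1, "total": 10.0}, {"id": 2, "total": 3.55}]

# Flags row 2, where the code under test changed the result.
integrity_check(prod_out, staging_out)  # → [2]
```

Without a check like this, the parallel pipeline only produces unreviewed parallel data.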
Seen in¶
- sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — canonical instance. Yelp runs a staging Revenue Data Pipeline alongside production, writes to Glue tables, queries via Redshift Spectrum, couples with daily integrity checks.
Related¶
- concepts/data-integrity-checker — the comparison mechanism that makes a staging pipeline useful
- concepts/redshift-connector-latency — the specific motivating constraint at Yelp
- patterns/parallel-staging-pipeline-for-prod-verification — the repeatable pattern
- patterns/shadow-migration — sibling pattern for migrations
- systems/yelp-staging-pipeline — canonical implementation