
PATTERN Cited by 1 source

Dark-ship for behaviour parity

Dark-shipping runs a new implementation of a subsystem in parallel with the old one inside the live production request path, for a sampled fraction of real traffic: only the old result is returned to the user, and behaviour differences are logged for later triage. The new path runs as a background job (sometimes literally a Thread.new / async task) that takes the same input, produces its result, and compares it against the old one. Differences are surfaced to engineers; the user sees a normal response.
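A minimal sketch of that shape, in Ruby. All names here (dark_ship_search, LegacySearch, CandidateSearch, DIFF_LOG) are illustrative placeholders, not from the GitHub post; the sample rate is set to 1.0 so the sketch always samples (production would use something like 0.01):

```ruby
SAMPLE_RATE = 1.0  # 0.01 in production; 1.0 here so the sketch always samples

LegacySearch    = ->(q) { [1, 2, 3] }  # stand-in for the old search path
CandidateSearch = ->(q) { [1, 2] }     # stand-in for the rewrite
DIFF_LOG        = []                   # stand-in for the triage stream

def dark_ship_search(query)
  control = LegacySearch.call(query)   # the result the user actually sees

  if rand < SAMPLE_RATE                # feature-flag sampling predicate
    worker = Thread.new do             # background job: never blocks the response
      candidate = CandidateSearch.call(query)
      # First-iteration metric: result count only
      if candidate.size != control.size
        DIFF_LOG << { query: query, control: control.size, candidate: candidate.size }
      end
    rescue => e
      DIFF_LOG << { query: query, error: e.class.name }  # candidate errors are data, not outages
    end
    worker.join  # joined here only to make the sketch deterministic; prod would fire-and-forget
  end

  control  # the user always gets the old result
end

dark_ship_search("is:open label:bug")
```

The key property is in the last line of the method: the candidate's output never escapes the harness, so a candidate bug costs a log entry, not an incident.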

Distinct from shadow migration, which runs the new path on a batch / offline workload and reconciles dataset-level statistics; dark-shipping runs against live request-path traffic, in the same process or sidecar, and reconciles per-request results as a debugging signal before GA.

Shape

  1. Gate with a feature flag. The candidate branch runs only for requests matching the flag's sampling predicate (commonly rand < 0.01).
  2. Return the control (old) result to the user. The candidate runs for its diff value, not its user-visible effect. The user sees zero latency impact from candidate errors, and at most a bounded latency impact from the candidate's extra work (depending on whether it runs synchronously within the request or asynchronously after it).
  3. Compare on a correctness-meaningful metric. What to diff is the central design choice:
       • Result count (GitHub Issues' first-iteration pick): cheapest to log; catches large regressions; misses result-set reordering and silent swap-outs at equal count.
       • Top-K result IDs: catches ordering changes; ~10-100× more log volume.
       • Full result set: thorough, but expensive to log and diff. Set-diffing on IDs only (not bodies) is usually a good compromise.
       • Structural output (AST, query document, response envelope): for rewrites where the shape matters more than the values.
  4. Bound the comparison window. Only treat two runs as "same intent" when they completed within a short window of each other (GitHub: ≤1 s). Clock skew and slow candidate runs can otherwise compare against stale user input.
  5. Log asymmetries to a triage stream. Differences become work items: fix bugs, refine the new implementation, or classify the diff as acceptable (e.g., an intentional semantics change).
  6. Iterate on the metric. Early: count-diffs. Later: top-K identity, recall-at-K, structural diffs. Moving up the metric ladder catches finer regressions but also costs more to log and triage.

What dark-ship catches that tests miss

  • Real user inputs the test suite didn't anticipate. Queries in the long tail of user behaviour that exist in prod but nowhere in test fixtures.
  • Interaction effects with production data. A query that parses identically might hit a different index shape, a deleted record's tombstone, a schema migration's partial state.
  • Traffic-shape sensitivity. Query-distribution skew that only appears at scale.
  • Clock / cache / state-carrying bugs. Same input, different results from two code paths due to observable time dependence.

What dark-ship doesn't catch

  • Mutation-path bugs. Running a write path twice double-writes unless the candidate is idempotent or isolated; dark-ship is a read-path pattern. Use patterns/dual-write-migration or explicit shadow writes for mutation rewrites.
  • Latency-budget regressions. The harness itself consumes CPU on the candidate path; the candidate's effect on production latency is not what dark-shipping measures. Pair with patterns/performance-comparison-with-scientist (or a similar perf harness) to measure that.
  • Backend-load regressions. Running the candidate against the same search engine adds backend load, which a 1% sample may not reveal until scale-up.
  • Rare bugs below the sampling floor. For rare query classes, a 1% sample may never cover the failure class; consider temporarily running dark-ship at a higher percentage for corpus-coverage sweeps.
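The last bullet is just back-of-envelope probability; the numbers below are illustrative, not from the source. If a failing query class is fraction f of traffic and the sample rate is s, each request lands in the diff stream with probability f·s, so over n requests the chance of seeing it at least once is 1 − (1 − f·s)^n:

```ruby
# Probability of observing at least one instance of a rare failing query class
# in the diff stream. f = class frequency, s = sample rate, n = total requests.
def p_observed(f:, s:, n:)
  1 - (1 - f * s)**n
end

p_observed(f: 1e-5, s: 0.01, n: 1_000_000)  # ~0.10: likely missed at a 1% sample
p_observed(f: 1e-5, s: 0.25, n: 1_000_000)  # ~0.92: a temporary 25% sweep catches it
```

This is why "run at a higher % temporarily" works: coverage scales with f·s·n, and the sample rate is the only knob you control.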

Dark-ship vs scientist

For the GitHub Issues rewrite:

  • dark-ship harness → behaviour parity (result-count diffs on 1% of traffic, triaged before GA).
  • systems/scientist harness → performance parity (old-vs-new latency on a separate 1% of traffic).

Either alone misses half the regression surface. They're complementary, not alternatives.

Canonical instance: GitHub Issues search rewrite

"For 1% of issue searches, we ran the user's search against both the existing and new search systems in a background job, and logged differences in responses. By analyzing these differences we were able to fix bugs and missed edge cases before they reached our users. We weren't sure at the outset how to define 'differences,' but we settled on 'number of results' for the first iteration. In general, it seemed that we could determine whether a user would be surprised by the results of their search against the new search capability if a search returned a different number of results when they were run within a second or less of each other."

(Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)

Note the staged-metric acknowledgement: "for the first iteration" explicitly frames count-diff as the starting bar, not the ending one. Mature dark-ship harnesses move up the metric ladder as early-iteration bugs shake out.

When to use

  • Critical-path read rewrites with significant semantic risk (search, authorization checks, relevance ranking, permission evaluation, query compilers).
  • Systems with high QPS and bookmarked / shared inputs where a visible regression would be an incident class.
  • Rewrites where the test suite is a known under-approximation of production behaviour.

When not to use

  • Pure write-path rewrites — use patterns/dual-write-migration instead.
  • Low-traffic systems where a 1% sample is too small a stream to catch regressions. Run a wider-fraction synthetic corpus through the candidate offline instead.
  • Changes with intentional semantic drift (e.g., a security fix that deliberately returns fewer results). The diff stream becomes ambiguous: you can't tell intentional changes from bugs. If you dark-ship here anyway, invert the mental model: the harness confirms that diffs match the expected shape of the intentional change, rather than flagging unexpected ones.
