PATTERN Cited by 1 source
Performance comparison with scientist¶
Run the old (control) and new (candidate) implementations of a critical code path side by side on a sampled fraction of production traffic using a harness like GitHub's scientist library; compare timing, errors, and observables; and use the comparison stream to gate the candidate's promotion from dark-run to live.
Distinct from dark-ship, which focuses on result equivalence. scientist-style harnesses focus on execution equivalence: given the same result, the same latency profile and the same error surface. The two are complementary: the dark-ship harness answers "does the new code produce the same answer?", while the scientist harness answers "does the new code cost the same to run?"
Shape (scientist idiom, Ruby)¶

```ruby
science "issues-search-query" do |e|
  e.use { existing_query_module.search(input) }  # control
  e.try { new_query_module.search(input) }       # candidate
  e.run_if { rand < 0.01 }                       # 1% sample
  e.compare { |ctrl, cand| ctrl == cand }        # equality check
end
```
Behaviour:
- The control block runs. Its result is returned to the caller, so the user's request is served by the known path.
- The candidate block runs; the order between control and candidate is randomized per call to neutralize first-path cache bias. The candidate's exceptions are swallowed before reaching the caller; they are published as part of the experiment.
- A publish hook receives a Result object with the two returned values, the two durations, and any exceptions, and routes them to the team's metrics / log sink.
- A compare block customizes the equality notion; the default is `==`.
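Those behaviours fit in a few dozen lines. A minimal hand-rolled sketch in plain Ruby — no gem; `MiniExperiment` and its `Result` struct are illustrative names, not the scientist API:

```ruby
# Minimal scientist-style harness: the control result is always what the
# caller sees, candidate exceptions are swallowed, branch order is
# randomized per call, and both durations reach a publish hook.
class MiniExperiment
  Result = Struct.new(:control_value, :candidate_value,
                      :control_duration, :candidate_duration,
                      :candidate_exception)

  def initialize(&publisher)
    @publisher = publisher
  end

  def run(control:, candidate:)
    obs = {}
    # Shuffle so neither path systematically warms caches for the other.
    [[:control, control], [:candidate, candidate]].shuffle.each do |name, block|
      t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      begin
        obs[name] = { value: block.call }
      rescue => e
        obs[name] = { exception: e }
      ensure
        obs[name][:duration] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
      end
    end
    @publisher.call(Result.new(
      obs.dig(:control, :value), obs.dig(:candidate, :value),
      obs[:control][:duration], obs[:candidate][:duration],
      obs.dig(:candidate, :exception)
    ))
    # Control failures still surface to the caller; candidate failures never do.
    raise obs[:control][:exception] if obs[:control][:exception]
    obs[:control][:value]
  end
end
```

The real library layers `run_if` sampling, `compare`/`ignore` blocks, and a pluggable publisher on top of this core loop.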
When this is the right tool¶
- Refactoring a read path where running both implementations is side-effect-free (or both have the same idempotent side-effects).
- Performance regression risk is the main concern. You already trust the new code's correctness from test suites + code review; what you don't have is a pre-GA signal on how it costs in production.
- 1–5% sampling is acceptable. Scientist approximately doubles critical-path work on sampled requests; 100% sampling is rarely feasible.
- The code path is in a Ruby / Python / Go / Java stack with a scientist port or hand-rolled equivalent. The pattern transcends the library, but scientist-class tools bake in exception isolation, random branch ordering, and pluggable publishers that are tedious to redo manually.
When this is the wrong tool¶
- Write paths where running the candidate produces side effects. Use patterns/dual-write-migration or try-block idempotent markers instead.
- Tight latency budgets where even 1% doubled work is unacceptable. Run the candidate offline against a synthetic load-test corpus instead.
- Changes with intentional perf regressions (e.g., a security fix that adds a necessary check). The harness produces noise; frame the perf comparison against a different baseline.
Interaction with dark-ship¶
For the GitHub Issues search rewrite:
| Harness | Sample | What it compares | What signal it produces |
|---|---|---|---|
| Dark-ship | 1% | Number of results within ≤1 s | Behaviour parity diffs → bug triage |
| scientist | 1% | Timing / errors / observables | Perf parity regressions → optimisation work |
Running both (on disjoint or overlapping sample sets) covers both axes of rewrite risk. The GitHub post is explicit that the team used both. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)
Publisher design matters¶
- Publish to a metrics backend that supports per-experiment dashboards (percentile latencies, error rates, mismatch counts by input category).
- Rate-limit mismatch publishing — a bad deploy can produce millions of diffs per minute and nuke your logging tier.
- Include enough context (query text, user-context hash, a flag-state snapshot) to make mismatches triageable without a re-run, but not so much that the publish becomes a privacy leak.
- Separate "expected divergence" from "regression": if the new code intentionally changes behaviour in some known classes, tag those so they aren't alarms.
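The rate-limiting point can be sketched with a fixed-window limiter. A hypothetical `RateLimitedPublisher` — the scientist gem leaves publishing entirely to you, so this is one possible shape, not its API:

```ruby
# Fixed-window rate limiter for mismatch publishing: after `limit`
# publishes in a window, further mismatches are counted but dropped,
# so a bad deploy producing millions of diffs can't flood the log tier.
class RateLimitedPublisher
  def initialize(limit:, window_seconds:, sink:)
    @limit, @window, @sink = limit, window_seconds, sink
    @window_start = monotonic_now
    @published = 0
    @dropped = 0
  end

  def publish(mismatch)
    now = monotonic_now
    if now - @window_start >= @window
      # New window: emit a summary of what was dropped, then reset.
      @sink.call({ dropped_in_window: @dropped }) if @dropped > 0
      @window_start, @published, @dropped = now, 0, 0
    end
    if @published < @limit
      @published += 1
      @sink.call(mismatch)
    else
      @dropped += 1
    end
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

Emitting a dropped-count summary at each window boundary preserves the regression signal's magnitude even while individual diffs are discarded.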
Trade-offs vs a simpler approach¶
Scientist is heavier than a hand-rolled `if rand < 0.01 then log both` branch. What you buy:
- Candidate-exception isolation (new code failing doesn't take the request down).
- Random branch ordering (no systematic cache-warming bias).
- A standard publisher interface (dashboards / alerts are reusable across refactors).
- An `ignore` mechanism for deliberate, known divergences.
What you pay:
- Library dependency on your critical path.
- In-process coupling between control and candidate (not suitable for asynchronous / cross-process comparisons).
- Doubled latency on sampled requests unless the candidate is explicitly async'd.
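The last cost can be traded away: running the candidate on a background thread returns the control result at control latency, at the price of weaker ordering guarantees and slightly noisier comparisons. A hand-rolled sketch, not a scientist feature; names are illustrative:

```ruby
# Keep the candidate off the critical path: the caller gets the control
# result at control latency; the candidate's timing, match status, or
# exception is published whenever its background thread finishes.
def run_with_async_candidate(control:, candidate:, publish:)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  t0 = clock.call
  control_value = control.call
  control_s = clock.call - t0

  Thread.new do
    t1 = clock.call
    begin
      candidate_value = candidate.call
      publish.call({ control_s: control_s, candidate_s: clock.call - t1,
                     match: candidate_value == control_value })
    rescue => e
      # Candidate failures are reported, never raised at the caller.
      publish.call({ control_s: control_s, candidate_error: e.class.name })
    end
  end

  control_value
end
```

Because the candidate now observes application state slightly later than the control, expect a higher baseline of spurious mismatches on fast-changing data.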
Canonical instance¶
GitHub Issues search rewrite — used scientist to compare performance of equivalent queries between the old flat-parser implementation and the new PEG-AST-recursive implementation on 1% of Issues searches. Verified "there was no regression" on equivalent queries before rolling the new path out to broader API surfaces. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)
Seen in¶
- sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean — canonical in-wiki instance. Paired with patterns/dark-ship-for-behavior-parity for behaviour parity on a separate 1% slice.
Related¶
- systems/scientist — the library the pattern is named after.
- patterns/dark-ship-for-behavior-parity — complementary behaviour-parity harness.
- patterns/staged-rollout — what scientist's signal gates.
- concepts/backward-compatibility — the constraint that motivates the pair of harnesses.