

Performance comparison with scientist

Run old (control) and new (candidate) implementations of a critical code path side by side on a sampled fraction of production traffic using a harness like GitHub's scientist library, compare timing, errors, and observables, and use the comparison stream to gate the candidate's promotion from dark-run to live.

Distinct from dark-ship, which focuses on result equivalence: scientist-style harnesses focus on execution equivalence — same result, same latency profile, same error surface. The two are complementary: the dark-ship harness answers "does the new code produce the same answer?", while the scientist harness answers "does the new code cost the same to run?"

Shape (scientist idiom, Ruby)

science "issues-search-query" do |e|
  e.use     { existing_query_module.search(input) }  # control
  e.try     { new_query_module.search(input) }       # candidate
  e.run_if  { rand < 0.01 }                          # 1% sample
  e.compare { |ctrl, cand| ctrl == cand }            # equality check
end

Behaviour:

  • The control block runs and its result is returned to the caller; the user's request is always served by the known path.
  • The candidate block runs — order between control and candidate is randomized per call to neutralize first-path cache bias. The candidate's exceptions are swallowed before reaching the caller; they're published as part of the experiment.
  • A publish hook receives a Result object with the two returned values, two durations, and any exceptions, and routes them to the team's metrics / log sink.
  • A compare block customizes the notion of equality; the default is ==.
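The publish hook's contract can be sketched in plain Ruby. This is a minimal sketch, independent of the scientist gem: ExperimentResult and MetricsSink are illustrative names, and the gem's actual Result/Observation classes differ in detail.

```ruby
# Sketch of what a publish hook receives and emits.
# ExperimentResult and MetricsSink are illustrative, not gem classes.
ExperimentResult = Struct.new(
  :name,                               # experiment name, e.g. "issues-search-query"
  :control_value, :control_duration,   # value and wall-clock seconds for control
  :candidate_value, :candidate_duration,
  :candidate_exception,                # nil unless the candidate raised
  keyword_init: true
) do
  def matched?
    candidate_exception.nil? && control_value == candidate_value
  end
end

class MetricsSink
  attr_reader :events

  def initialize
    @events = []
  end

  # Flatten the result into a metrics/log event: durations in ms,
  # match flag, and the candidate's exception class if any.
  def publish(result)
    @events << {
      experiment:   result.name,
      control_ms:   (result.control_duration * 1000).round(2),
      candidate_ms: (result.candidate_duration * 1000).round(2),
      matched:      result.matched?,
      error:        result.candidate_exception&.class&.name
    }
  end
end

sink = MetricsSink.new
sink.publish(ExperimentResult.new(
  name: "issues-search-query",
  control_value: [1, 2], control_duration: 0.012,
  candidate_value: [1, 2], candidate_duration: 0.009,
  candidate_exception: nil
))
```

A publisher shaped like this feeds per-experiment dashboards directly: each event carries both durations, so percentile latency comparison is a query away.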

When this is the right tool

  • Refactoring a read path where running both implementations is side-effect-free (or both have the same idempotent side-effects).
  • Performance regression risk is the main concern. You already trust the new code's correctness from test suites and code review; what you lack is a pre-GA signal on what it costs in production.
  • 1–5% sampling is acceptable. Scientist approximately doubles critical-path work on sampled requests; 100% sampling is rarely feasible.
  • The code path is in a Ruby / Python / Go / Java stack with a scientist port or hand-rolled equivalent. The pattern transcends the library, but scientist-class tools bake in exception isolation, random branch ordering, and pluggable publishers that are tedious to redo manually.

When this is the wrong tool

  • Write paths where running the candidate produces side effects. Use patterns/dual-write-migration or try-block idempotent markers instead.
  • Tight latency budgets where even 1% doubled work is unacceptable. Run the candidate offline on a synthetic load-test corpus instead.
  • Changes with intentional perf regressions (e.g., a security fix that adds a necessary check). The harness produces noise; frame the performance comparison against a different baseline.

Interaction with dark-ship

For the GitHub Issues search rewrite:

| Harness   | Sample | What it compares              | What signal it produces                     |
|-----------|--------|-------------------------------|---------------------------------------------|
| Dark-ship | 1%     | Number of results within ≤1 s | Behaviour parity diffs → bug triage         |
| scientist | 1%     | Timing / errors / observables | Perf parity regressions → optimisation work |

Running both (on disjoint or overlapping sample sets) covers both axes of rewrite risk. The GitHub post is explicit that the team used both. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)

Publisher design matters

  • Publish to a metrics backend that supports per-experiment dashboards (percentile latencies, error rates, mismatch counts by input category).
  • Rate-limit mismatch publishing — a bad deploy can produce millions of diffs per minute and nuke your logging tier.
  • Include enough context (query text, user-context hash, a flag-state snapshot) to make mismatches triageable without a re-run — but not so much that the publish itself becomes a privacy leak.
  • Separate "expected divergence" from "regression": if the new code intentionally changes behaviour in some known classes, tag those so they aren't alarms.
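The rate-limiting point can be as simple as a fixed-window counter in front of the mismatch path. A sketch, assuming an in-process publisher; the class name and the array-as-sink are illustrative:

```ruby
# Sketch: drop mismatch publishes beyond a per-window budget so a bad
# deploy can't flood the logging tier. Names are illustrative.
class RateLimitedMismatchPublisher
  def initialize(sink, max_per_window:, window_seconds: 60)
    @sink = sink
    @max = max_per_window
    @window = window_seconds
    @window_start = Time.now
    @count = 0
  end

  # Returns :published or :dropped so callers can count suppressed events.
  def publish_mismatch(payload)
    now = Time.now
    if now - @window_start >= @window   # new window: reset the budget
      @window_start = now
      @count = 0
    end
    return :dropped if @count >= @max
    @count += 1
    @sink << payload
    :published
  end
end

sink = []
limiter = RateLimitedMismatchPublisher.new(sink, max_per_window: 2)
results = 5.times.map { |i| limiter.publish_mismatch(query: "q#{i}") }
# Within one window, only the first two mismatches reach the sink.
```

Counting the dropped events (rather than silently discarding them) preserves the "millions of diffs per minute" signal as a single metric instead of millions of log lines.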

Trade-offs vs a simpler approach

Scientist is heavier than a hand-rolled "if rand < 0.01, run both and log" branch. What you buy:

  • Candidate-exception isolation (new code failing doesn't take the request down).
  • Random branch ordering (no systemic cache-warming bias).
  • A standard publisher interface (dashboards / alerts are reusable across refactors).
  • An ignore mechanism for deliberate, known divergences.

What you pay:

  • Library dependency on your critical path.
  • In-process coupling between control and candidate (not suitable for asynchronous / cross-process comparisons).
  • Doubled latency on sampled requests unless the candidate is explicitly run asynchronously.
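For comparison, the hand-rolled core that a scientist-class tool replaces might look like this — a sketch showing the two hard-to-get-right pieces (candidate-exception isolation and random branch ordering) with a pluggable publisher; the function name and result shape are illustrative:

```ruby
# Sketch of a hand-rolled scientist equivalent. Runs control and
# candidate in random order, isolates candidate exceptions, and always
# returns the control value to the caller.
require "benchmark"

def run_experiment(sample_rate: 0.01, publisher:, control:, candidate:)
  return control.call unless rand < sample_rate

  observations = {}
  # Shuffle to neutralize first-path cache-warming bias.
  { control: control, candidate: candidate }.to_a.shuffle.each do |name, block|
    value = exception = nil
    duration = Benchmark.realtime do
      begin
        value = block.call
      rescue => e
        exception = e  # swallow; never let the candidate fail the request
      end
    end
    observations[name] = { value: value, duration: duration, exception: exception }
  end

  publisher.call(observations)
  # The caller is always served by the control path; only a control
  # failure propagates.
  raise observations[:control][:exception] if observations[:control][:exception]
  observations[:control][:value]
end

published = []
result = run_experiment(
  sample_rate: 1.0,  # 100% only for the sake of the example
  publisher: ->(obs) { published << obs },
  control:   -> { [1, 2, 3] },
  candidate: -> { raise "candidate bug" }
)
```

Even this minimal version is easy to get subtly wrong (for example, letting a candidate exception escape, or always running the control first), which is the argument for taking the library dependency.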

Canonical instance

GitHub Issues search rewrite — used scientist to compare performance of equivalent queries between the old flat-parser implementation and the new PEG-AST-recursive implementation on 1% of Issues searches. Verified "there was no regression" on equivalent queries before rolling the new path out to broader API surfaces. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)
