

Performance comparison with scientist

Run old (control) and new (candidate) implementations of a critical code path side by side on a sampled fraction of production traffic using a harness like GitHub's scientist library, compare timing, errors, and observables, and use the comparison stream to gate the candidate's promotion from dark-run to live.

Distinct from dark-ship, which focuses on result equivalence: scientist-style harnesses focus on execution equivalence — same result, same latency profile, same error surface. The two are complementary: the dark-ship harness answers "does the new code produce the same answer?", while the scientist harness answers "does the new code cost the same to run?"

Shape (scientist idiom, Ruby)

science "issues-search-query" do |e|
  e.use     { existing_query_module.search(input) }  # control
  e.try     { new_query_module.search(input) }       # candidate
  e.run_if  { rand < 0.01 }                          # 1% sample
  e.compare { |ctrl, cand| ctrl == cand }            # equality check
end

Behaviour:

  • The control block runs and its result is returned to the caller; the user's request is always served by the known path.
  • The candidate block runs — order between control and candidate is randomized per call to neutralize first-path cache bias. The candidate's exceptions are swallowed before reaching the caller; they're published as part of the experiment.
  • A publish hook receives a Result object with the two returned values, two durations, and any exceptions, and routes them to the team's metrics / log sink.
  • A compare block customizes the notion of equality; the default is ==.
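The publish hook's contract can be sketched in plain Ruby. This is a minimal sketch, independent of the scientist gem: ExperimentResult and MetricsSink are illustrative names, and the gem's actual Result/Observation classes differ in detail.

```ruby
# Sketch of what a publish hook receives and emits.
# ExperimentResult and MetricsSink are illustrative, not gem classes.
ExperimentResult = Struct.new(
  :name,                               # experiment name, e.g. "issues-search-query"
  :control_value, :control_duration,   # value and wall-clock seconds for control
  :candidate_value, :candidate_duration,
  :candidate_exception,                # nil unless the candidate raised
  keyword_init: true
) do
  def matched?
    candidate_exception.nil? && control_value == candidate_value
  end
end

class MetricsSink
  attr_reader :events

  def initialize
    @events = []
  end

  # Flatten the result into a metrics/log event: durations in ms,
  # match flag, and the candidate's exception class if any.
  def publish(result)
    @events << {
      experiment:   result.name,
      control_ms:   (result.control_duration * 1000).round(2),
      candidate_ms: (result.candidate_duration * 1000).round(2),
      matched:      result.matched?,
      error:        result.candidate_exception&.class&.name
    }
  end
end

sink = MetricsSink.new
sink.publish(ExperimentResult.new(
  name: "issues-search-query",
  control_value: [1, 2], control_duration: 0.012,
  candidate_value: [1, 2], candidate_duration: 0.009,
  candidate_exception: nil
))
```

A publisher shaped like this feeds per-experiment dashboards directly: each event carries both durations, so percentile latency comparison is a query away.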

When this is the right tool

  • Refactoring a read path where running both implementations is side-effect-free (or both have the same idempotent side-effects).
  • Performance regression risk is the main concern. You already trust the new code's correctness from test suites and code review; what you lack is a pre-GA signal on what it costs in production.
  • 1–5% sampling is acceptable. Scientist approximately doubles critical-path work on sampled requests; 100% sampling is rarely feasible.
  • The code path is in a Ruby / Python / Go / Java stack with a scientist port or hand-rolled equivalent. The pattern transcends the library, but scientist-class tools bake in exception isolation, random branch ordering, and pluggable publishers that are tedious to redo manually.

When this is the wrong tool

  • Write paths where running the candidate produces side effects. Use patterns/dual-write-migration or try-block idempotent markers instead.
  • Tight latency budgets where even 1% doubled work is unacceptable. Run the candidate offline on a synthetic load-test corpus instead.
  • Changes with intentional perf regressions (e.g., a security fix that adds a necessary check). The harness produces noise; frame the performance comparison against a different baseline.

Interaction with dark-ship

For the GitHub Issues search rewrite:

| Harness   | Sample | What it compares              | What signal it produces                     |
|-----------|--------|-------------------------------|---------------------------------------------|
| Dark-ship | 1%     | Number of results within ≤1 s | Behaviour parity diffs → bug triage         |
| scientist | 1%     | Timing / errors / observables | Perf parity regressions → optimisation work |

Running both (on disjoint or overlapping sample sets) covers both axes of rewrite risk. The GitHub post is explicit that the team used both. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)

Publisher design matters

  • Publish to a metrics backend that supports per-experiment dashboards (percentile latencies, error rates, mismatch counts by input category).
  • Rate-limit mismatch publishing — a bad deploy can produce millions of diffs per minute and nuke your logging tier.
  • Include enough context (query text, user-context hash, a flag-state snapshot) to make mismatches triageable without a re-run — but not so much that the publish itself becomes a privacy leak.
  • Separate "expected divergence" from "regression": if the new code intentionally changes behaviour in some known classes, tag those so they aren't alarms.
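The rate-limiting point can be as simple as a fixed-window counter in front of the mismatch path. A sketch, assuming an in-process publisher; the class name and the array-as-sink are illustrative:

```ruby
# Sketch: drop mismatch publishes beyond a per-window budget so a bad
# deploy can't flood the logging tier. Names are illustrative.
class RateLimitedMismatchPublisher
  def initialize(sink, max_per_window:, window_seconds: 60)
    @sink = sink
    @max = max_per_window
    @window = window_seconds
    @window_start = Time.now
    @count = 0
  end

  # Returns :published or :dropped so callers can count suppressed events.
  def publish_mismatch(payload)
    now = Time.now
    if now - @window_start >= @window   # new window: reset the budget
      @window_start = now
      @count = 0
    end
    return :dropped if @count >= @max
    @count += 1
    @sink << payload
    :published
  end
end

sink = []
limiter = RateLimitedMismatchPublisher.new(sink, max_per_window: 2)
results = 5.times.map { |i| limiter.publish_mismatch(query: "q#{i}") }
# Within one window, only the first two mismatches reach the sink.
```

Counting the dropped events (rather than silently discarding them) preserves the "millions of diffs per minute" signal as a single metric instead of millions of log lines.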

Trade-offs vs a simpler approach

Scientist is heavier than a hand-rolled "if rand < 0.01, run both and log" branch. What you buy:

  • Candidate-exception isolation (new code failing doesn't take the request down).
  • Random branch ordering (no systemic cache-warming bias).
  • A standard publisher interface (dashboards / alerts are reusable across refactors).
  • An ignore mechanism for deliberate, known divergences.

What you pay:

  • Library dependency on your critical path.
  • In-process coupling between control and candidate (not suitable for asynchronous / cross-process comparisons).
  • Doubled latency on sampled requests unless the candidate is explicitly run asynchronously.
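For comparison, the hand-rolled core that a scientist-class tool replaces might look like this — a sketch showing the two hard-to-get-right pieces (candidate-exception isolation and random branch ordering) with a pluggable publisher; the function name and result shape are illustrative:

```ruby
# Sketch of a hand-rolled scientist equivalent. Runs control and
# candidate in random order, isolates candidate exceptions, and always
# returns the control value to the caller.
require "benchmark"

def run_experiment(sample_rate: 0.01, publisher:, control:, candidate:)
  return control.call unless rand < sample_rate

  observations = {}
  # Shuffle to neutralize first-path cache-warming bias.
  { control: control, candidate: candidate }.to_a.shuffle.each do |name, block|
    value = exception = nil
    duration = Benchmark.realtime do
      begin
        value = block.call
      rescue => e
        exception = e  # swallow; never let the candidate fail the request
      end
    end
    observations[name] = { value: value, duration: duration, exception: exception }
  end

  publisher.call(observations)
  # The caller is always served by the control path; only a control
  # failure propagates.
  raise observations[:control][:exception] if observations[:control][:exception]
  observations[:control][:value]
end

published = []
result = run_experiment(
  sample_rate: 1.0,  # 100% only for the sake of the example
  publisher: ->(obs) { published << obs },
  control:   -> { [1, 2, 3] },
  candidate: -> { raise "candidate bug" }
)
```

Even this minimal version is easy to get subtly wrong (for example, letting a candidate exception escape, or always running the control first), which is the argument for taking the library dependency.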

Canonical instance

GitHub Issues search rewrite — used scientist to compare performance of equivalent queries between the old flat-parser implementation and the new PEG-AST-recursive implementation on 1% of Issues searches. Verified "there was no regression" on equivalent queries before rolling the new path out to broader API surfaces. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)
