PATTERN

A/B-Test Rollout with Percentile Guardrails

When shipping a performance-sensitive change whose impact is noisy across client environments, run it as a controlled A/B experiment rather than a fixed-timeline rollout, and hold the rollout until per-percentile metrics (p50, p90, p99) move in the expected direction on both topline and guardrail dimensions. This complements but does not replace patterns/staged-rollout: staged rollout limits blast radius, while A/B rollout attributes cause to the change and catches regressions the topline hides.

Mechanics

  1. Instrument many browser-side metrics (Atlassian captures a broad set: FCP, TTVC/VC90, TTI, hydration success rate, etc.) — enough to catch secondary regressions.
  2. Split traffic into treatment and control. Track each metric at p50 / p90 / p99; tails often move in the opposite direction to medians under subtle regressions.
  3. Riskier changes get slower rollouts ("several weeks" for Confluence's streaming SSR). Complexity of change, not size of expected win, dictates pace.
  4. Hold the release until the movement is statistically significant across the chosen percentiles; only then expand.
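Steps 2 and 4 can be sketched as a percentile guardrail gate. This is a minimal illustration, not Atlassian's tooling: function names, the 2% tolerance, and the nearest-rank percentile are all assumptions, and a production gate would add a significance test (e.g., bootstrap confidence intervals per percentile) rather than a raw threshold.

```python
import hashlib

def bucket(user_id: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to treatment or control (illustrative)."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if h < treatment_pct else "control"

def percentile(samples, p):
    """Nearest-rank percentile; enough for a sketch."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def guardrail_check(control, treatment, percentiles=(50, 90, 99), tolerance=0.02):
    """Return percentiles where treatment is worse than control by more than
    `tolerance` (relative). Lower is better for latency-style metrics."""
    regressions = []
    for p in percentiles:
        c, t = percentile(control, p), percentile(treatment, p)
        if t > c * (1 + tolerance):
            regressions.append((p, c, t))
    return regressions
```

A non-empty result means hold the rollout. Note how a treatment sample identical to control except for a fatter tail passes at p50 and p90 but fails at p99, which is exactly the "tails move opposite to medians" failure mode the pattern exists to catch.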

Why topline-only rollouts fail

The Confluence team almost shipped streaming SSR with a latent TTI regression caused by the React 18 context/Suspense re-render bug (see concepts/react-hydration). Topline FCP was up ~40%, the headline win, but TTI at p90 was degrading. Guardrail metrics plus per-percentile tracking surfaced the regression before GA; the team fixed it with unstable_scheduleHydration and continued the rollout. (Source: sources/2026-04-16-atlassian-streaming-ssr-confluence)

Lesson: performance A/B tests need more metrics than you think. Some of them exist specifically to catch you ruining something else while winning the advertised metric.
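One way to make the lesson concrete is to declare guardrails explicitly alongside the topline metric, so the gate checks every (metric, percentile) pair, not just the advertised win. Everything here is a hypothetical schema: the metric names, directions, and regression allowances are illustrative, not Atlassian's actual configuration.

```python
# Illustrative experiment definition: one advertised topline metric plus
# guardrails that exist only to catch collateral damage elsewhere.
EXPERIMENT = {
    "name": "streaming-ssr",
    "topline": {"metric": "fcp_ms", "direction": "down"},  # the headline win
    "guardrails": [
        {"metric": "tti_ms", "direction": "down", "max_regression": 0.02},
        {"metric": "ttvc_ms", "direction": "down", "max_regression": 0.02},
        {"metric": "hydration_success_rate", "direction": "up", "max_regression": 0.001},
    ],
    "percentiles": [50, 90, 99],
}

def blocking_guardrails(deltas: dict) -> list:
    """deltas maps (metric, percentile) -> relative change (+ means the metric
    went up). Returns every guardrail breach that should block the rollout."""
    blocked = []
    for g in EXPERIMENT["guardrails"]:
        for p in EXPERIMENT["percentiles"]:
            d = deltas.get((g["metric"], p), 0.0)
            # "down" metrics regress when they rise; "up" metrics when they fall.
            regressed = d > g["max_regression"] if g["direction"] == "down" else -d > g["max_regression"]
            if regressed:
                blocked.append((g["metric"], p, d))
    return blocked
```

Under this shape, a 40% FCP improvement contributes nothing to the gate decision; a 5% TTI regression at p90 alone is enough to block.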

When to use

  • Large cross-cutting changes (protocol/runtime swap, rendering-mode change, compression-algorithm switch) with diffuse client-side impact.
  • Changes where the expected-win metric is partly correlated with metrics you don't want to regress.
  • Any "riskier" change by the team's own judgement — complexity is the signal, not projected impact.

Relation to other rollout patterns

  • patterns/staged-rollout — environment/zone/pod-%/user-% stages, each with its own evaluation. Used alongside A/B; e.g., staged by region, A/B within each stage.
  • patterns/fast-rollback — still required; A/B test tells you when to roll back, the rollback mechanism lets you do it quickly.
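The "staged by region, A/B within each stage" composition above can be sketched as a rollout driver in which each stage must pass the A/B gate before expansion, and a regression verdict triggers the separately-built rollback mechanism. Stage names and the verdict vocabulary are hypothetical.

```python
# Hypothetical driver combining the three patterns: staged rollout supplies
# the stages, the A/B evaluation supplies the verdict, fast rollback supplies
# the exit. evaluate_ab(stage) returns "pass", "regression", or "inconclusive".
STAGES = ["internal", "region-a", "region-b", "global"]

def run_rollout(evaluate_ab, rollback):
    for stage in STAGES:
        verdict = evaluate_ab(stage)
        if verdict == "regression":
            rollback(stage)  # the A/B test says when; fast-rollback says how
            return f"rolled back at {stage}"
        if verdict == "inconclusive":
            return f"holding at {stage}"  # keep collecting data, don't expand
    return "fully rolled out"
```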

Seen in

  • sources/2026-04-16-atlassian-streaming-ssr-confluence