
PATTERN Cited by 1 source

Automated SRM alert

Problem

A/B test results are only as trustworthy as the groups being compared. When the actual user split deviates from the designed split — concepts/sample-ratio-mismatch (SRM) — every downstream metric comparison becomes statistically invalid, no matter how clean the results look.

But SRM:

  • Is easy to forget to check (it's a separate chi-squared test on the assignment counts, distinct from the metric comparisons being interpreted).
  • Doesn't announce itself — the dashboards still render.
  • Is endemic: peer companies report 6–10% of A/B tests affected; Zalando's historical rate was 20%+ (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution).
  • Often traces back to pipeline / tracking bugs, not randomization — so the test owner (product team) is not the right person to diagnose it.

Letting teams run SRM checks themselves just means most tests don't get checked. The platform has to do it.
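The check itself is cheap. A minimal two-variant sketch in pure Python — the counts and the 50/50 design here are hypothetical, where a real platform would pull them from assignment logs:

```python
import math

def srm_p_value(count_a: int, count_b: int, ratio_a: float = 0.5) -> float:
    """Chi-squared test (1 degree of freedom) of observed assignment
    counts against the designed split. Returns the p-value."""
    n = count_a + count_b
    expected_a = n * ratio_a
    expected_b = n * (1 - ratio_a)
    stat = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # For 1 df, the chi-squared survival function reduces to erfc.
    return math.erfc(math.sqrt(stat / 2))

# A 50.4% / 49.6% split looks harmless on a dashboard, but at one
# million users it is wildly improbable under a true 50/50 design:
p = srm_p_value(504_000, 496_000)
print(f"{p:.2e}")  # far below the 0.001 alert threshold
```

This is exactly why SRM "doesn't announce itself": the 0.4-point skew is invisible to a human scanning the dashboard but decisive to the test.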

Solution

Automate SRM detection and alerting as a first-class platform feature:

  1. Always run the SRM test over the collected assignment events of every running / completed experiment, comparing observed allocation vs designed allocation with a chi-squared (or equivalent) test.
  2. Set a low threshold (e.g. p < 0.001 on the chi-squared — SRM tests are run on every experiment, so a tight threshold is needed to control family-wise false-positive rate).
  3. Alert the affected team on detection — push notification to the experiment owner, not just a flag on a dashboard.
  4. Gate result publication — until the data investigation resolves the mismatch, the platform's dashboard refuses to confidently render the A/B comparison; it shows the SRM warning instead.
  5. Track SRM rate over time as a platform-health metric — a rising trend is a leading indicator of tracking-pipeline drift org-wide, not just a problem with one test.
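Wired together, the steps above might look like the following sketch. The function names, experiment record shape, and notification channel are all hypothetical; the p < 0.001 threshold is applied via standard chi-squared critical values so no stats library is needed:

```python
# Chi-squared critical values at p = 0.001, keyed by degrees of
# freedom (supports experiments with up to five variants).
CRITICAL_P001 = {1: 10.828, 2: 13.816, 3: 16.266, 4: 18.467}

def srm_detected(observed, designed):
    """Steps 1-2: chi-squared test of observed assignment counts
    against the designed allocation fractions, at p < 0.001."""
    n = sum(observed)
    stat = sum((o - n * f) ** 2 / (n * f) for o, f in zip(observed, designed))
    return stat > CRITICAL_P001[len(observed) - 1]

alerts = []  # stand-in for a real push-notification channel

def check_experiment(exp):
    if srm_detected(exp["counts"], exp["designed"]):
        # Step 3: push the alert to the owner, not just a dashboard flag.
        alerts.append((exp["owner"], f"SRM detected on {exp['id']}"))
        # Step 4: gate result publication until investigated.
        exp["results_gated"] = True
    return exp

exp = check_experiment({
    "id": "exp-42", "owner": "team-checkout",
    "counts": [504_000, 496_000], "designed": [0.5, 0.5],
})
print(exp["results_gated"])  # True
```

Step 5 then falls out for free: the fraction of experiments where `srm_detected` fires, tracked over time, is the platform-health metric.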

Zalando's applied case

Octopus (see systems/octopus-zalando-experimentation-platform) "automatically raises alerts to the affected team when sample ratio mismatch is detected. Further data investigation will be needed before analysis results are shown to users in the platform's dashboard." (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution).

Zalando's 20% SRM rate — roughly two to three times the industry's 6–10% — was diagnostic of an org-level data-tracking consistency problem, not an experimentation issue. Fixing the symptom (SRM-affected tests) required fixing the cause (cross-team schema drift in tracking events); Zalando pursued both. The platform-level SRM alert continues to catch residual cases.

Why this is a pattern, not just a feature

Any centralised experimentation platform big enough to need patterns/centralized-experimentation-platform will eventually re-discover SRM as its dominant trust problem. The response is predictable: auto-detect, alert, gate results. Platforms that don't automate this inevitably ship experiments whose results are wrong in ways nobody noticed — which destroys the trust the central platform was supposed to build.

Once the SRM alert is in place, add:

  • Pre-period SRM — run the same test on a pre-experiment window to catch randomization engine drift before an experiment even launches.
  • Differential filtering — surface the ratio of events dropped by downstream filters per variant.
  • Coverage gap — surface users who were assigned but never appeared in telemetry.

These turn into a data-quality dashboard that's part of the platform, not the individual team's concern.
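Two of these add-on signals reduce to simple per-variant ratios. A sketch, where the assigned-user ID sets and event counts are hypothetical stand-ins for real pipeline data:

```python
def variant_data_quality(assigned_ids, telemetry_ids, events_in, events_kept):
    """Per-variant data-quality signals for the add-on checks."""
    # Coverage gap: users assigned to this variant who never
    # appeared in telemetry at all.
    coverage_gap = len(assigned_ids - telemetry_ids) / len(assigned_ids)
    # Differential filtering: share of this variant's events dropped
    # by downstream filters; compare across variants to spot skew.
    drop_ratio = 1 - events_kept / events_in
    return coverage_gap, drop_ratio

gap, dropped = variant_data_quality({1, 2, 3, 4}, {1, 2, 3}, 1_000, 910)
print(gap, dropped)  # 0.25 coverage gap, ~9% of events dropped
```

Either ratio diverging between variants is the same class of evidence as SRM itself: the comparison groups are no longer what the design promised.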

Seen in
