CONCEPT

Non-inferiority test

Definition

A non-inferiority test is a statistical hypothesis test designed to show that a treatment is not meaningfully worse than the control, within a pre-specified margin. It replaces the standard A/B test's question, "is the treatment better?", with a different one: "is the treatment close enough to the control that we can adopt it for non-statistical reasons (cost, simplicity, flexibility) without losing anything that matters?"

Formally:

- H0 (null): the treatment is worse than control by at least the non-inferiority margin Δ.
- H1 (alternative): the treatment is worse than control by less than Δ, or is equal to or better than control.

Rejecting H0 at a chosen significance level establishes non-inferiority within Δ — but does not establish superiority.
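As a minimal sketch of how these hypotheses translate into a calculation, the function below runs a one-sided z-test on summary statistics. It assumes a metric where higher is better, large samples (so the normal approximation is reasonable), and independent arms. The function name, its parameters, and the example numbers are hypothetical and are not taken from Octopus.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_z_test(mean_t, mean_c, sd_t, sd_c, n_t, n_c, margin, alpha=0.05):
    """One-sided z-test of H0: mean_t - mean_c <= -margin (higher is better).

    Hypothetical helper for illustration; uses a normal approximation.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    diff = mean_t - mean_c
    # Standard error of the difference between two independent sample means.
    se = sqrt(sd_t ** 2 / n_t + sd_c ** 2 / n_c)
    # Shift the null from 0 to -margin: large z means "clearly above -margin".
    z = (diff + margin) / se
    p_value = 1.0 - NormalDist().cdf(z)
    return {"diff": diff, "z": z, "p_value": p_value, "non_inferior": p_value < alpha}


# Illustrative numbers only: treatment is slightly worse, but within the margin.
result = non_inferiority_z_test(
    mean_t=9.9, mean_c=10.0, sd_t=5.0, sd_c=5.0, n_t=20_000, n_c=20_000, margin=0.2
)
print(result)
```

In this example the observed difference is negative, yet H0 is rejected because the difference sits far enough above −Δ. The result says nothing about whether the treatment is better than control.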

Why A/B programs need it

Classical A/B tests with a two-sided t-test (Octopus's default; see systems/octopus-zalando-experimentation-platform) answer "is there a difference?" in either direction. But many real launches ask a different question:

- Migration / rewrite: does the new system produce the same business outcome as the old one, within some tolerance? If yes, migrate regardless of whether it is marginally better.
- Simplification: can we remove feature X without hurting the OEC by more than Δ? If yes, remove it to save maintenance cost.
- Platform change: does switching the underlying ML model preserve user-facing metrics? Superiority is a bonus; equivalence is the pass bar.

Running a two-sided test and treating "not significantly different" as a pass is not the same as establishing non-inferiority: failing to reject the null is not evidence of equivalence, because the test may simply have been underpowered. A non-inferiority test commits to the margin Δ in advance and is sized to have adequate power to reject H0 when H1 is true.
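The sketch below illustrates the distinction with two hypothetical experiments that observe the same small negative difference but have very different standard errors. It uses the confidence-interval view of the test: non-inferiority is claimed only when the lower bound of a one-sided (1 − α) interval for treatment − control lies above −Δ. The helper name and all numbers are assumptions made for illustration, and a normal approximation is assumed.

```python
from statistics import NormalDist


def compare_conclusions(diff, se, margin, alpha=0.05):
    """Contrast a two-sided 'any difference?' test with a non-inferiority check.

    diff is treatment - control (higher is better); se is its standard error.
    """
    norm = NormalDist()
    # Two-sided p-value for H0: diff == 0.
    two_sided_p = 2.0 * (1.0 - norm.cdf(abs(diff) / se))
    # Lower bound of the one-sided (1 - alpha) confidence interval for diff.
    lower_bound = diff - norm.inv_cdf(1.0 - alpha) * se
    return {
        "two_sided_significant": two_sided_p < alpha,
        "lower_bound": lower_bound,
        "non_inferior": lower_bound > -margin,
    }


# Same observed difference, very different precision (illustrative values).
underpowered = compare_conclusions(diff=-0.05, se=0.5, margin=0.2)
well_powered = compare_conclusions(diff=-0.05, se=0.05, margin=0.2)
print(underpowered)
print(well_powered)
```

Both experiments come out "not significantly different" on the two-sided test, but only the precise one can rule out a loss larger than Δ. The imprecise one is compatible with a loss several times the margin.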

Δ is the key design choice

Choosing Δ is a product judgement, not a statistical one: "what effect size is business-meaningful?" A Δ that is too large makes non-inferiority trivial to establish; a Δ that is too small makes it unattainable at feasible sample sizes. Zalando calls out non-inferiority tests as an improvement area identified in peer review (Source: ).
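To make the sample-size side of this trade-off concrete, here is a rough sketch based on the standard normal-approximation formula for a one-sided comparison of two means with equal arm sizes and a common standard deviation: n per arm ≈ 2σ²(z₁₋α + z_power)² / (Δ + δ)², where δ is the assumed true difference (treatment − control). The function name, defaults, and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(sd, margin, true_diff=0.0, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a one-sided non-inferiority test.

    Assumes equal arm sizes, a shared standard deviation `sd`, and a normal
    approximation. `true_diff` is the assumed real treatment - control effect.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha)
    z_power = norm.inv_cdf(power)
    # Distance between the assumed truth and the edge of the null region.
    distance = margin + true_diff
    if distance <= 0:
        raise ValueError("assumed true difference is not inside the margin")
    return ceil(2.0 * sd ** 2 * (z_alpha + z_power) ** 2 / distance ** 2)


# Halving the margin roughly quadruples the required sample size.
for margin in (0.4, 0.2, 0.1):
    print(margin, users_per_arm(sd=5.0, margin=margin))
```

Because the required sample size grows with 1/Δ², a margin chosen without regard to available traffic can make the test infeasible, which is one reason to settle Δ before the experiment starts.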

Relation to Bayesian analysis and equivalence testing

The same decision problem can be framed in Bayesian terms ("the posterior probability that treatment − control > −Δ exceeds a threshold") or as a two one-sided tests (TOST) equivalence test (both treatment − control < +Δ and treatment − control > −Δ). Non-inferiority is the one-sided form that matches the "acceptable if not worse" product framing.
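The three framings can be placed side by side in a small sketch. It assumes a normal approximation for the observed difference and, for the Bayesian number, a flat prior on the true difference, so that the posterior is simply normal around the observed difference. Under those assumptions the posterior probability of non-inferiority is one minus the non-inferiority p-value; a real Bayesian analysis with an informative prior would differ. The function name and inputs are hypothetical.

```python
from statistics import NormalDist


def one_sided_views(diff, se, margin):
    """Non-inferiority, TOST, and flat-prior Bayesian views of one result.

    diff is treatment - control (higher is better); se is its standard error.
    """
    norm = NormalDist()
    # Test 1, H0: diff <= -margin (this alone is the non-inferiority test).
    p_lower = 1.0 - norm.cdf((diff + margin) / se)
    # Test 2, H0: diff >= +margin (TOST adds this second one-sided test).
    p_upper = norm.cdf((diff - margin) / se)
    return {
        "non_inferiority_p": p_lower,
        # TOST claims equivalence only if both one-sided tests reject.
        "tost_p": max(p_lower, p_upper),
        # Flat-prior posterior: true diff ~ Normal(diff, se), so this is
        # P(true diff > -margin | data).
        "posterior_prob_non_inferior": norm.cdf((diff + margin) / se),
    }


# A treatment that looks clearly better (illustrative values).
print(one_sided_views(diff=0.15, se=0.05, margin=0.2))
```

With these inputs the treatment passes the non-inferiority test but fails TOST equivalence at the 5% level, because the data cannot rule out an improvement larger than Δ. That asymmetry is what makes the one-sided form the natural fit when being better is acceptable.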

Seen in

— identified in peer review as an improvement area over Octopus's default two-sided t-test