
CONCEPT

Overall Evaluation Criterion (OEC)

Definition

The Overall Evaluation Criterion (OEC) is the single composite metric on which an A/B test (or family of tests) is decided: if it moves significantly in the right direction, the launch is authorised; if it doesn't, the launch is blocked.

The term comes from the Kohavi / Microsoft experimentation literature. It is deliberately one metric per decision, not a dashboard: a dashboard invites post-hoc cherry-picking of whichever metric moved favourably.

Why OEC selection is hard

Choosing an OEC is one of the hardest parts of running A/B tests:

  • Short-term proxies mislead. CTR or session length moves easily but doesn't always predict retention or LTV.
  • Revenue is noisy. Short-window revenue is dominated by high-variance outliers; small effect sizes require huge samples.
  • Team-scope matters. A team's OEC should be controllable by that team's product changes — otherwise any movement in it is luck, not a causal effect of the team's work.
  • Gaming is inevitable. Any single metric, if incentivised, will be optimised at the expense of unmeasured ones.
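The "revenue is noisy" point can be made concrete with a standard two-sample size calculation: required sample per arm scales with variance over squared effect size. The numbers below (standard deviations, lifts) are hypothetical illustrations, not figures from the source.

```python
import math

def samples_per_arm(sigma: float, delta: float,
                    z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Per-arm sample size for a two-sided z-test at alpha=0.05, 80% power:
    n = 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Hypothetical comparison: CTR (low variance, decent absolute lift)
# vs. 7-day revenue per user (heavy-tailed, tiny lift relative to spread).
n_ctr = samples_per_arm(sigma=0.30, delta=0.01)   # CTR: sd 0.30, +1pp lift
n_rev = samples_per_arm(sigma=25.0, delta=0.10)   # revenue: sd $25, +$0.10 lift

print(n_ctr)  # tens of thousands of users per arm
print(n_rev)  # close to a million users per arm
```

The revenue metric needs roughly 70× more users than CTR here, which is why short-window revenue rarely works as a directly tested OEC.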

Zalando's 2021 retrospective (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution) explicitly names OEC / KPI selection as "a big pain point" for users — the motivation for Octopus's guideline work.

Zalando's qualitative guidelines

Zalando's Run-phase guidance for teams picking OECs:

  1. Team-specific — the KPI must be sensitive to the product the team controls. A team changing search ranking should choose an OEC that search ranking can plausibly drive, not a company-wide revenue number it has no reach into.

  2. Long-term-proxy over short-term-revenue — KPIs should be chosen as proxies for long-term customer lifetime value (LTV), not short-term revenue. Optimising a short-term metric may trade against the long-term objective.

Zalando plans to incorporate these guidelines into Octopus with scientifically proven methods — i.e. move from qualitative documents to platform-enforced choices with specific recommended OECs per team domain.

The relationship to LTV

A short-term metric $m$ is a useful OEC only if it is predictive of LTV: the model-implied difference

$$E[\text{LTV} \mid m = m_t] - E[\text{LTV} \mid m = m_c]$$

(where $m_t$ and $m_c$ are the metric's observed values in treatment and control) must have the same sign, and ideally a similar magnitude ratio, as the true long-term treatment effect. Operationalising this requires historical data tying $m$ to LTV, which is a separate modelling problem from running the A/B test itself. See concepts/surrogacy-causal-inference for a formal treatment of short-term-surrogate → long-term-outcome inference.
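A minimal sketch of the modelling step, assuming historical per-user pairs of the short-term metric and realised LTV are available (all data below is invented for illustration): fit a one-variable linear model on history, then translate the experiment's metric delta into a predicted LTV delta.

```python
from statistics import mean

def fit_linear(ms: list[float], ltvs: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for LTV ~ m on historical data."""
    mbar, lbar = mean(ms), mean(ltvs)
    slope = (sum((m - mbar) * (l - lbar) for m, l in zip(ms, ltvs))
             / sum((m - mbar) ** 2 for m in ms))
    return slope, lbar - slope * mbar

# Hypothetical historical data tying the short-term metric to observed LTV.
hist_m = [0.10, 0.12, 0.15, 0.20, 0.22]
hist_ltv = [40.0, 44.0, 51.0, 60.0, 66.0]
slope, intercept = fit_linear(hist_m, hist_ltv)

# Experiment readout: mean metric value in treatment vs. control.
m_t, m_c = 0.18, 0.16
# Model-implied E[LTV | m = m_t] - E[LTV | m = m_c]
predicted_ltv_effect = slope * (m_t - m_c)
print(predicted_ltv_effect > 0)
```

The metric is a usable surrogate only to the extent that the sign of `predicted_ltv_effect` tracks the true long-term treatment effect; a real implementation would also quantify uncertainty in the fitted relationship.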

OEC ≠ guardrail metrics

A test should have one OEC (deciding metric) plus many guardrail metrics (can't-regress metrics). Guardrails catch cases where the OEC moves favourably at unacceptable cost to adjacent concerns (latency, engagement with non-target surfaces, accessibility). See patterns/ab-test-rollout for Atlassian's percentile-guardrail implementation.
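The one-OEC-plus-guardrails decision rule can be sketched as follows. This is a hypothetical minimal rule, not Atlassian's or Zalando's implementation: any statistically significant guardrail regression blocks the launch, and only then is the OEC consulted.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    delta: float    # estimated treatment effect, oriented so negative = regression
    p_value: float  # two-sided p-value

def launch_decision(oec: MetricResult,
                    guardrails: dict[str, MetricResult],
                    alpha: float = 0.05) -> str:
    """One deciding metric, many can't-regress metrics."""
    for name, g in guardrails.items():
        if g.delta < 0 and g.p_value < alpha:
            return f"blocked: guardrail '{name}' regressed"
    if oec.delta > 0 and oec.p_value < alpha:
        return "launch"
    return "no launch: OEC did not move significantly"

decision = launch_decision(
    oec=MetricResult(delta=0.8, p_value=0.01),
    guardrails={
        "latency_p90": MetricResult(delta=-0.2, p_value=0.30),
        "accessibility_errors": MetricResult(delta=0.0, p_value=0.90),
    },
)
print(decision)  # "launch": OEC significant, no significant guardrail regression
```

Note the asymmetry: the OEC must move significantly to authorise a launch, while guardrails only need to *not* significantly regress — they never authorise anything on their own.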
