Overall Evaluation Criterion (OEC)¶
Definition¶
The Overall Evaluation Criterion (OEC) is the single composite metric an A/B test (or family of tests) is decided on — the metric that, if it moves significantly in the right direction, authorises the launch, and if it doesn't, blocks it.
The term comes from the Kohavi / Microsoft experimentation literature. It is deliberately one metric per decision, not a dashboard: a dashboard invites post-hoc cherry-picking of whichever metric moved favourably.
Why OEC selection is hard¶
Choosing an OEC is one of the hardest parts of running A/B tests:
- Short-term proxies mislead. CTR or session length moves easily but doesn't always predict retention or LTV.
- Revenue is noisy. Short-window revenue is dominated by high-variance outliers; small effect sizes require huge samples.
- Team-scope matters. A team's OEC should be controllable by that team's product changes — otherwise moving it is luck, not cause.
- Gaming is inevitable. Any single metric, if incentivised, will be optimised at the expense of unmeasured ones.
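The "revenue is noisy" point can be made concrete with a standard two-sample power calculation. A minimal sketch (the revenue mean/variance figures are invented for illustration):

```python
from math import ceil
from statistics import NormalDist

def samples_per_arm(sigma, delta, alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute lift `delta` on a
    metric with standard deviation `sigma`, via a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Hypothetical revenue-per-user metric: mean around 5, but a long tail
# of big orders pushes the standard deviation to 50. Detecting a 1%
# lift (0.05 absolute) then needs roughly 15.7M users per arm.
print(samples_per_arm(sigma=50, delta=0.05))
```

The sample size scales with $(\sigma/\delta)^2$, which is why a high-variance metric combined with a small expected effect pushes the experiment beyond most sites' traffic.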
Zalando's 2021 retrospective (Source: sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution) explicitly names OEC / KPI selection as "a big pain point" for users — the motivation for Octopus's guideline work.
Zalando's qualitative guidelines¶
Zalando's Run-phase guidance for teams picking OECs:
- Team-specific — the KPI must be sensitive to the product the team controls. A team changing search ranking should choose an OEC that search ranking can plausibly drive, not a company-wide revenue number it has no reach into.
- Long-term-proxy over short-term-revenue — KPIs should be chosen as proxies for long-term customer lifetime value (LTV), not short-term revenue. Optimising a short-term metric may trade against the long-term objective.
Zalando plans to incorporate these guidelines into Octopus with scientifically proven methods — i.e. to move from qualitative documents to platform-enforced choices, with specific recommended OECs per team domain.
The relationship to LTV¶
A short-term metric $m$ is a useful OEC only if it is predictive of LTV: the implied long-term difference
$$E[\text{LTV} \mid m = m_t] - E[\text{LTV} \mid m = m_c]$$
(where $m_t$ and $m_c$ are the metric's values under treatment and control) has the same sign — and ideally the same magnitude ratio — as the true long-term treatment effect. Operationalising this requires historical data tying $m$ to LTV, which is a separate modelling problem from running the A/B test itself. See concepts/surrogacy-causal-inference for a formal treatment of short-term-surrogate → long-term-outcome inference.
OEC ≠ guardrail metrics¶
A test should have one OEC (deciding metric) plus many guardrail metrics (can't-regress metrics). Guardrails catch cases where the OEC moves favourably at unacceptable cost to adjacent concerns (latency, engagement with non-target surfaces, accessibility). See patterns/ab-test-rollout for Atlassian's percentile-guardrail implementation.
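The division of labour between the two kinds of metric can be written as a decision rule: the OEC alone decides, and guardrails can only veto. A hypothetical sketch (metric names and thresholds invented):

```python
def launch_decision(oec_lift, oec_pvalue, guardrails, alpha=0.05):
    """One OEC decides the launch; guardrail metrics can only veto it.

    `guardrails` maps a metric name to (observed_delta, worst_allowed),
    with deltas oriented so that more negative means worse.
    """
    if not (oec_lift > 0 and oec_pvalue < alpha):
        return "no-launch: OEC did not move significantly in the right direction"
    for name, (observed, worst_allowed) in guardrails.items():
        if observed < worst_allowed:
            return f"no-launch: guardrail {name!r} regressed past its floor"
    return "launch"

# OEC is up and significant, but one guardrail regressed past its floor,
# so the verdict is a no-launch naming that guardrail:
print(launch_decision(
    oec_lift=0.012, oec_pvalue=0.003,
    guardrails={
        "p95_latency_change": (-0.004, -0.010),     # within tolerance
        "non_target_engagement": (-0.080, -0.020),  # regressed: veto
    },
))
```

Note the asymmetry: guardrails never authorise a launch on their own, which is what keeps the decision a single-metric one rather than a dashboard.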
Seen in¶
- sources/2021-01-11-zalando-experimentation-platform-at-zalando-part-1-evolution — team-specific + LTV-proxy guidelines as Run-phase platform guidance