ZALANDO 2024-07-18

Zalando — End-to-end test probes with Playwright¶

Summary¶

Fabien Lavocat (Zalando Engineering, 2024-07-18) documents how Zalando extended its automated end-to-end testing investment into a new primitive: end-to-end test probes, a small set of Playwright scenarios that run on a 30-minute cron job against live production, cover the most critical customer journeys, and page the on-call team when they fail. The post is anchored in a concrete production lesson. Zalando's existing Cypress-based CI/CD pipeline tests (≈95 % success rate over 120 builds/day) did not catch a major React-hydration regression on product detail pages, caused by new headless-CMS content breaking the contract between the front end and the API gateway. The regression broke size selection and add-to-cart, and "was large enough to have a business impact, but not just [sic] enough to trigger an automated alert." The fix was not more CI tests but a new class of monitoring: periodic Playwright runs that treat the test suite as a synthetic probe of the real production system. To make that viable without generating pager fatigue, Zalando had to (a) reduce the scenario count dramatically to raise reliability past the CI baseline, (b) pick Playwright over Cypress for its auto-wait / auto-retry / tracing primitives, and (c) run the probes in email-only "shadow" mode for weeks until they produced zero false positives, then flip to paging. This shadow-mode alert validation discipline extends Zalando's established symptom-based alerting stack with a new symptom source.

Key takeaways¶

CI/CD e2e tests are not a monitor for live production. Zalando's existing Cypress-based automated QA pipeline ran on every release (≈120 builds/day, multiple releases/day across the Zalando website, at ~95 % success rate after years of investment). The pipeline caught many bugs pre-release, but by design only exercised "newly built" code — not the running production system as real users see it, which is affected by external factors (headless-CMS content updates, API-gateway contract drift, third-party failures) that no amount of CI rigor can fence against. The 2024 product-detail-page regression was exactly this shape: "new and incomplete content published to our headless CMS which broke the front-end API contract with our API gateway and ultimately led to broken interactivity. [...] React error boundaries [...] weren't working for the eagerly-hydrated part of our product pages." (Source: this article.) Canonical wiki framing: concepts/end-to-end-test-probe — e2e tests repurposed as an external, periodic, black-box monitor of the live system are a distinct primitive from CI/CD e2e tests, even if they share a framework.
Investment quantified: 80 % → 95 % reliability still isn't enough for pager-grade probes. Zalando's Cypress suite started at "around 80 %" success rate, meaning that on 120 builds/day there was "an average of 24 builds a day which were failing as false positives, causing unnecessary friction." Named remediations that raised it to "the 95 % range" over a multi-year investment: (a) better test setup context for Zalando's highly dynamic, contextual product pages (including products not yet released, for which add-to-cart wasn't triggerable); (b) improved selectors; (c) a hydration-detection mechanism so Cypress wouldn't eagerly execute scripts against a non-interactive UI (a Cypress-specific gap; see concepts/react-hydration). Yet 95 % is still too low at probe altitude: a 30-minute cron means 48 runs/day, and a 5 % failure rate yields ≈2.4 false-positive pages/day. The post states the consequence directly: "if we were to run them every 30 minutes to ensure that our website was working as expected. If we were to page our on-call team upon failures, alerts would trigger several times a day and possibly at night, leading to incident fatigue for the on-call team." See concepts/alert-fatigue.
Simplification as the reliability lever. Zalando's explicit resiliency argument is that fewer scenarios → fewer false positives, not "more retry logic". The probe suite was scoped to a small number of critical customer journeys at the time of publication — named in the post as: (a) land on home page → navigate to gender page → click a product; (b) land on catalog page → apply a filter → click a product; (c) land on product page → select a size → add to cart → start checkout. "By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives." Canonical pattern: patterns/scenario-minimalism-for-probe-reliability — scope the probe to the CBO catalog, not the test matrix. See concepts/test-reliability-through-simplification.
Playwright chosen over Cypress for resiliency primitives. A talk on scaling e2e testing, given at the internal Zalando Engineering Conference and referenced in this post, seeded the framework choice. Zalando names five Playwright features as the deciding factors for probe-altitude use: "(a) auto-wait (no artificial timeouts); (b) auto-retry (web assertions), eliminating key causes for flaky tests; (c) rich tooling options (tracing, time-travel) to debug and fix issues if failures occur; (d) a unified API which works across all modern browsers; (e) Typescript out of the box." The auto-wait primitive directly replaces the Cypress-era hydration-detection mechanism. Load-bearing construct: the Playwright Locator auto-wait abstraction (wait for attached → visible → stable → enabled before interacting) makes probe code stable against async rendering without explicit waitForX calls sprinkled through test code. Zalando's catalog probe example confirms this: it includes one explicit page.waitForTimeout(1000) "to simulate 'real user behavior'", annotated "with playwright this is not necessary"; every other wait is implicit.
Shadow mode before paging: a documented discipline. Zalando did not turn paging on when the probe went live. Instead: "We set up the tests to run on a 30 minute cron job and instead of paging immediately when they failed, we created a low-priority alert that emailed the team to validate their reliability using a 'shadow' mode. And it did trigger a couple of times, especially over the weekend. Each time we captured HTML reports as logs so that we could understand the issue, improve our selectors, implement local retry loops with expect.toPass, and even cover tricky edges with selectors targeting non-visible content thanks to Playwright's automatic augmentation of pseudo-classes like :visible. After a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed." Canonical pattern: patterns/shadow-mode-alert-before-paging — gate every new alerting source through an email-only validation period that ends only when false-positive rate drops to zero, and use the captured HTML reports / traces / logs from shadow-mode triggers as the iteration signal to fix selectors, add local retries, and handle edge cases. Zalando's shadow-mode period used HTML reports as the primary debugging artifact, per-failure local retry via expect.toPass as the next-layer reliability fix, and Playwright's CSS pseudo-class augmentation (e.g. :visible on non-visible content) as the tricky-edge fix. After promotion to paging: "So far they have only paged us once, and that was during an incident where the page was actually not working." — 0 % false positive rate in production. See concepts/shadow-mode-alert-validation.
E2E test probes as a new CBO symptom source. Zalando's existing Adaptive Paging / Symptom-Based Alerting stack (from 2019, axis 5 of the wiki) alerts on CBO error rate computed from distributed traces. The 2024 product-detail-page incident revealed a blind spot in that stack: the CBO error rate stayed under the alerting threshold because the failure was one of interactivity (React hydration crashed → size selector non-functional → add-to-cart unreachable), not a backend 5xx or a tracing-error tag. The probe fills that blind spot: each probe scenario is effectively a CBO, and a probe failure is a CBO-level symptom that complements trace-derived signals. The post names the planned expansion explicitly: "We are planning to increase the number of scenarios for the end-to-end probes to include more of our Critical Business Operations (CBOs) and we also [are] looking at extending this idea to our mobile apps." This ties axis 5 of the Zalando wiki narrative (Cyber-Week prep → SRE evolution → operation-based SLOs) into a new downstream chapter: the probe as an external, browser-altitude CBO symptom source complementing the internal trace-derived one.
React hydration as a class of bugs that existing defenses miss. The post identifies a React hydration crash (concepts/react-hydration) on the "eagerly-hydrated" part of the product page as the specific failure mode that slipped past all of Zalando's existing defenses. Named contributing factors: (a) React error boundaries don't catch eager-hydration crashes in the same way they catch render errors; (b) CI e2e tests run against a stable test fixture, not live CMS content; (c) service-level monitors don't see front-end interactivity failures when the HTTP 200 still flows; (d) the problem was a contract failure between CMS and API gateway surfacing only at hydration time. This reinforces the post's framing that front-end interactivity is a monitoring gap at the intersection of SSR, hydration, and external content sources, a gap that browser-altitude e2e test probes close.
The probe shape generalises: small-N scenarios + auto-stable framework + shadow-mode validation. The post's closing advice to peers, "keep these tests focused on your critical customer journeys, write good selectors and iterate in a shadow mode before alerting in production", distils the three load-bearing choices into a portable recipe: (a) scope to CBOs, not test-matrix completeness; (b) pick a framework whose core abstractions handle auto-wait / auto-retry / async-rendering at the framework altitude; (c) gate every new probe behind a shadow-mode validation period. The recipe is independent of Playwright specifically; the core insight is the separation of probe-altitude e2e tests from CI-altitude e2e tests, with different success criteria, different scopes, and different operational semantics.
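The third journey named above (product page → select a size → add to cart → start checkout) could be sketched as a Playwright test along the following lines. This is a minimal illustration, not Zalando's code: the URL, button labels and `data-testid` value are hypothetical placeholders, and the 15-second retry budget is an assumption. Only the Playwright calls themselves (`getByRole`, `locator`, `expect(...).toPass`, the `:visible` pseudo-class) are real API.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical URL: the post does not publish the product or selectors it probes.
const PRODUCT_URL = 'https://shop.example.com/product/example-sneaker';

test('probe: product page -> select size -> add to cart -> checkout', async ({ page }) => {
  await page.goto(PRODUCT_URL);

  // Locator actions auto-wait until the element is actionable,
  // so no explicit sleep or hydration check is written here.
  await page.getByRole('button', { name: 'Choose your size' }).click();

  // Local retry loop: the block is re-run until it passes or the timeout expires.
  await expect(async () => {
    // `:visible` restricts the match to options that are actually rendered.
    await page.locator('[data-testid="size-option"]:visible').first().click();
    await expect(page.getByRole('button', { name: 'Add to bag' })).toBeEnabled();
  }).toPass({ timeout: 15_000 });

  await page.getByRole('button', { name: 'Add to bag' }).click();
  await page.getByRole('link', { name: 'Go to checkout' }).click();

  // Web-first assertion: retried automatically until it holds or times out.
  await expect(page).toHaveURL(/checkout/);
});
```

Keeping the scenario this short reflects the scoping argument: each extra interaction is another place for a false positive to originate.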

Operational numbers¶

Metric	Value	Note
CI builds/day (Cypress)	~120	Every release goes through CI e2e tests
Cypress start reliability	~80 %	≈24 false-positive failures/day
Cypress post-investment reliability	~95 %	After selector + hydration-detection + test-context fixes
Probe cadence	30 minutes	Cron-triggered Playwright run
Probe scenarios at publication	3 named	Home→gender→product; catalog→filter→product; product→size→cart→checkout
Shadow-mode duration	"a few weeks"	Email-only alerts; iterate on selectors + `expect.toPass` retries
Post-paging false-positive rate	0 %	"Only paged us once, and that was during an incident where the page was actually not working."
Planned expansion	More CBOs + mobile apps	Explicitly listed in Outlook section
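
The failure counts in the table follow from simple arithmetic: expected false failures per day equal runs per day times the failure rate. A minimal TypeScript sketch, assuming (as the figures above do) that every failure below the quoted success rate is a false positive; the function and variable names are illustrative and not from the post:

```typescript
// Expected false-positive failures per day for a suite that runs
// `runsPerDay` times at a given success rate (in percent).
// Assumption: every failure counts as a false positive.
function falseFailuresPerDay(runsPerDay: number, successPercent: number): number {
  return (runsPerDay * (100 - successPercent)) / 100;
}

// A 30-minute cron fires twice an hour, i.e. 48 runs per day.
const probeRunsPerDay = (24 * 60) / 30;

const ciAtStart = falseFailuresPerDay(120, 80);    // CI suite at ~80 %
const ciAfterFixes = falseFailuresPerDay(120, 95); // CI suite at ~95 %
const probeAtCiReliability = falseFailuresPerDay(probeRunsPerDay, 95);

console.log({ ciAtStart, ciAfterFixes, probeAtCiReliability });
```

Under these assumptions the CI suite produced about 24 false failures a day at the start and about 6 after the investment, while a probe running at the same 95 % reliability would page roughly 2.4 times a day, which matches the post's "several times a day".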

Architectural context and caveats¶

Cypress is not deprecated. The post is clear that Cypress remains in the CI pipeline — "each release goes through an automated quality assurance pipeline that includes end-to-end tests written with Cypress." Playwright is added for the probe tier, not substituted for Cypress. The framework split reflects tier-specific requirements: CI needs wide coverage + per-commit execution, probes need narrow scope + extreme reliability.
An internal conference talk seeded the choice. Zalando's internal Engineering Conference (Aug 2023; covered in the skipped 2024-06-02 post) included a talk on scaling e2e testing with Playwright. This is a small but notable cross-axis connection: internal technical conferences surface framework choices that later show up in production.
No INP / FID / LCP numbers tied to probe success. The post does not correlate probe-pass rate with business or web-vitals metrics. The paging condition is binary (probe passes / fails); nuance-based thresholds (e.g. page if p95 probe time > X) are not in scope at publication.
The post does not disclose the probe scope's growth curve. At publication, three named scenarios. The outlook names "more CBOs" but does not specify a target count, growth rate, or which CBOs go next.
No cost or runtime numbers disclosed. Cron cadence (30 min), but not per-run cost, wall-clock duration, or the queue depth of parallel browser instances needed to keep the cadence across all scenarios.
Mobile probes flagged as future work. "We also [are] looking at extending this idea to our mobile apps." — no architectural details on how the probe shape translates from Playwright-on-browsers to the native-mobile altitude.
expect.toPass usage scope not quantified. Post names the per-assertion retry primitive as a shadow-mode fix but doesn't say how many scenarios use it, or what the max retry count is.
Selectors targeting non-visible content: the post mentions using Playwright's automatic augmentation of CSS pseudo-classes like :visible as a selector edge-case fix, but doesn't name the specific selectors or scenarios that needed it. The Playwright docs reference published in the post has a stray trailing bracket (]), noted here for source fidelity.
No impact numbers or full postmortem for the 2024 hydration incident. The post names the incident class (a hydration crash caused by headless-CMS content breaking the API-gateway contract) but gives neither numbers for the incident (how many users were affected, duration, revenue impact) nor a full postmortem of the headless-CMS / API-gateway boundary.
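
The shadow-mode gating and the binary paging condition described in this page can be sketched as a small routing function. This is a hedged illustration in plain TypeScript: the type names, the `readyToPromote` rule and the 14-day quiet period are hypothetical, since the post only says that paging was enabled after shadow-mode alerts stopped for "a few weeks".

```typescript
type ProbeResult = { scenario: string; passed: boolean; finishedAt: Date };
type AlertMode = 'shadow' | 'paging';
type AlertAction = 'none' | 'email' | 'page';

// Binary condition: a passing probe never alerts; a failing probe emails
// the team in shadow mode and pages on-call once the probe is promoted.
function routeAlert(result: ProbeResult, mode: AlertMode): AlertAction {
  if (result.passed) return 'none';
  return mode === 'shadow' ? 'email' : 'page';
}

// Hypothetical promotion rule: the history must reach back at least
// `quietDays`, and no probe may have failed within that window.
function readyToPromote(history: ProbeResult[], now: Date, quietDays: number): boolean {
  const cutoff = now.getTime() - quietDays * 24 * 60 * 60 * 1000;
  const coversQuietPeriod = history.some((r) => r.finishedAt.getTime() <= cutoff);
  const recentFailure = history.some((r) => !r.passed && r.finishedAt.getTime() > cutoff);
  return coversQuietPeriod && !recentFailure;
}

// Illustrative data only; dates and scenario name are made up.
const now = new Date('2024-07-18T00:00:00Z');
const history: ProbeResult[] = [
  { scenario: 'product-to-checkout', passed: false, finishedAt: new Date('2024-06-15T03:00:00Z') },
  { scenario: 'product-to-checkout', passed: true, finishedAt: new Date('2024-07-17T23:30:00Z') },
];
const failing: ProbeResult = { scenario: 'product-to-checkout', passed: false, finishedAt: now };

const duringShadow = routeAlert(failing, 'shadow');
const afterPromotion = routeAlert(failing, 'paging');
const promote = readyToPromote(history, now, 14);

console.log({ duringShadow, afterPromotion, promote });
```

The same failing result is routed to email during shadow mode and to the pager after promotion, so switching modes changes only the alert channel, not the probe itself.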

Source¶

systems/playwright — the probe runtime; chosen over Cypress for auto-wait / auto-retry / tracing primitives.
systems/cypress — Zalando's pre-existing CI/CD e2e test framework; complements Playwright at a different tier.
systems/react — the front-end framework whose hydration failure mode motivated the probe tier's existence.
concepts/end-to-end-test-probe — canonical concept: e2e test reused as a periodic external black-box monitor of the live production system.
concepts/flaky-test — the pain class the probe design works around.
concepts/shadow-mode-alert-validation — Zalando's gating discipline for promoting a new alerting source.
concepts/test-reliability-through-simplification — resiliency-by-scope-reduction as the primary probe-reliability lever.
concepts/react-hydration — the interactivity-failure class that motivated the probe.
concepts/critical-business-operation — probe scenarios are CBOs at browser altitude.
concepts/alert-fatigue — the load-bearing pager-sustainability constraint.
concepts/symptom-based-alerting — Zalando's alerting strategy; probes are a new symptom source.
patterns/e2e-test-as-synthetic-probe — the core pattern.
patterns/shadow-mode-alert-before-paging — the gating pattern.
patterns/scenario-minimalism-for-probe-reliability — the scoping pattern.
patterns/scheduled-cron-triggered-load-test — sibling pattern at the load-test altitude (2021-03-01 Zalando post); same cron-triggered-declarative-test shape applied to a different problem.
companies/zalando