CONCEPT Cited by 1 source
Automated Test Generation from Production Traffic¶
Definition¶
Automated test generation from production traffic is the discipline of producing a representative test set for an offline evaluation by sampling from production request / event logs rather than authoring handcrafted test cases.
The distinguishing property: the test set's coverage is empirically bounded by what real users actually did, which keeps it aligned with the production distribution and cheap to refresh.
(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)
Why teams adopt it¶
Handcrafted test cases have predictable failure modes:
- Stale coverage. Authored once, they drift from production behaviour as users change what they do.
- Author bias. The set reflects what the author thinks users do, not what they do.
- Cross-market transfer is hard. Tests written for one market can't be cheaply lifted to another — which is precisely the pre-launch case.
Zalando names the cost directly: "Manual tests are not very scalable and could be biased. Also it is hard to transfer written test cases from one market to another. To reduce this effort, we want to automate the test generation, while still being able to add or customise the test cases if we need to."
Shape of the pipeline (generalised)¶
- Production emits events into a substrate. (In Zalando's case, every processed search request publishes to Nakadi → Data Lake.)
- Offline OLAP pipeline consumes the events. It performs the clustering, deduplication, sampling, and ranking operations that the production hot path cannot afford (sketched after this list).
- Test-set materialisation. Output is a structured test set, versioned, with provenance (which queries came from which market, which time window).
- Optional transformation. Translation, rewriting, or synthetic augmentation on top of the sampled real queries — for cases where the production traffic doesn't directly cover the target scenario.
- Override hook for handcrafted additions. "while still being able to add or customise the test cases if we need to" — critical-but-rare scenarios can still be hand-authored and appended.
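The five steps compress to a small offline batch job. A minimal sketch in Python, under loud assumptions: the event schema, the `TestCase`/`TestSet` shapes, and the version scheme are illustrative rather than from the source, and the first-token cluster key is only a stand-in for Zalando's actual NER-tag clustering (see concepts/ner-clustered-query-sampling).

```python
# Minimal sketch of the generalised pipeline. All interfaces are assumptions:
# the event schema, TestCase/TestSet shapes, and the version scheme are
# illustrative; the clustering step is a stub, not Zalando's NER clustering.
import json
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class TestCase:
    query: str
    market: str          # provenance: which market the query came from
    cluster: str         # provenance: which cluster it was sampled from
    source_window: str   # provenance: which time window

@dataclass
class TestSet:
    version: str
    cases: list = field(default_factory=list)

def generate_test_set(events, window, per_cluster=5, handcrafted=()):
    """events: iterable of {'query': str, 'market': str} dicts read from the
    offline store (the Nakadi -> Data Lake equivalent)."""
    # 1. Deduplicate: many users issue the same query.
    seen, unique = set(), []
    for e in events:
        key = (e["query"].lower().strip(), e["market"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    # 2. Cluster. Stand-in: first token as the cluster key; the real pipeline
    #    clusters on NER-tag sets.
    clusters = defaultdict(list)
    for e in unique:
        clusters[(e["query"].split() or [""])[0]].append(e)
    # 3. Sample per cluster, largest clusters first (traffic-share order).
    cases = []
    for name, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
        for e in members[:per_cluster]:
            cases.append(TestCase(e["query"], e["market"], name, window))
    # 4. Override hook: append hand-authored cases for critical-but-rare scenarios.
    cases.extend(handcrafted)
    # 5. Materialise with a content-derived version string for provenance.
    digest = hashlib.sha256(
        json.dumps([asdict(c) for c in cases]).encode()
    ).hexdigest()[:12]
    return TestSet(version=f"{date.today()}-{digest}", cases=cases)
```

One design note: deriving the version from the content means two runs over identical traffic materialise the same test-set identity, which keeps downstream replays reproducible.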
Selection policies¶
- Traffic-share ranking. Popular scenarios first (Zalando's choice).
- Long-tail sampling. Uniform sampling across distinct-intent scenarios regardless of volume — better for low-frequency regression detection.
- Error-biased sampling. Preferentially sample queries that historically produced low CTR / high abandonment — useful when the goal is regression-in-known-bad-areas.
- Hybrid. Combinations of the above, weighted.
Zalando's policy is traffic-share with NER-clustering first — cluster then rank.
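The policies differ only in how they walk the clustered traffic. A hedged sketch over the `clusters` mapping from the pipeline sketch above (cluster name → list of query events); the `bad_ctr` signal and the hybrid weights are assumptions, not from the source:

```python
# Sketch of the four selection policies over a clusters mapping
# (cluster name -> list of query events). bad_ctr and the hybrid
# weights are illustrative assumptions.
import random

def traffic_share(clusters, n):
    """Popular scenarios first: rank clusters by volume, take the heads."""
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [c[0] for c in ranked[:n]]

def long_tail(clusters, n, seed=0):
    """Uniform over distinct-intent clusters, regardless of volume."""
    rng = random.Random(seed)
    picked = rng.sample(list(clusters.values()), min(n, len(clusters)))
    return [rng.choice(c) for c in picked]

def error_biased(clusters, n, bad_ctr):
    """Prefer clusters that historically performed badly.
    bad_ctr: cluster name -> historical CTR (lower = worse)."""
    ranked = sorted(clusters, key=lambda name: bad_ctr.get(name, 1.0))
    return [clusters[name][0] for name in ranked[:n]]

def hybrid(clusters, n, bad_ctr, weights=(0.6, 0.2, 0.2)):
    """Weighted combination; rounding may miss n by one, fine for a sketch."""
    a, b, c = (round(w * n) for w in weights)
    return (traffic_share(clusters, a)
            + long_tail(clusters, b)
            + error_biased(clusters, c, bad_ctr))
```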
Preconditions¶
- Observable production. Requests or events must be durably captured and queryable downstream of the hot path.
- PII / compliance review. Real-user traffic carries identity. An evaluation pipeline operating on it must respect retention, anonymisation, and jurisdictional constraints.
- Index for replay. Test-set generation is often one pass; the downstream evaluation may replay the set many times. The sampled set should be persistent and versioned (see the sketch below).
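For the replay precondition, a sketch of what "persistent and versioned" could look like on top of the `TestSet` from the pipeline sketch: an immutable directory per version plus a manifest recording provenance. Paths and manifest fields are illustrative assumptions.

```python
# Illustrative persistence for replay: write the sampled set once, immutably,
# with a manifest recording provenance so every downstream replay pins an
# exact version. Directory layout and manifest fields are assumptions.
import json
from dataclasses import asdict
from pathlib import Path

def materialise(test_set, out_dir="testsets"):
    root = Path(out_dir) / test_set.version
    root.mkdir(parents=True, exist_ok=False)  # immutable: never overwrite
    with open(root / "cases.jsonl", "w") as f:
        for case in test_set.cases:
            f.write(json.dumps(asdict(case)) + "\n")
    manifest = {
        "version": test_set.version,
        "n_cases": len(test_set.cases),
        "markets": sorted({c.market for c in test_set.cases}),
        "source_windows": sorted({c.source_window for c in test_set.cases}),
    }
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return root
```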
Related problem framings¶
- Record-and-replay testing for protocol compatibility (replaying captured traffic against a new implementation).
- Property-based testing from production corpora (using observed traffic to seed generators).
- Shadow-traffic comparison testing (running the new system on live-mirrored traffic) — orthogonal; that's online, this is offline.
Seen in¶
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — canonical wiki instance. Zalando samples existing-market production search traffic (via Nakadi + Data Lake) and clusters by NER-tag set to generate new-market test queries, with LLM-translation as the transformation step.
Related¶
- concepts/ner-clustered-query-sampling — the specific clustering strategy Zalando uses on top.
- concepts/pre-launch-market-validation — the downstream application.
- systems/zalando-search-query-clustering
- systems/nakadi — the event-bus substrate Zalando samples from.
- companies/zalando