
Test data generation for edge cases

Definition

Test data generation for edge cases is the workflow of creating synthetic dev-environment test fixtures that mimic the long-tail input distributions seen only in production. It's the dev-side complement to production data diversity — acknowledging that the dev environment's fixture data is systematically narrower than production's input distribution, and doing something about it.

The problem, stated

From Yelp's 2025-05-27 post:

"Since the development environments have limited data, the different edge cases that occur in production could not be covered during dev testing. This was discovered when the data pipeline was executed in production for the first time."

The gap is specifically between:

  • Dev fixtures — a small, manually curated set of records that passes obvious tests.
  • Production data — the full messy distribution of products, discount combinations, billing schedules, promotional overrides, expired records, and one-off custom contracts that accumulate over years of a live business.

This is the same gap that concepts/production-data-diversity names for ML training data — just applied here to deterministic data pipeline testing rather than model training.

The workflow

  1. The staging pipeline runs on production data.
  2. An integrity check catches an edge case (e.g. a contract with no equivalent invoice).
  3. An engineer diagnoses the edge case (e.g. a manually-billed product type).
  4. The engineer creates new dev-environment records mimicking the production edge case (manual; Yelp names this as tedious).
  5. The dev test suite now covers this edge case for future code changes.

The discipline: every edge case found in prod becomes a dev fixture. Over time the dev fixture set grows to approximate the production distribution's coverage of known failure modes.
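A minimal sketch of what a backported fixture looks like, using the manually-billed-contract example from the workflow above. The schema, field names, and `reconcile` helper are all hypothetical simplifications, not Yelp's actual tables:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, radically simplified schema: a contract and its (optional) invoice.
@dataclass
class Contract:
    contract_id: str
    product_type: str
    amount_cents: int

@dataclass
class Invoice:
    invoice_id: str
    contract_id: str
    amount_cents: int

# Dev fixture backported from the production edge case: a manually-billed
# product type that legitimately has no invoice row.
EDGE_CASE_FIXTURES = {
    "manually_billed_no_invoice": (
        Contract("C-9001", product_type="manual_billing", amount_cents=50_000),
        None,  # no matching invoice -- this absence is the trigger condition
    ),
}

def reconcile(contract: Contract, invoice: Optional[Invoice]) -> bool:
    """Toy reconciliation rule: manually-billed contracts may lack an invoice."""
    if invoice is None:
        return contract.product_type == "manual_billing"
    return contract.amount_cents == invoice.amount_cents

# The dev suite now exercises this edge case on every code change.
contract, invoice = EDGE_CASE_FIXTURES["manually_billed_no_invoice"]
assert reconcile(contract, invoice)
```

The point of the named fixture key is traceability: each entry maps back to a specific production incident, so the fixture set doubles as a catalogue of known failure modes.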

Why this is manual (and a known pain point)

Yelp's post explicitly flags this as a maintenance burden:

"The large number of database tables required as input to the pipeline made this process very tedious as it involved manual creation of data points."

The constraint is schema width: if a pipeline consumes dozens of tables and a single edge case involves a coherent state across all of them, creating that fixture by hand is error-prone and slow. Yelp flags automated test data generation as a concrete future improvement.
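One shape the automated improvement could take is a factory that, given an edge-case spec, emits rows for every input table with consistent foreign keys. The table names and columns below are invented for illustration; the real pipeline spans dozens of tables, which is exactly what makes doing this by hand tedious:

```python
import itertools

# Monotonic counter so generated fixtures never collide on IDs.
_ids = itertools.count(1)

def make_edge_case_rows(product_type: str, has_invoice: bool) -> dict:
    """Generate a coherent fixture (consistent foreign keys) across related tables."""
    cid = f"C-{next(_ids):04d}"
    rows = {
        "accounts":  [{"account_id": f"A-{cid}", "status": "active"}],
        "contracts": [{"contract_id": cid, "account_id": f"A-{cid}",
                       "product_type": product_type}],
        "invoices":  [],
    }
    if has_invoice:
        rows["invoices"].append({"invoice_id": f"I-{cid}", "contract_id": cid})
    return rows

fixture = make_edge_case_rows("manual_billing", has_invoice=False)
# Foreign keys line up across tables, and the invoices table is empty on purpose.
assert fixture["contracts"][0]["account_id"] == fixture["accounts"][0]["account_id"]
assert fixture["invoices"] == []
```

The win over manual creation is that referential integrity is enforced by construction: an engineer specifies only the dimensions that define the edge case, and the factory fills in the coherent surrounding state.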

Complementary patterns

Test data generation for edge cases doesn't stand alone. It pairs with:

  • concepts/staging-pipeline — the discovery mechanism. Without the staging pipeline surfacing production edge cases, you can't backport them.
  • concepts/data-integrity-checker — the diagnosis mechanism. Without integrity checkers flagging specific discrepancies, the "edge case" is just unexplained drift.
  • Production-data-sampled fixtures — an alternative where sanitised production data (subset + PII-stripped) seeds dev fixtures directly. Yelp's approach is synthesis rather than sampling.
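For contrast with Yelp's synthesis approach, the sampling alternative in the last bullet can be sketched as a sanitisation pass over production rows. The PII column names are assumptions for illustration; hashing rather than deleting keeps pseudonyms stable so cross-table joins in the sampled subset still work:

```python
import hashlib

# Assumed PII columns -- a real deployment would derive this list from a schema audit.
PII_FIELDS = {"email", "phone", "business_name"}

def sanitise_row(row: dict) -> dict:
    """Replace PII values with stable pseudonyms; leave join keys and amounts intact."""
    out = {}
    for key, value in row.items():
        if key in PII_FIELDS and value is not None:
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

prod_row = {"contract_id": "C-1", "email": "owner@example.com", "amount_cents": 500}
dev_row = sanitise_row(prod_row)
assert dev_row["contract_id"] == "C-1" and dev_row["amount_cents"] == 500
assert dev_row["email"] != prod_row["email"]
```

Sampling buys coverage of edge cases nobody has diagnosed yet, at the cost of the governance work (subsetting, PII policy) that pure synthesis avoids.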

Caveats

  • Coverage is still bounded. Backporting known edge cases doesn't protect against unknown ones. Dev tests catch regressions on previously-found edge cases; new edge cases still surface first in prod / staging.
  • Not a substitute for prod verification. A robust dev test suite + robust prod verification is the right shape; skipping prod verification because "dev covers it" is the failure mode the discipline tries to avoid.
  • Synthesis must preserve the edge case's essence. A fixture that's technically similar but misses the trigger condition isn't testing the right thing. Requires domain knowledge.
