

View over copy for test data environment

Pattern

When creating a new data environment (e.g. per-PR test env), don't copy rows from the source environment — emit SQL views that point at the source environment's tables. Only tables that will be written by the pipeline under test get real tables, optionally seeded with a small partition range copied from the source.

-- Read table: view, no data copy
CREATE VIEW db_attribution_feature1.m_events
AS SELECT * FROM db_attribution_test.m_events;

-- Output table: real table with partition-range seed
-- (produced by the same data-env-creation script, driven by yaml config)
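The read/write asymmetry above can be exercised end to end. A minimal sketch using Python's stdlib sqlite3 as a stand-in for the Spark/Hive catalog — sqlite has no multi-database catalogs, so environment-prefixed table names (`test__`, `feature1__`) stand in for database names; all names and dates are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source env: a read table and a prior output table, both with data.
cur.execute("CREATE TABLE test__m_events (date TEXT, user_id INT)")
cur.executemany("INSERT INTO test__m_events VALUES (?, ?)",
                [("2022-06-01", 1), ("2022-06-02", 2), ("2022-06-03", 3)])
cur.execute("CREATE TABLE test__m_attributions (date TEXT, user_id INT)")
cur.execute("INSERT INTO test__m_attributions SELECT * FROM test__m_events")

# New env, read table: a view -- DDL only, no rows move.
cur.execute("CREATE VIEW feature1__m_events AS SELECT * FROM test__m_events")

# New env, output table: a real table, seeded with a partition range only.
cur.execute("CREATE TABLE feature1__m_attributions (date TEXT, user_id INT)")
cur.execute("""INSERT INTO feature1__m_attributions
               SELECT * FROM test__m_attributions
               WHERE date BETWEEN '2022-06-01' AND '2022-06-02'""")

print(cur.execute("SELECT COUNT(*) FROM feature1__m_events").fetchone()[0])        # 3
print(cur.execute("SELECT COUNT(*) FROM feature1__m_attributions").fetchone()[0])  # 2
```

The view sees all three source rows without storing any; the output table holds only the two seeded rows.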

Why

Copying all data from _test into a new per-PR _featureN environment for a realistic-sized pipeline would take hours of Spark jobs and double the storage cost for every open PR. Views collapse that to DDL-only operations that finish in seconds, because no rows move.

The asymmetry in a typical pipeline — hundreds of input tables, a handful of output tables — makes the view-based default massively cheaper than uniform copying.

How (Zalando's script)

From sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning, the data-environment creation script:

  1. Reads yaml configs enumerating input + output tables per task.
  2. For each input table: emit CREATE VIEW new_env.table AS SELECT * FROM source_env.table.
  3. For each output table:
    • Create a real (empty) table.
    • If the config declares a seed partition range, copy only those partitions from the source env:
      db_attribution.m_events:
          partitions:
              - date BETWEEN "x" AND "y"
      
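The three steps above can be sketched as a DDL generator. This assumes the yaml has already been parsed into a dict; the config structure is inferred from the snippet above, and the `LIKE` clause, helper name, and table names are illustrative, not Zalando's actual script:

```python
# Config as it might look after yaml parsing (structure assumed, not confirmed).
config = {
    "inputs": ["db_attribution_test.m_events", "db_attribution_test.m_sessions"],
    "outputs": {
        "db_attribution.m_attributions": {
            "partitions": ['date BETWEEN "x" AND "y"'],
        },
    },
}

def env_ddl(config, new_env, source_env):
    """Emit the DDL for one new data environment: views for inputs,
    real tables (plus partition-range seed inserts) for outputs."""
    stmts = []
    for table in config["inputs"]:
        name = table.split(".", 1)[1]
        stmts.append(f"CREATE VIEW {new_env}.{name} "
                     f"AS SELECT * FROM {source_env}.{name}")
    for table, opts in config["outputs"].items():
        name = table.split(".", 1)[1]
        stmts.append(f"CREATE TABLE {new_env}.{name} LIKE {source_env}.{name}")
        for pred in opts.get("partitions", []):
            stmts.append(f"INSERT INTO {new_env}.{name} "
                         f"SELECT * FROM {source_env}.{name} WHERE {pred}")
    return stmts

for s in env_ddl(config, "db_attribution_feature1", "db_attribution_test"):
    print(s)
```

Two inputs and one seeded output yield four statements: two `CREATE VIEW`, one `CREATE TABLE`, one partition-range `INSERT`.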

Tradeoffs

  • Creation time collapses from hours to seconds.
  • Storage stays O(output tables × seed partitions) per env, not O(all tables).
  • Read-only inputs are automatically fresh — views reflect the source env's current state.
  • Views break if the pipeline-under-test needs to ALTER the schema of a read table — you'd need to materialise it first.
  • Non-deterministic read sources — if the source env's tables are being mutated concurrently, the test run sees those mutations.
  • Dropping or renaming a source table invalidates the view across every dependent env.
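The last two tradeoffs are easy to demonstrate. A small sqlite3 sketch (table names illustrative): dropping the source table breaks the view-based env at query time, while an env that materialised a copy keeps working:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test__m_events (date TEXT)")
cur.execute("INSERT INTO test__m_events VALUES ('2022-06-01')")

# Env 1 uses a view; env 2 materialised a copy instead.
cur.execute("CREATE VIEW feature1__m_events AS SELECT * FROM test__m_events")
cur.execute("CREATE TABLE feature2__m_events AS SELECT * FROM test__m_events")

cur.execute("DROP TABLE test__m_events")

try:
    cur.execute("SELECT * FROM feature1__m_events")
except sqlite3.OperationalError as e:
    print("view broken:", e)  # the view's source table no longer exists

# The materialised copy is unaffected by the drop.
print(cur.execute("SELECT COUNT(*) FROM feature2__m_events").fetchone()[0])  # 1
```

The same mechanism explains the freshness bullet: because the view is resolved at query time, any concurrent mutation of the source table is visible to the next read.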

Scope

The pattern is most natural on catalog systems that support views on shared storage — Spark/Hive over S3 at Zalando; similarly applicable to Snowflake, BigQuery, Databricks, Iceberg, Delta Lake.
