View over copy for test data environment
Pattern
When creating a new data environment (e.g. a per-PR test environment), don't copy rows from the source environment; instead, emit SQL views that point at the source environment's tables. Only the tables the pipeline under test will write to get real tables, optionally seeded with a small partition range copied from the source.
```sql
-- Read table: view, no data copy
CREATE VIEW db_attribution_feature1.m_events
AS SELECT * FROM db_attribution_test.m_events;

-- Output table: real table with a partition-range seed
-- (produced by the same data-env-creation script, driven by yaml config)
```
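The read-side semantics can be demonstrated in a few lines of Python, using sqlite3 as a stand-in for the shared catalog; the prefix-based naming (`test_` / `feature1_`) and the table columns are illustrative, not from the source:

```python
import sqlite3

# In-memory stand-in for a shared catalog; table and column names
# are illustrative, not from the Zalando source.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_m_events (ds TEXT, user_id INT)")
con.executemany("INSERT INTO test_m_events VALUES (?, ?)",
                [("2022-06-01", 1), ("2022-06-02", 2), ("2022-06-03", 3)])

# Read table: a view over the source env's table -- no rows move.
con.execute("CREATE VIEW feature1_m_events AS SELECT * FROM test_m_events")

rows = con.execute("SELECT COUNT(*) FROM feature1_m_events").fetchone()[0]
print(rows)  # 3 -- the view reflects the source env's current data
```

Because the view is just DDL, creating it costs the same whether the source table holds three rows or three billion.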
Why
Copying all data from _test into a new per-PR _featureN environment for a realistic-sized pipeline would take hours of Spark jobs and double the storage cost per open PR. Views collapse that to DDL-only operations that finish in seconds, because no rows move.
The asymmetry in a typical pipeline — hundreds of input tables, a handful of output tables — makes the view-based default massively cheaper than uniform copying.
How (Zalando's script)
From sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning, the data-environment creation script:
- Reads yaml configs enumerating the input and output tables per task.
- For each input table, emits `CREATE VIEW new_env.table AS SELECT * FROM source_env.table`.
- For each output table:
  - Creates a real (empty) table.
  - If the config declares a seed partition range, copies only those partitions from the source env.
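The steps above can be sketched as a toy script. The config shape, the `ds` partition column, and the env-prefix naming (`test_` / `feature1_`) are assumptions for illustration, with sqlite3 standing in for the catalog:

```python
import sqlite3

# Stand-in for a parsed yaml config; the shape and field names are
# assumptions, not Zalando's actual config format.
config = {
    "inputs": ["m_events"],
    "outputs": [
        {"table": "m_attribution",
         "seed_partitions": ("2022-06-01", "2022-06-02")},
    ],
}

def create_data_env(con, src, dst, config):
    for table in config["inputs"]:
        # Input table: view only, no data copy.
        con.execute(f"CREATE VIEW {dst}_{table} AS SELECT * FROM {src}_{table}")
    for out in config["outputs"]:
        t = out["table"]
        # Output table: real, empty table with the source's schema.
        con.execute(f"CREATE TABLE {dst}_{t} AS SELECT * FROM {src}_{t} WHERE 0")
        start, end = out.get("seed_partitions", (None, None))
        if start:
            # Seed only the declared partition range.
            con.execute(
                f"INSERT INTO {dst}_{t} SELECT * FROM {src}_{t} "
                "WHERE ds BETWEEN ? AND ?",
                (start, end),
            )

# Usage: a source env with one input and one output table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_m_events (ds TEXT)")
con.execute("CREATE TABLE test_m_attribution (ds TEXT)")
con.executemany("INSERT INTO test_m_attribution VALUES (?)",
                [("2022-06-01",), ("2022-06-02",), ("2022-06-03",)])
create_data_env(con, "test", "feature1", config)
seeded = con.execute(
    "SELECT COUNT(*) FROM feature1_m_attribution").fetchone()[0]
print(seeded)  # 2 -- only the declared partition range was copied
```

The real script targets a Spark/Hive catalog with schema-qualified names rather than prefixes, but the control flow is the same: views for every input, tables plus optional partition-range copies for every output.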
Tradeoffs
- ✅ Creation time collapses from hours to seconds.
- ✅ Storage stays O(output tables × seed partitions) per env, not O(all tables).
- ✅ Read-only inputs are automatically fresh — views reflect the source env's current state.
- ❌ Views break if the pipeline-under-test needs to ALTER the schema of a read table — you'd need to materialise it first.
- ❌ Non-deterministic read sources — if the source env's tables are being mutated concurrently, the test run sees those mutations.
- ❌ Dropping or renaming a source table invalidates the view across every dependent env.
Scope
The pattern is most natural on catalog systems that support views on shared storage — Spark/Hive over S3 at Zalando; similarly applicable to Snowflake, BigQuery, Databricks, Iceberg, Delta Lake.
Related patterns
- patterns/per-pr-airflow-environment-via-dag-versioning — the parent pattern; views-over-copies is the data-layer half.
Seen in
- sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning — Zalando's marketing ROI pipeline data-env creation script defaults to view DDL for input tables, partition-range copy for output-table seeds.