CONCEPT Cited by 1 source
View-based data environment¶
Definition¶
A view-based data environment populates a new data environment not by copying rows from a source environment but by emitting SQL views that point back at the source environment's tables. Only tables written by the pipeline under test get real tables; all read-only tables are views.
Typical DDL (from sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning):
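The exact statement from the source post is not reproduced here. As a rough illustration only, assuming a Spark SQL or Hive-style `CREATE OR REPLACE VIEW` and hypothetical schema names (`prod` for the source environment, `test_env` for the new one), a view-per-table statement could be generated like this:

```python
def view_ddl(table: str, source_schema: str, target_schema: str) -> str:
    """Build DDL for a view in the target environment that reads the source table.

    The schema and table names are illustrative assumptions, not taken from
    the source post. Names are interpolated directly, so they must come from
    trusted configuration.
    """
    return (
        f"CREATE OR REPLACE VIEW {target_schema}.{table} AS "
        f"SELECT * FROM {source_schema}.{table}"
    )


# Hypothetical read-only input table.
statement = view_ddl("orders", "prod", "test_env")
print(statement)
```

The statement touches only catalog metadata, which is why no rows are copied when it runs.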
Why it's cheap¶
For a data pipeline with hundreds of input tables and a handful of output tables, full copies mean hours of Spark jobs and doubled storage. Creating a view is a DDL-only operation: no data is moved and no extra storage is used. Creating a new data environment drops from hours to seconds.
What gets real tables¶
Tables that the pipeline under test writes. The logic (per the source post):
"A view is only created if the table is not used as output by one of the respective tasks."
For write tables that also need seed data, Zalando's script takes a partition range in config and copies just that slice:
The script materialises these ranges so the pipeline has plausible input for the output step.
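The selection rule and the partition-range seed described above can be sketched together. This is a minimal sketch, not Zalando's script: the `SeedRange` type, the `plan_table` and `seed_copy_sql` helpers, the table names, and the partition column `dt` are all hypothetical, and the real config format is not shown in this page.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class SeedRange:
    """Hypothetical config entry: which partitions of an output table to copy."""
    partition_column: str
    start: str
    end: str


def plan_table(table: str, outputs: Set[str], seeds: Dict[str, SeedRange]) -> str:
    """Classify a table: a view unless the pipeline under test writes to it."""
    if table not in outputs:
        return "view"
    return "seeded_table" if table in seeds else "empty_table"


def seed_copy_sql(table: str, seed: SeedRange, source_schema: str, target_schema: str) -> str:
    """Build an INSERT that copies only the configured partition slice.

    Assumes the target table already exists with the same columns as the
    source, and that all values come from trusted configuration.
    """
    return (
        f"INSERT INTO {target_schema}.{table} "
        f"SELECT * FROM {source_schema}.{table} "
        f"WHERE {seed.partition_column} BETWEEN '{seed.start}' AND '{seed.end}'"
    )


# Hypothetical pipeline: one read-only input, two outputs, one of them seeded.
outputs = {"roi_report", "roi_daily"}
seeds = {"roi_daily": SeedRange("dt", "2022-05-01", "2022-05-07")}

plan = {t: plan_table(t, outputs, seeds) for t in ["clicks", "roi_report", "roi_daily"]}
seed_sql = seed_copy_sql("roi_daily", seeds["roi_daily"], "prod", "test_env")
```

Keeping the classification separate from the SQL generation makes the rule easy to check in isolation: any table missing from `outputs` is always a view.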
When it breaks down¶
- If the pipeline under test modifies schema: a view won't accept an `ALTER` that only exists in the test branch; the table would need to be materialised first.
- If the read tables are themselves non-deterministic (e.g. being mutated concurrently by other pipelines): the view exposes those mutations to the test run.
- If a view's source table is dropped: the view becomes invalid.
None of these bite Zalando's marketing ROI pipeline enough to change the default.
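Two of these failure modes are easy to reproduce on a small scale. The sketch below uses Python's built-in `sqlite3` as a stand-in engine, which is an assumption made for runnability; the source post concerns Spark-managed tables, and error types and messages there will differ. Table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Source environment" table and a view standing in for the test environment.
conn.execute("CREATE TABLE source_orders (id INTEGER)")
conn.execute("INSERT INTO source_orders VALUES (1)")
conn.execute("CREATE VIEW env_orders AS SELECT * FROM source_orders")

count_before = conn.execute("SELECT COUNT(*) FROM env_orders").fetchone()[0]

# Another pipeline mutates the source: the view sees the new row immediately.
conn.execute("INSERT INTO source_orders VALUES (2)")
count_after = conn.execute("SELECT COUNT(*) FROM env_orders").fetchone()[0]

# Dropping the source leaves the view defined but unusable.
conn.execute("DROP TABLE source_orders")
try:
    conn.execute("SELECT COUNT(*) FROM env_orders")
    view_still_valid = True
except sqlite3.OperationalError:
    view_still_valid = False

conn.close()
```

A test run that reads `env_orders` twice can therefore see different data between reads, which is the non-determinism the second bullet describes.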
Related¶
- concepts/data-environment
- concepts/pipeline-environment
- concepts/per-pr-ephemeral-environment
- patterns/view-over-copy-for-test-data-environment
- systems/apache-spark
- systems/aws-s3
Seen in¶
- sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning: Zalando's `create data environment` script defaults to view DDL and escalates to a partition-range copy only for output tables that need seed data.