CONCEPT

View-based data environment

Definition

A view-based data environment populates a new data environment not by copying rows from a source environment but by emitting SQL views that point back at the source environment's tables. Only tables that the pipeline under test writes are created as real tables; every read-only table is a view.

Typical DDL (from the source post):

CREATE VIEW db_attribution_feature1.m_events
AS SELECT * FROM db_attribution_test.m_events
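The DDL above follows a fixed pattern, so a script can generate it from three names. A minimal sketch, assuming a hypothetical helper `view_ddl` (the name and signature are illustrative and not taken from Zalando's script):

```python
def view_ddl(source_db: str, target_db: str, table: str) -> str:
    """Return DDL for a view in target_db that reads `table` from source_db.

    Hypothetical helper: it only builds the statement text and does not
    execute anything against a database.
    """
    return (
        f"CREATE VIEW {target_db}.{table}\n"
        f"AS SELECT * FROM {source_db}.{table}"
    )


if __name__ == "__main__":
    # Reproduces the statement shown above for the m_events table.
    print(view_ddl("db_attribution_test", "db_attribution_feature1", "m_events"))
```

Because the statement contains only names, creating it touches metadata and never the underlying rows.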

Why it's cheap

For a data pipeline with hundreds of input tables and a handful of output tables, full copies mean hours of Spark jobs and doubled storage. Views are created by DDL-only operations, with no data movement and no extra storage. Creating a new data environment collapses from hours to seconds.

What gets real tables

Tables that the pipeline under test writes. The logic (per the source post):

"A view is only created if the table is not used as output by one of the respective tasks."
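The quoted rule can be sketched as a set difference over the tasks' declared inputs and outputs. This is an illustration under assumptions: the `Task` type, the `split_tables` function, and every table name other than `m_events` are hypothetical and do not come from the source post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of one pipeline task."""
    name: str
    inputs: frozenset
    outputs: frozenset


def split_tables(tasks):
    """Return (view_tables, real_tables) for the given tasks.

    A table becomes a view only if no task lists it as an output;
    any table written by at least one task becomes a real table.
    """
    outputs = set()
    touched = set()
    for task in tasks:
        outputs |= task.outputs
        touched |= task.inputs | task.outputs
    return sorted(touched - outputs), sorted(outputs)


if __name__ == "__main__":
    tasks = [
        Task("sessionize", frozenset({"m_events"}), frozenset({"m_sessions"})),
        Task("attribute", frozenset({"m_sessions", "m_orders"}),
             frozenset({"m_attribution"})),
    ]
    views, real = split_tables(tasks)
    print("views:", views)
    print("real tables:", real)
```

In this example `m_sessions` is read by one task and written by another, so it is treated as an output and gets a real table rather than a view.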

For write tables that also need seed data, Zalando's script takes a partition range in config and copies just that slice:

db_attribution.m_events:
    partitions:
        - date BETWEEN "x" AND "y"

The script materialises these ranges so the pipeline has plausible input for the output step.
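One way such a config could be turned into copy statements is sketched below. The source does not show the statement the script actually runs, so the `INSERT INTO ... SELECT` form, the `seed_statements` helper, and the assumption that the target table already exists with a matching schema are all illustrative. The config is written as a Python dict that mirrors the YAML above, to keep the example free of third-party dependencies.

```python
def seed_statements(config, target_db):
    """Build one copy statement per configured partition predicate.

    `config` maps "source_db.table" to {"partitions": [predicate, ...]}.
    Assumption: the copy is an INSERT ... SELECT filtered by the predicate;
    the real script may copy the slice differently.
    """
    statements = []
    for qualified_name, spec in config.items():
        source_db, table = qualified_name.split(".", 1)
        for predicate in spec.get("partitions", []):
            statements.append(
                f"INSERT INTO {target_db}.{table} "
                f"SELECT * FROM {source_db}.{table} WHERE {predicate}"
            )
    return statements


if __name__ == "__main__":
    config = {
        "db_attribution.m_events": {
            "partitions": ['date BETWEEN "x" AND "y"'],
        },
    }
    for statement in seed_statements(config, "db_attribution_feature1"):
        print(statement)
```

Tables with no `partitions` entry yield no statements, which matches the idea that only the configured slices are copied.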

When it breaks down

If the pipeline under test modifies a schema, a view won't accept an ALTER that exists only in the test branch; the table would need to be materialised first.
If the read tables are themselves non-deterministic (e.g. being mutated concurrently by other pipelines), the view exposes those mutations to the test run.
If a view's source table is dropped, the view becomes invalid.

None of these bite Zalando's marketing ROI pipeline enough to change the default.
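The dropped-source case lends itself to a cheap pre-flight check. The sketch below assumes the environment keeps a mapping from each view to its source table and can list the tables that currently exist; both inputs and the `dangling_views` helper are hypothetical, and nothing here is part of Zalando's script.

```python
def dangling_views(view_sources, existing_tables):
    """Return the views whose source table no longer exists.

    view_sources: mapping of view name -> fully qualified source table.
    existing_tables: set of fully qualified tables currently in the catalog.
    """
    return sorted(
        view for view, source in view_sources.items()
        if source not in existing_tables
    )


if __name__ == "__main__":
    view_sources = {
        "db_attribution_feature1.m_events": "db_attribution_test.m_events",
        "db_attribution_feature1.m_orders": "db_attribution_test.m_orders",
    }
    # Suppose m_orders has been dropped from the source environment.
    existing = {"db_attribution_test.m_events"}
    print(dangling_views(view_sources, existing))
```

A check like this only detects missing sources; it does not cover schema changes or concurrent mutation of the read tables.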

Seen in

Zalando's create data environment script defaults to view DDL and escalates to a partition-range copy only for output tables that need seed data.