
CONCEPT Cited by 1 source

View-based data environment

Definition

A view-based data environment populates a new data environment not by copying rows from a source environment but by emitting SQL views that point back at the source environment's tables. Only tables written by the pipeline under test get real tables; all read-only tables are views.

Typical DDL (from sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning):

CREATE VIEW db_attribution_feature1.m_events
AS SELECT * FROM db_attribution_test.m_events

Why it's cheap

For a data pipeline with hundreds of input tables and a handful of output tables, full copies mean hours of Spark jobs and doubled storage. Creating a view is a DDL-only operation: no data is moved and no extra storage is used. Creating a new data environment collapses from hours to seconds.

What gets real tables

Tables that the pipeline under test writes. The logic (per the source post):

"A view is only created if the table is not used as output by one of the respective tasks."
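The rule quoted above can be sketched in a few lines of Python. This is a minimal illustration, not Zalando's script: the function name `view_ddl`, its parameters, and the table names `m_costs` and `m_roi` are hypothetical; only `m_events` and the two database names come from the DDL shown earlier.

```python
def view_ddl(tables, output_tables, source_db, target_db):
    """Return CREATE VIEW statements for every table the pipeline only reads."""
    statements = []
    for table in sorted(tables):
        if table in output_tables:
            # Written by the pipeline under test: it needs a real table, so no view.
            continue
        statements.append(
            f"CREATE VIEW {target_db}.{table} AS SELECT * FROM {source_db}.{table}"
        )
    return statements


# Hypothetical inventory: two read-only tables and one output table.
ddl = view_ddl(
    tables={"m_events", "m_costs", "m_roi"},
    output_tables={"m_roi"},
    source_db="db_attribution_test",
    target_db="db_attribution_feature1",
)
for statement in ddl:
    print(statement)
```

Sorting the table names only makes the emitted DDL deterministic; the rule itself is the single membership check against `output_tables`.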

For write tables that also need seed data, Zalando's script takes a partition range in config and copies just that slice:

db_attribution.m_events:
    partitions:
        - date BETWEEN "x" AND "y"

The script materialises these ranges so the pipeline has plausible input for the output step.
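One way such a config could be turned into seed statements is sketched below. Everything here is an assumption for illustration: the source post does not show how the copy is performed, so the `CREATE TABLE ... AS SELECT` form, the function name `seed_ddl`, the target database argument, and the choice to OR multiple partition predicates together are all hypothetical. The dict mirrors the YAML above, with the `"x"` and `"y"` placeholders left as they are.

```python
def seed_ddl(seed_config, target_db):
    """Return one CREATE TABLE AS SELECT statement per seeded table."""
    statements = []
    for qualified_name, spec in seed_config.items():
        # "db_attribution.m_events" -> table name "m_events"
        _source_db, table = qualified_name.split(".", 1)
        # Assumption: several predicates each name a slice to include, so they are OR-ed.
        where = " OR ".join(f"({predicate})" for predicate in spec["partitions"])
        statements.append(
            f"CREATE TABLE {target_db}.{table} AS "
            f"SELECT * FROM {qualified_name} WHERE {where}"
        )
    return statements


# Same structure as the YAML config, written as a Python dict.
seed_config = {
    "db_attribution.m_events": {
        "partitions": ['date BETWEEN "x" AND "y"'],
    },
}
seed_statements = seed_ddl(seed_config, "db_attribution_feature1")
for statement in seed_statements:
    print(statement)
```

Because the copy is restricted by the partition predicate, its cost scales with the size of the slice rather than with the size of the full table.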

When it breaks down

  • If the pipeline under test modifies schema: a view won't accept an ALTER that exists only in the test branch, so the table would need to be materialised first.
  • If the read tables are themselves unstable (e.g. being mutated concurrently by other pipelines): the view exposes those mutations to the test run.
  • If a view's source table is dropped: the view becomes invalid.

None of these bite Zalando's marketing ROI pipeline enough to change the default.

Seen in
