CONCEPT

View-based data environment

Definition

A view-based data environment populates a new data environment not by copying rows from a source environment but by emitting SQL views that point back at the source environment's tables. Only tables written by the pipeline under test get real tables; all read-only tables are views.

Typical DDL (from the source post):

CREATE VIEW db_attribution_feature1.m_events
AS SELECT * FROM db_attribution_test.m_events

Why it's cheap

For a data pipeline with hundreds of input tables and a handful of output tables, full copies mean hours of Spark jobs and doubled storage. Creating views is a DDL-only operation: no data is moved and no extra storage is used. Creating a new data environment collapses from hours to seconds.

What gets real tables

Only the tables that the pipeline under test writes get real tables. The rule, per the source post:

"A view is only created if the table is not used as output by one of the respective tasks."
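The rule quoted above can be sketched in a few lines of Python. This is an illustrative sketch, not Zalando's script: the function name plan_environment, its parameters, and the table name m_roi are hypothetical, and the use of CREATE TABLE ... LIKE for output tables is an assumption about how an empty real table might be created.

```python
def plan_environment(tables, output_tables, source_db, target_db):
    """Return one DDL statement per table for a new data environment.

    Hypothetical helper: a table gets a view unless the pipeline under
    test writes to it, in which case it gets a real (empty) table.
    """
    outputs = set(output_tables)
    statements = []
    for table in sorted(tables):
        if table in outputs:
            # Written by the pipeline under test: needs a real table.
            # CREATE TABLE ... LIKE is an assumed way to clone the schema.
            statements.append(
                f"CREATE TABLE {target_db}.{table} LIKE {source_db}.{table}"
            )
        else:
            # Read-only input: a view pointing back at the source table.
            statements.append(
                f"CREATE VIEW {target_db}.{table} "
                f"AS SELECT * FROM {source_db}.{table}"
            )
    return statements


if __name__ == "__main__":
    for ddl in plan_environment(
        tables=["m_events", "m_roi"],   # m_roi is a made-up output table
        output_tables=["m_roi"],
        source_db="db_attribution_test",
        target_db="db_attribution_feature1",
    ):
        print(ddl)
```

Because the function only builds strings, the plan can be reviewed or logged before anything is executed against the warehouse.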

For write tables that also need seed data, Zalando's script takes a partition range in config and copies just that slice:

db_attribution.m_events:
    partitions:
        - date BETWEEN "x" AND "y"

The script materialises these ranges so the pipeline has plausible input for the output step.
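As a rough sketch of how such a config could be turned into copy statements, the Python below takes the config as an already-parsed dictionary and emits one INSERT ... SELECT per partition predicate. The function name seed_statements and the assumption that the target table already exists are illustrative; the source post does not show how the script performs the copy.

```python
def seed_statements(config, target_db):
    """Build partition-range copy statements from a parsed config.

    Hypothetical helper. `config` maps "source_db.table" to a dict with a
    "partitions" list of SQL predicates, mirroring the config shown above.
    Assumes the target table has already been created.
    """
    statements = []
    for source_table, spec in config.items():
        # "db_attribution.m_events" -> "m_events"
        table = source_table.split(".", 1)[1]
        for predicate in spec.get("partitions", []):
            statements.append(
                f"INSERT INTO {target_db}.{table} "
                f"SELECT * FROM {source_table} WHERE {predicate}"
            )
    return statements


if __name__ == "__main__":
    config = {
        "db_attribution.m_events": {
            "partitions": ['date BETWEEN "x" AND "y"'],
        }
    }
    for sql in seed_statements(config, "db_attribution_feature1"):
        print(sql)
```

Only the rows matching each predicate are copied, so the seeded table stays small compared with a full copy.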

When it breaks down

  • If the pipeline under test modifies schema — a view won't accept an ALTER that only exists in the test branch; the table would need to be materialised first.
  • If the read tables are themselves non-deterministic (e.g. being mutated concurrently by other pipelines) — the view exposes those mutations to the test run.
  • If a view's source table is dropped — the view becomes invalid.

None of these bite Zalando's marketing ROI pipeline enough to change the default.

Seen in

  • Zalando's create-data-environment script defaults to view DDL and escalates to a partition-range copy only for output tables that need seed data.