

View over copy for test data environment

Pattern

When creating a new data environment (e.g. per-PR test env), don't copy rows from the source environment — emit SQL views that point at the source environment's tables. Only tables that will be written by the pipeline under test get real tables, optionally seeded with a small partition range copied from the source.

-- Read table: view, no data copy
CREATE VIEW db_attribution_feature1.m_events
AS SELECT * FROM db_attribution_test.m_events;

-- Output table: real table with partition-range seed
-- (produced by the same data-env-creation script, driven by yaml config)
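The read/write asymmetry above can be exercised end to end. A minimal sketch using Python's stdlib sqlite3 as a stand-in for the Spark/Hive catalog — sqlite has no multi-database catalogs, so environment-prefixed table names (`test__`, `feature1__`) stand in for database names; all names and dates are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source env: a read table and a prior output table, both with data.
cur.execute("CREATE TABLE test__m_events (date TEXT, user_id INT)")
cur.executemany("INSERT INTO test__m_events VALUES (?, ?)",
                [("2022-06-01", 1), ("2022-06-02", 2), ("2022-06-03", 3)])
cur.execute("CREATE TABLE test__m_attributions (date TEXT, user_id INT)")
cur.execute("INSERT INTO test__m_attributions SELECT * FROM test__m_events")

# New env, read table: a view -- DDL only, no rows move.
cur.execute("CREATE VIEW feature1__m_events AS SELECT * FROM test__m_events")

# New env, output table: a real table, seeded with a partition range only.
cur.execute("CREATE TABLE feature1__m_attributions (date TEXT, user_id INT)")
cur.execute("""INSERT INTO feature1__m_attributions
               SELECT * FROM test__m_attributions
               WHERE date BETWEEN '2022-06-01' AND '2022-06-02'""")

print(cur.execute("SELECT COUNT(*) FROM feature1__m_events").fetchone()[0])        # 3
print(cur.execute("SELECT COUNT(*) FROM feature1__m_attributions").fetchone()[0])  # 2
```

The view sees all three source rows without storing any; the output table holds only the two seeded rows.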

Why

Copying all data from _test into a new per-PR _featureN environment for a realistic-sized pipeline would take hours of Spark jobs and double the storage cost for every open PR. Views collapse that to DDL-only operations that finish in seconds, because no rows move.

The asymmetry in a typical pipeline — hundreds of input tables, a handful of output tables — makes the view-based default massively cheaper than uniform copying.

How (Zalando's script)

From sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning, the data-environment creation script:

  1. Reads yaml configs enumerating input + output tables per task.
  2. For each input table: emit CREATE VIEW new_env.table AS SELECT * FROM source_env.table.
  3. For each output table:
    • Create a real (empty) table.
    • If the config declares a seed partition range, copy only those partitions from the source env:
      db_attribution.m_events:
          partitions:
              - date BETWEEN "x" AND "y"
      
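The three steps above can be sketched as a DDL generator. This assumes the yaml has already been parsed into a dict; the config structure is inferred from the snippet above, and the `LIKE` clause, helper name, and table names are illustrative, not Zalando's actual script:

```python
# Config as it might look after yaml parsing (structure assumed, not confirmed).
config = {
    "inputs": ["db_attribution_test.m_events", "db_attribution_test.m_sessions"],
    "outputs": {
        "db_attribution.m_attributions": {
            "partitions": ['date BETWEEN "x" AND "y"'],
        },
    },
}

def env_ddl(config, new_env, source_env):
    """Emit the DDL for one new data environment: views for inputs,
    real tables (plus partition-range seed inserts) for outputs."""
    stmts = []
    for table in config["inputs"]:
        name = table.split(".", 1)[1]
        stmts.append(f"CREATE VIEW {new_env}.{name} "
                     f"AS SELECT * FROM {source_env}.{name}")
    for table, opts in config["outputs"].items():
        name = table.split(".", 1)[1]
        stmts.append(f"CREATE TABLE {new_env}.{name} LIKE {source_env}.{name}")
        for pred in opts.get("partitions", []):
            stmts.append(f"INSERT INTO {new_env}.{name} "
                         f"SELECT * FROM {source_env}.{name} WHERE {pred}")
    return stmts

for s in env_ddl(config, "db_attribution_feature1", "db_attribution_test"):
    print(s)
```

Two inputs and one seeded output yield four statements: two `CREATE VIEW`, one `CREATE TABLE`, one partition-range `INSERT`.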

Tradeoffs

  • Creation time collapses from hours to seconds.
  • Storage stays O(output tables × seed partitions) per env, not O(all tables).
  • Read-only inputs are automatically fresh — views reflect the source env's current state.
  • Views break if the pipeline-under-test needs to ALTER the schema of a read table — you'd need to materialise it first.
  • Non-deterministic read sources — if the source env's tables are being mutated concurrently, the test run sees those mutations.
  • Dropping or renaming a source table invalidates the view across every dependent env.
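The last two tradeoffs are easy to demonstrate. A small sqlite3 sketch (table names illustrative): dropping the source table breaks the view-based env at query time, while an env that materialised a copy keeps working:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test__m_events (date TEXT)")
cur.execute("INSERT INTO test__m_events VALUES ('2022-06-01')")

# Env 1 uses a view; env 2 materialised a copy instead.
cur.execute("CREATE VIEW feature1__m_events AS SELECT * FROM test__m_events")
cur.execute("CREATE TABLE feature2__m_events AS SELECT * FROM test__m_events")

cur.execute("DROP TABLE test__m_events")

try:
    cur.execute("SELECT * FROM feature1__m_events")
except sqlite3.OperationalError as e:
    print("view broken:", e)  # the view's source table no longer exists

# The materialised copy is unaffected by the drop.
print(cur.execute("SELECT COUNT(*) FROM feature2__m_events").fetchone()[0])  # 1
```

The same mechanism explains the freshness bullet: because the view is resolved at query time, any concurrent mutation of the source table is visible to the next read.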

Scope

The pattern is most natural on catalog systems that support views on shared storage — Spark/Hive over S3 at Zalando; similarly applicable to Snowflake, BigQuery, Databricks, Iceberg, Delta Lake.
