
CONCEPT Cited by 1 source

Data environment

Definition

A data environment is a named set of databases / tables / views (all sharing a naming suffix) that a pipeline environment reads from and writes to. It is the data-layer half of an isolated end-to-end pipeline run.

From sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning:

"A data environment is a set of Spark/Hive databases, tables and views. A pipeline environment uses a single data environment for reading and writing data."

At Zalando the naming convention is a suffix: _live, _test, _feature1. Example database names: db_attribution_live, db_attribution_test, db_attribution_feature1.
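
The suffix convention can be sketched as a small helper. This is a minimal illustration, not Zalando's code: the function name `database_name` and the validation rule are assumptions made for the example.

```python
def database_name(base: str, env: str) -> str:
    """Return the database name for `base` inside the data environment `env`.

    Hypothetical helper: appends the environment name as a suffix,
    following the convention described above.
    """
    # Assumption: environment names are simple identifiers (letters, digits, underscores).
    if not env.isidentifier():
        raise ValueError(f"invalid data environment name: {env!r}")
    return f"{base}_{env}"


names = [database_name("db_attribution", env) for env in ("live", "test", "feature1")]
print(names)
```

Deriving every database name from the environment name in one place means a pipeline can be pointed at a different data environment by changing a single value.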

Why the abstraction matters

Without data isolation, a per-PR pipeline environment still collides with every other PR on the shared _test tables: two in-flight features both writing to db_attribution_test.m_events corrupt each other's runs. Versioning the compute layer without versioning the data layer does not give true end-to-end isolation.

Implementation strategies

  1. Full copy: copy all rows from the source env into the new env. Correct, but slow and expensive for large tables (hours of runtime plus storage cost).
  2. View-based: for every table the pipeline only reads, create a view over the source env, e.g. CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events. Cheap (DDL only, no data copy). Only output tables (those written by tasks in this pipeline) get real tables, optionally seeded by a partition-range copy.

Zalando's data-environment creation script defaults to (2) — the view-over-copy pattern — because the vast majority of tables in a pipeline are inputs, not outputs.
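
A rough sketch of what a view-based creation script might generate is shown below. It is not Zalando's actual script: the function `build_env_ddl`, its parameters, and the output table name `m_attribution` are hypothetical, and the sketch only builds Spark/Hive-style DDL strings without executing them or seeding output tables.

```python
from typing import Iterable, List


def build_env_ddl(
    base_db: str,
    source_env: str,
    target_env: str,
    input_tables: Iterable[str],
    output_tables: Iterable[str],
) -> List[str]:
    """Build DDL for a new data environment using views for inputs.

    Hypothetical helper: returns statements as strings; running them
    (e.g. through a Spark session) is left to the caller.
    """
    source_db = f"{base_db}_{source_env}"
    target_db = f"{base_db}_{target_env}"
    statements = [f"CREATE DATABASE IF NOT EXISTS {target_db}"]

    # Input tables: a view over the source environment, so no data is copied.
    for table in input_tables:
        statements.append(
            f"CREATE VIEW IF NOT EXISTS {target_db}.{table} "
            f"AS SELECT * FROM {source_db}.{table}"
        )

    # Output tables: real (initially empty) tables with the source schema,
    # so writes in the new environment never touch the source tables.
    for table in output_tables:
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {target_db}.{table} "
            f"LIKE {source_db}.{table}"
        )
    return statements


ddl = build_env_ddl(
    base_db="db_attribution",
    source_env="test",
    target_env="feature1",
    input_tables=["m_events"],
    output_tables=["m_attribution"],  # hypothetical output table name
)
for statement in ddl:
    print(statement)
```

Because input tables usually far outnumber output tables, most of the generated statements are views, which is where the cost saving over a full copy comes from.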

1-to-1 with pipeline environment

A pipeline environment consumes exactly one data environment. The pairing is what makes the end-to-end isolation property hold. Swapping data envs under a pipeline env (or vice versa) breaks the invariant.
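
One way to picture the pairing is an immutable object that binds a pipeline environment to its single data environment and rejects table names from any other one. This is an illustrative sketch only: the class `PipelineEnvironment` and its methods are assumptions, not part of Airflow or of Zalando's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the data environment cannot be swapped after creation
class PipelineEnvironment:
    """Hypothetical pairing of one pipeline environment with one data environment."""

    name: str
    data_env: str

    def database(self, base: str) -> str:
        """Resolve a base database name inside this environment's data environment."""
        return f"{base}_{self.data_env}"

    def check_table(self, qualified_name: str) -> None:
        """Raise if `database.table` does not belong to the paired data environment."""
        database = qualified_name.split(".", 1)[0]
        if not database.endswith(f"_{self.data_env}"):
            raise ValueError(
                f"{qualified_name} is outside data environment {self.data_env!r}"
            )


env = PipelineEnvironment(name="feature1", data_env="feature1")
print(env.database("db_attribution"))
env.check_table("db_attribution_feature1.m_events")  # passes silently
```

A check like this would catch a task that references db_attribution_test.m_events directly instead of going through its own environment's view.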

Seen in
