CONCEPT Cited by 1 source
Data environment¶
Definition¶
A data environment is a named set of databases, tables, and views (all sharing a naming suffix) that a pipeline environment reads from and writes to. It is the data-layer half of an isolated end-to-end pipeline run.
From sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning:
"A data environment is a set of Spark/Hive databases, tables and views. A pipeline environment uses a single data environment for reading and writing data."
At Zalando the naming convention is a suffix: _live, _test, _feature1. Example database names: db_attribution_live, db_attribution_test, db_attribution_feature1.
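The suffix convention can be expressed as a small naming helper. This is a minimal sketch, not Zalando's code: the function name `qualified_name` and its parameters are assumptions made for illustration; only the database and table names come from the source.

```python
def qualified_name(base_db: str, table: str, env: str) -> str:
    """Build '<base_db>_<env>.<table>' for a given data environment suffix."""
    return f"{base_db}_{env}.{table}"


# The same logical table resolved in three data environments.
names = [qualified_name("db_attribution", "m_events", env)
         for env in ("live", "test", "feature1")]
```

Because the environment is the only varying input, pipeline code that builds every table reference through one such helper can be pointed at a different data environment by changing a single value.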
Why the abstraction matters¶
Without data isolation, a per-PR pipeline environment still collides with every other PR on the shared _test tables: two in-flight features both writing to db_attribution_test.m_events corrupt each other's runs. Versioning the compute layer without versioning the data layer does not give true end-to-end isolation.
Implementation strategies¶
- Copy-based: copy all rows from the source environment into the new one. Correct, but slow and expensive (hours of runtime plus storage cost) for large tables.
- View-based: for every table the pipeline only reads, create a view, e.g. CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events. Cheap (DDL only, no data copy). Only output tables (those written by tasks in this pipeline) get real tables, optionally seeded by a partition-range copy.
Zalando's data-environment creation script defaults to the view-based strategy (the view-over-copy pattern), because the vast majority of tables in a pipeline are inputs, not outputs.
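The view-based strategy can be sketched as a function that emits the DDL for a new data environment. This is an illustration under stated assumptions, not Zalando's creation script: the function name, its parameters, and the output table `m_report` are hypothetical, and using `CREATE TABLE ... LIKE` to give output tables a schema is one possible choice. The optional partition-range seeding is left out because the source does not describe how partitions are selected.

```python
from typing import Iterable, List


def build_data_env_ddl(
    base_db: str,
    source_env: str,
    target_env: str,
    input_tables: Iterable[str],
    output_tables: Iterable[str],
) -> List[str]:
    """Return SQL statements that set up target_env from source_env."""
    source_db = f"{base_db}_{source_env}"
    target_db = f"{base_db}_{target_env}"
    statements = [f"CREATE DATABASE IF NOT EXISTS {target_db}"]
    # Read-only inputs: a view over the source table, so no data is copied.
    for table in input_tables:
        statements.append(
            f"CREATE VIEW IF NOT EXISTS {target_db}.{table} "
            f"AS SELECT * FROM {source_db}.{table}"
        )
    # Outputs: a real, empty table that reuses the source table's schema.
    for table in output_tables:
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {target_db}.{table} "
            f"LIKE {source_db}.{table}"
        )
    return statements


ddl = build_data_env_ddl(
    "db_attribution", "test", "feature1",
    input_tables=["m_events"],
    output_tables=["m_report"],  # hypothetical output table name
)
```

With a live Spark session, each statement could then be executed with `spark.sql(statement)`. Generating the statements separately from running them keeps the plan easy to inspect before anything is created.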
1-to-1 with pipeline environment¶
A pipeline environment consumes exactly one data environment. The pairing is what makes the end-to-end isolation property hold. Swapping data envs under a pipeline env (or vice versa) breaks the invariant.
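One way to make the pairing hard to break in code is to model the pipeline environment as an immutable value that carries its single data environment. The class and field names below are assumptions for illustration; the source does not describe how Zalando enforces the invariant.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class PipelineEnvironment:
    name: str      # hypothetical pipeline environment name
    data_env: str  # the one data environment suffix, e.g. "feature1"

    def table(self, base_db: str, table: str) -> str:
        """Resolve a table reference inside this environment's data env."""
        return f"{base_db}_{self.data_env}.{table}"


env = PipelineEnvironment(name="feature1", data_env="feature1")
resolved = env.table("db_attribution", "m_events")

# A frozen dataclass rejects attempts to swap the data env after creation.
try:
    env.data_env = "test"
    swapped = True
except FrozenInstanceError:
    swapped = False
```

Routing every read and write through `env.table(...)` means a task cannot address a different data environment by accident.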
Related¶
- concepts/pipeline-environment
- concepts/view-based-data-environment
- concepts/per-pr-ephemeral-environment
- systems/apache-spark
- systems/aws-s3
- patterns/view-over-copy-for-test-data-environment
- patterns/per-pr-airflow-environment-via-dag-versioning
Seen in¶
- sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning: Zalando's marketing ROI pipeline has _live, _test, and per-PR _featureN data environments on Spark/S3.