SYSTEM Cited by 4 sources
Apache Airflow¶
Apache Airflow is an open-source workflow orchestration system created at Airbnb in 2014; it entered the Apache Incubator in 2016 and became a top-level Apache Software Foundation project in 2019. Workflows are expressed as Python-defined Directed Acyclic Graphs (DAGs) of operator tasks. The Airflow scheduler parses the DAG files, schedules task instances according to their dependencies and a cron-like or data-aware schedule, and tracks run state in a metadata database. Tasks execute on worker processes via one of several executors (for example LocalExecutor, CeleryExecutor, KubernetesExecutor).
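As a rough illustration of dependency-ordered scheduling, the sketch below uses Python's standard-library graphlib to release tasks in waves once their upstreams have finished. It is a toy model, not Airflow's scheduler: the task names are hypothetical, every task is assumed to succeed, and there is no metadata database, retry handling, or cron schedule.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of upstream tasks it waits on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform"},
    "report": {"quality_check", "load"},
}


def run_in_waves(deps):
    """Return batches of tasks; every task in a batch has all upstreams done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstreams all finished
        waves.append(ready)
        sorter.done(*ready)  # assume success, which unblocks downstream tasks
    return waves


waves = run_in_waves(dependencies)
print(waves)
# [['extract'], ['transform'], ['load', 'quality_check'], ['report']]
```

Tasks in the same wave have no dependency on each other, which is what lets an executor run them in parallel on separate workers.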
For the purposes of this wiki, Airflow is the default substrate for batch orchestration across data platforms: ETL, feature pipelines, ML training jobs, data-quality jobs, periodic reports, and warehouse-to-store sync. The most common managed offering is Astronomer-hosted Airflow (commercial SaaS on top of OSS Airflow, from a company that employs many core Airflow committers).
Common shapes it appears in¶
- Hand-written DAGs: one Python file per pipeline; the classic usage.
- Auto-generated DAGs: a higher-level configuration (YAML / JSON / SQL-plus-metadata) is compiled into Airflow DAGs by a codegen step. This is the config-driven DAG generation pattern: the platform owns the DAG boilerplate (monitoring, data-quality gates, alerting, metadata tagging), while customers own only the feature- or dataset-specific query and metadata.
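The auto-generated shape can be sketched as a small codegen step that renders DAG source text from a JSON config. Everything specific here is an assumption for illustration: the config keys, the `feature_` id prefix, the placeholder `echo` commands, and the template itself, which targets Airflow 2.4+ import paths. The sketch only syntax-checks the rendered file; it does not import Airflow or reproduce any particular company's generator.

```python
import json

# Hypothetical template for a generated DAG file (assumes Airflow 2.4+ imports).
# The platform owns this boilerplate; only the config-supplied values vary.
DAG_TEMPLATE = '''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id={dag_id!r},
    schedule={schedule!r},
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags={tags!r},
) as dag:
    run_query = BashOperator(task_id="run_query", bash_command={query_cmd!r})
    quality_check = BashOperator(task_id="quality_check", bash_command={check_cmd!r})
    run_query >> quality_check
'''

# Hypothetical config contract between customers and the platform.
REQUIRED_KEYS = ("feature_name", "schedule", "query_file", "owner")


def render_dag(config_json):
    """Validate a JSON config and return the source text of a DAG file."""
    config = json.loads(config_json)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"config is missing keys: {missing}")
    dag_id = f"feature_{config['feature_name']}"
    return DAG_TEMPLATE.format(
        dag_id=dag_id,
        schedule=config["schedule"],
        tags=["generated", f"owner:{config['owner']}"],
        # Placeholder commands; a real generator would submit the query to
        # a compute engine and run real data-quality checks.
        query_cmd=f"echo run-sql {config['query_file']}",
        check_cmd=f"echo check-rows {dag_id}",
    )


example_config = json.dumps({
    "feature_name": "rider_trip_count",
    "schedule": "@daily",
    "query_file": "queries/rider_trip_count.sql",
    "owner": "pricing",
})
source = render_dag(example_config)
compile(source, "generated_dag.py", "exec")  # syntax check only, nothing is run
print(source)
```

A generator like this would typically write `source` into the DAGs folder on a schedule, so a config change becomes a DAG change without customers touching Airflow code.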
Seen in¶
- sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution: Lyft's Feature Store batch lane runs on Astronomer-hosted Airflow. Customers define a feature with a SparkSQL query + a JSON config; a Python cron service reads the configs and auto-generates a production-ready DAG per feature. Generated DAGs execute the SparkSQL query, write to Hive + to dsfeatures, run integrated data-quality checks, and tag Amundsen for discoverability. Canonical instance of config-driven DAG generation. Lyft also runs the homegrown Kyte local development environment ("Airflow local development at Lyft", Airflow Summit 2022), a CLI that lets developers validate configs, test SQL runs, execute DAGs locally, and backfill historical dates confidently.
- sources/2026-03-06-pinterest-unified-context-intent-embeddings-for-scalable-text-to-sql: Airflow orchestrates the Pinterest Vector Database as a Service. Teams submit a JSON schema (index alias + vector dim + source Hive table); an Airflow workflow validates the config, creates the OpenSearch index, runs the ingestion DAG, and publishes discovery metadata. Handles millions of embeddings with daily incremental updates. Second canonical wiki instance of config-driven DAG generation: an analogous shape (JSON config → auto-generated DAG → production-grade platform-managed pipeline) at a different substrate tier. Lyft was features-into-feature-store; Pinterest is embeddings-into-vector-index. Also the canonical wiki instance of patterns/internal-vector-db-as-service.
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge: multi-tenant offline evaluation instance. Zalando's Search Quality Framework uses Airflow to orchestrate its LLM-as-a-judge pipeline: one TaskGroup per market (LU / PT / GR), each containing a test-query-generation + search-result-retrieval + LLM-evaluation + NER-parity chain, with a fan-in consolidation task that aggregates across all TaskGroups. Each stage is a Docker image run via KubernetesPodOperator, so the DAG stays orchestration-only while evaluation logic + dependencies are encapsulated in the image. Canonical wiki instance of concepts/airflow-taskgroup-parallelism, patterns/per-market-parallel-taskgroup-dag, and patterns/podoperator-encapsulated-evaluation-job. A multi-tenant parallel-pipeline shape, distinct from the Lyft / Pinterest config-to-DAG codegen shape.
- Per-PR pipeline-environment shape. Zalando's Performance Marketing org extends Airflow with a DAG class fork that rewrites DAG ids at init to inject the feature-branch name (qu.test_dag → qu.feature1.test_dag), combined with zip-packaged DAG deploys (one zip per PR for dependency isolation) and a separate cron that cleans up the env when the PR closes. Result: each PR gets an isolated pipeline environment on a shared Airflow server in under 1 min, versus ~30 min to spin up a fresh MWAA server per PR. Paired with a matching data environment of per-PR-suffixed Spark databases populated via views over copies, so data-env creation also takes seconds rather than hours. Canonical wiki instance of patterns/per-pr-airflow-environment-via-dag-versioning, concepts/pipeline-environment, concepts/data-environment, concepts/dag-id-rewriting, concepts/airflow-dag-zip-packaging, and patterns/library-fork-for-dag-id-rewrite.
- sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines: Airflow as the orchestration host for SSH-based job execution against EMR clusters at scale. Slack's data platform (built around 2017) had Airflow SSH-ing into EMR master nodes via SSHOperator and 6+ team-built variants, accumulating to 700+ jobs across 7 operator types (CrunchExecOperator, S3SyncOperator, etc.) by 2024. Canonical wiki instance of the SSH job execution anti-pattern at industrial scale. The 3-quarter migration replaced every SSH operator with REST operators that submit through Quarry to YARN / Trino / Snowflake, eliminating the long-lived-SSH-key surface from Airflow workers and delivering server-side job-state survival across Airflow Kubernetes pod restarts. Progress was tracked via an analytics dashboard backed by Airflow metadata-DB queries identifying remaining SSH-based tasks per team / per DAG / per region. Canonical wiki instance of patterns/incremental-operator-by-operator-migration applied to Airflow.
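The DAG-id rewrite from the per-PR entry above (qu.test_dag → qu.feature1.test_dag) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Zalando's implementation: `rewrite_dag_id` and `BranchAwareDag` are hypothetical names, the branch is passed in explicitly rather than discovered from the deploy, and the class is a stand-in so the example runs without Airflow installed. A real fork would subclass Airflow's DAG class and apply the rewrite before calling the parent initializer.

```python
def rewrite_dag_id(dag_id, branch):
    """Insert the branch name after the first dotted segment of the DAG id."""
    if not branch:
        return dag_id  # assumption: the main deployment keeps the original id
    prefix, sep, rest = dag_id.partition(".")
    if not sep:
        # Assumption: ids without a namespace segment are simply prefixed.
        return f"{branch}.{dag_id}"
    return f"{prefix}.{branch}.{rest}"


class BranchAwareDag:
    """Stand-in for a forked DAG class that rewrites its id at init."""

    def __init__(self, dag_id, branch=None):
        self.original_dag_id = dag_id
        self.dag_id = rewrite_dag_id(dag_id, branch)


pr_dag = BranchAwareDag("qu.test_dag", branch="feature1")
main_dag = BranchAwareDag("qu.test_dag")
print(pr_dag.dag_id)    # qu.feature1.test_dag
print(main_dag.dag_id)  # qu.test_dag
```

Because the rewritten ids never collide, DAGs from many PRs can coexist on one shared Airflow server, and a cleanup job can find everything belonging to a closed PR by its branch segment.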
Related¶
- systems/apache-spark — the compute engine Airflow most often schedules in data platforms.
- systems/apache-hive — the catalog / offline store Airflow pipelines commonly land to.
- systems/pinterest-vector-db-service
- systems/zalando-search-quality-framework
- systems/zalando-search-query-clustering
- systems/zalando-marketing-roi-pipeline
- systems/mwaa
- companies/lyft
- companies/pinterest
- companies/zalando
- patterns/config-driven-dag-generation
- patterns/internal-vector-db-as-service
- patterns/per-market-parallel-taskgroup-dag
- patterns/podoperator-encapsulated-evaluation-job
- patterns/per-pr-airflow-environment-via-dag-versioning
- patterns/library-fork-for-dag-id-rewrite
- patterns/view-over-copy-for-test-data-environment
- patterns/cron-driven-pr-closed-cleanup
- concepts/airflow-taskgroup-parallelism
- concepts/pipeline-environment
- concepts/data-environment
- concepts/dag-id-rewriting
- concepts/airflow-dag-zip-packaging
- concepts/per-pr-ephemeral-environment
- concepts/feature-store
- patterns/hybrid-batch-streaming-ingestion
- systems/slack-quarry — Slack's REST gateway that replaced SSH-based Airflow operators against EMR.
- systems/apache-yarn — common Airflow downstream.
- concepts/ssh-job-execution-anti-pattern, concepts/rest-based-job-submission — the paradigm shift in Airflow→EMR submission paths Slack canonicalised.
- patterns/rest-gateway-for-compute-engine-job-submission, patterns/incremental-operator-by-operator-migration — the architectural and rollout patterns from Slack's Airflow-operator migration.