SYSTEM Cited by 2 sources
Apache Airflow¶
Apache Airflow is the open-source workflow orchestration system originally created at Airbnb in 2014 (donated to the ASF; top-level project in 2019). Workflows are expressed as Python-defined Directed Acyclic Graphs (DAGs) of operator tasks; the Airflow scheduler parses the DAG files, schedules task instances according to their dependencies and their cron-like or data-aware schedule, and tracks run state in a metadata database. Tasks execute on worker processes via one of a handful of executors (LocalExecutor, CeleryExecutor, KubernetesExecutor).
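The scheduler's core job described above, running each task only after its upstream dependencies have completed, can be sketched as a plain-Python topological walk. This is an illustration of the dependency-ordering idea, not Airflow's actual scheduler code:

```python
from collections import deque

def topo_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order for a DAG.

    `deps` maps each task id to the set of upstream task ids it
    depends on (the same information Airflow derives from
    `task_a >> task_b` edges in a DAG file).
    """
    indegree = {task: len(upstreams) for task, upstreams in deps.items()}
    downstream: dict[str, list[str]] = {task: [] for task in deps}
    for task, upstreams in deps.items():
        for up in upstreams:
            downstream[up].append(task)

    # Tasks with no unmet upstream dependencies are ready to run.
    ready = deque(task for task, d in indegree.items() if d == 0)
    order: list[str] = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for down in downstream[task]:
            indegree[down] -= 1
            if indegree[down] == 0:
                ready.append(down)

    if len(order) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return order

# A classic extract >> transform >> load chain.
print(topo_order({
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}))
```

A real scheduler also tracks per-run task state in the metadata database and hands ready tasks to an executor, but the ordering logic reduces to this walk.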
For the purposes of this wiki, Airflow is the default substrate for batch orchestration across data platforms — ETL, feature pipelines, ML training jobs, data-quality jobs, periodic reports, warehouse-to-store sync. The most common managed offering is Astronomer-hosted Airflow (commercial SaaS on top of OSS Airflow, founded by original Airflow contributors).
Common shapes it appears in¶
- Hand-written DAGs — one Python file per pipeline; the classic usage.
- Auto-generated DAGs — a higher-level configuration (YAML / JSON / SQL-plus-metadata) is compiled into Airflow DAGs by a codegen step. This is the config-driven DAG generation pattern: the platform owns the DAG boilerplate (monitoring, data-quality gates, alerting, metadata tagging); customers own only the feature- or dataset-specific query and metadata.
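A minimal sketch of the codegen step, assuming a hypothetical config shape (`name`, `query`, optional `schedule`): the platform wraps every customer query in the same boilerplate task chain, so each generated pipeline is uniform. Illustrative only, not any specific platform's generator:

```python
def generate_dag_spec(config: dict) -> dict:
    """Compile a customer config into a full pipeline spec.

    The customer supplies only the dataset-specific fields; the
    platform contributes the boilerplate stages (data-quality gate,
    alerting, metadata tagging). A real generator would emit an
    Airflow DAG object or DAG file from this spec.
    """
    name = config["name"]
    return {
        "dag_id": f"generated__{name}",
        "schedule": config.get("schedule", "@daily"),
        "tasks": [
            # Customer-owned: the query itself.
            {"id": "run_query", "sql": config["query"], "upstream": []},
            # Platform-owned boilerplate around it.
            {"id": "quality_check", "upstream": ["run_query"]},
            {"id": "alert_on_failure", "upstream": ["quality_check"]},
            {"id": "tag_metadata", "upstream": ["quality_check"]},
        ],
    }

# Hypothetical customer config: name and SQL only.
spec = generate_dag_spec({
    "name": "rider_trips_daily",
    "query": "SELECT * FROM trips",
})
print(spec["dag_id"])
```

The key property is the ownership split: changing the quality-check or alerting stage is a one-line change in the generator, applied to every generated DAG at once.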
Seen in¶
- sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution — Lyft's Feature Store batch lane runs on Astronomer-hosted Airflow. Customers define a feature with a SparkSQL query + a JSON config; a Python cron service reads the configs and auto-generates a production-ready DAG per feature. Generated DAGs execute the SparkSQL query, write to Hive + to dsfeatures, run integrated data-quality checks, and tag Amundsen for discoverability. Canonical instance of config-driven DAG generation. Lyft also runs the homegrown Kyte local development environment ("Airflow local development at Lyft", Airflow Summit 2022) — a CLI that lets developers validate configs, test SQL runs, execute DAGs locally, and backfill historical dates confidently.
- sources/2026-03-06-pinterest-unified-context-intent-embeddings-for-scalable-text-to-sql — Airflow orchestrates the Pinterest Vector Database as a Service. Teams submit a JSON schema (index alias + vector dim + source Hive table); an Airflow workflow validates the config, creates the OpenSearch index, runs the ingestion DAG, and publishes discovery metadata. Handles millions of embeddings with daily incremental updates. Second canonical wiki instance of config-driven DAG generation — analogous shape (JSON config → auto-generated DAG → production-grade platform-managed pipeline) at a different substrate tier: Lyft was features-into-feature-store; Pinterest is embeddings-into-vector-index. Also the canonical wiki instance of patterns/internal-vector-db-as-service.
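The first step of the Pinterest-style workflow, validating the submitted JSON schema before any index is created, might look roughly like this. The field names (`index_alias`, `vector_dim`, `source_table`) are assumptions based on the description above, not Pinterest's actual schema:

```python
def validate_index_config(config: dict) -> list[str]:
    """Return a list of validation errors (empty list = valid).

    Checks the three fields described above: an index alias, a
    vector dimension, and a source Hive table. Field names are
    illustrative guesses, not the real JSON schema.
    """
    errors: list[str] = []
    for field in ("index_alias", "vector_dim", "source_table"):
        if field not in config:
            errors.append(f"missing required field: {field}")

    dim = config.get("vector_dim")
    if dim is not None and (not isinstance(dim, int) or dim <= 0):
        errors.append("vector_dim must be a positive integer")

    table = config.get("source_table", "")
    if table and "." not in table:
        errors.append("source_table must be db-qualified (db.table)")

    return errors

# Hypothetical valid submission.
print(validate_index_config({
    "index_alias": "pin_text_embeddings",
    "vector_dim": 256,
    "source_table": "default.pin_embeddings",
}))
```

Only once this gate passes would the workflow proceed to create the OpenSearch index and kick off the ingestion DAG; rejecting bad configs up front keeps failures out of the expensive ingestion stage.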
Related¶
- systems/apache-spark — the compute engine Airflow most often schedules in data platforms.
- systems/apache-hive — the catalog / offline store Airflow pipelines commonly land to.
- systems/pinterest-vector-db-service
- companies/lyft
- companies/pinterest
- patterns/config-driven-dag-generation
- patterns/internal-vector-db-as-service
- concepts/feature-store
- patterns/hybrid-batch-streaming-ingestion