CONCEPT Cited by 1 source

Airflow TaskGroup Parallelism

Definition

Airflow TaskGroup parallelism is the DAG-structuring discipline of placing N instances of the same logical pipeline into N separate TaskGroup subgraphs inside one DAG, so that each instance runs independently in parallel, with an optional final task consolidating all instances' outputs.

It's the Airflow-native answer to "run this pipeline for every market / every tenant / every shard" without spawning N separate DAGs.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

The pattern in Zalando's framework

Zalando quotes it directly:

"[Taskgroup]: We want to be able to evaluate multiple markets in parallel, where each market shares the same flow but with different test queries. Therefore we can implement each evaluation lineage as a task group and put all of them together in the same DAG. This way each task group can run independently in parallel and, once they are all finished, a final task consolidates all evaluation results together." (Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Structurally:

                    ┌────────── TaskGroup: market=LU ───────────┐
           ┌────────┤  generate → retrieve → evaluate → report  ├────────┐
           │        └───────────────────────────────────────────┘        │
           │        ┌────────── TaskGroup: market=PT ───────────┐        │   ┌───────────────┐
DAG entry ─┼────────┤  generate → retrieve → evaluate → report  ├────────┼──►│ consolidation │
           │        └───────────────────────────────────────────┘        │   │ task (fan-in) │
           │        ┌────────── TaskGroup: market=GR ───────────┐        │   └───────────────┘
           └────────┤  generate → retrieve → evaluate → report  ├────────┘
                    └───────────────────────────────────────────┘
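A minimal sketch of this layout in Airflow (2.4+ API assumed). The market list, DAG id, and placeholder callables are illustrative, not from the source — Zalando's actual stages run as KubernetesPodOperator pods rather than Python callables:

```python
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup
import pendulum

MARKETS = ["LU", "PT", "GR"]  # hypothetical market list
STAGES = ["generate", "retrieve", "evaluate", "report"]

with DAG(
    dag_id="search_quality_evaluation",  # illustrative name
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    schedule=None,  # one trigger: cron, manual, or external event
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")

    # Fan-in: runs only after every market's TaskGroup has finished.
    consolidate = PythonOperator(
        task_id="consolidate_results",
        python_callable=lambda: print("merge all market reports"),
    )

    for market in MARKETS:
        # One TaskGroup per market; task ids become e.g. market_LU.generate
        with TaskGroup(group_id=f"market_{market}") as tg:
            prev = None
            for stage in STAGES:
                t = PythonOperator(
                    task_id=stage,
                    python_callable=lambda s=stage, m=market: print(s, m),
                )
                if prev is not None:
                    prev >> t  # serial chain inside the group
                prev = t
        start >> tg >> consolidate  # fan-out, then fan-in
```

The loop body is identical for every market; only the `group_id` and the parameters baked into each callable differ, which is exactly the "same flow, different test queries" property the quote describes.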

Why TaskGroups, not separate DAGs

Two SRE-relevant properties:

  • One scheduling unit. One trigger (cron, manual, external event) fans out to all markets; one status readout aggregates back. No per-tenant cron herd, no per-tenant alerting rule proliferation.
  • One consolidation task. Cross-market reports ("which markets share low-scoring brand segments") require all market results in one place. Separate DAGs would force an external aggregator; TaskGroups keep the fan-in inside Airflow.

Why TaskGroups, not serial iteration

The obvious alternative — one task that loops over markets — is worse along three axes:

  • No per-market retry isolation. One market's transient failure forces re-running all other markets.
  • Observability collapses. The Airflow UI shows one running task, not per-market status.
  • No true parallelism. A serial loop's runtime is the sum of all per-market runtimes; N TaskGroups run concurrently (subject to pool and parallelism limits), so runtime approaches the slowest market's.
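The runtime difference can be made concrete with a toy model (pure Python stand-in, no Airflow; the per-market durations are invented for illustration):

```python
# Serial loop costs the SUM of per-market durations; parallel fan-out
# costs roughly the MAX (the slowest market gates the fan-in).
from concurrent.futures import ThreadPoolExecutor
import time

DURATIONS = {"LU": 0.2, "PT": 0.1, "GR": 0.3}  # hypothetical seconds

def evaluate(market: str) -> str:
    time.sleep(DURATIONS[market])  # stand-in for the evaluation pipeline
    return f"report:{market}"

# Serial iteration: ~0.6 s (0.2 + 0.1 + 0.3)
t0 = time.perf_counter()
serial = [evaluate(m) for m in DURATIONS]
serial_wall = time.perf_counter() - t0

# Parallel fan-out, then fan-in (consolidation waits for all results)
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(DURATIONS)) as pool:
    parallel = list(pool.map(evaluate, DURATIONS))
parallel_wall = time.perf_counter() - t0

consolidated = sorted(parallel)  # the fan-in step sees every report
print(f"serial {serial_wall:.2f}s, parallel {parallel_wall:.2f}s")
```

The same shape holds in Airflow, except that each "thread" is a task slot and retries apply per task instead of per loop.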

What's inside each TaskGroup

In Zalando's case, each TaskGroup contains three KubernetesPodOperator stages plus an NER-analyser sidecar, all encapsulated as Docker images. See patterns/podoperator-encapsulated-evaluation-job.

Tradeoffs

  • DAG complexity scales with N markets. Very large N (say, hundreds of tenants) starts to strain the Airflow scheduler's DAG-parse time and UI rendering; at that scale, dynamic task mapping (Airflow 2.3+) or separate DAGs with a trigger-all parent DAG become better fits.
  • Shared resource contention at fan-out. All TaskGroups hitting the same Product API / Elasticsearch cluster at once can saturate downstream services. Zalando's cache partially mitigates this but doesn't eliminate it; the source doesn't discuss rate limiting.
  • Consolidation task is a serial bottleneck. The slowest TaskGroup gates the fan-in, so total runtime = max(TaskGroup runtimes) + consolidation time.
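For the large-N case, the dynamic task mapping alternative mentioned above (Airflow 2.3+, TaskFlow API) can be sketched as follows; the DAG id, task names, and placeholder bodies are illustrative, not from the source:

```python
from airflow.decorators import dag, task
import pendulum

@dag(
    dag_id="search_quality_mapped",  # illustrative name
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
)
def search_quality_mapped():
    @task
    def list_markets() -> list[str]:
        # Resolved at runtime, so adding a market needs no DAG-file change
        # and DAG-parse time stays flat as N grows.
        return ["LU", "PT", "GR"]

    @task
    def evaluate_market(market: str) -> dict:
        return {"market": market, "score": 0.0}  # placeholder pipeline

    @task
    def consolidate(results: list[dict]) -> None:
        # Fan-in: receives the result of every mapped instance.
        print(results)

    # .expand() creates one task instance per market at runtime.
    consolidate(evaluate_market.expand(market=list_markets()))

search_quality_mapped()
```

The tradeoff relative to static TaskGroups: each mapped instance here is a single task, so the per-market multi-stage chain would need mapped task groups (Airflow 2.5+) to fully reproduce Zalando's structure.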
