PATTERN

PodOperator-encapsulated Evaluation Job

Intent

Ship each evaluation-pipeline stage as a Docker image run via KubernetesPodOperator, so that:

  • The Airflow DAG stays small — just step wiring, no business logic.
  • Stage dependencies (Python libraries, ML model clients, custom binaries) live in the image, not the Airflow worker.
  • Stage code is independently versioned and testable.
  • Scaling is per-pod via Kubernetes, not per-Airflow-worker.

(Source: sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge.)

Structure

Airflow DAG
   ├─ KubernetesPodOperator: image=zalando/sq-generate:v42
   ├─ KubernetesPodOperator: image=zalando/sq-retrieve:v42
   ├─ KubernetesPodOperator: image=zalando/sq-evaluate:v42
   └─ KubernetesPodOperator: image=zalando/sq-ner-diff:v42

Each image owns:
   - its dependencies (requirements.txt / Gemfile / Cargo.toml)
   - its entrypoint (evaluator.py / retriever.py / ...)
   - its resource requests (CPU / memory / GPU)
   - its env-config contract (secrets, mount paths)

Airflow DAG owns:
   - pod spec template
   - task dependencies / ordering
   - retry policy
   - upstream/downstream XCom data handoff
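
A minimal sketch of that split in DAG code (not from the source), assuming Airflow 2.x and the cncf.kubernetes provider; dag_id, schedule, and retry numbers are invented:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Retry policy lives in the DAG, per the split above.
default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="search_quality_evaluation",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    default_args=default_args,
) as dag:
    generate = KubernetesPodOperator(
        task_id="generate", name="sq-generate",
        image="zalando/sq-generate:v42",  # image owns deps + entrypoint
    )
    retrieve = KubernetesPodOperator(
        task_id="retrieve", name="sq-retrieve",
        image="zalando/sq-retrieve:v42",
    )
    evaluate = KubernetesPodOperator(
        task_id="evaluate", name="sq-evaluate",
        image="zalando/sq-evaluate:v42",
    )
    ner_diff = KubernetesPodOperator(
        task_id="ner_diff", name="sq-ner-diff",
        image="zalando/sq-ner-diff:v42",
    )

    # Pure orchestration: wiring only, no business logic.
    generate >> retrieve >> evaluate >> ner_diff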

Zalando's quote

"[PodOperator]: The evaluation code is shipped in a docker image and we can run it in our Kubernetes cluster via Airflow using PodOperator. This keeps the DAG code clean and simple, as all complex logic for the evaluation and their dependencies are encapsulated in the image."

Why this over in-worker Python

  • Dependency isolation. An LLM-judge stage might need openai, an ES-retrieval stage needs elasticsearch-py, a product-data stage needs Zalando-internal SDKs. Co-installing all of those on the Airflow workers creates a version-constraint maze; per-image isolation eliminates it.
  • Resource isolation. LLM-calling stages need many concurrent connections and substantial memory; retrieval stages need very little. Per-pod resource requests let Kubernetes schedule each optimally (see the sketch after this list).
  • Independent release cadence. Stage-specific bug fixes don't need DAG redeploy; just bump the image tag.
  • Cleaner DAG code. The Airflow DAG becomes a pure orchestration spec — operators, dependencies, retries. Much easier to read, diff, and review than a DAG where each task is a hundred-line Python function.
  • Local testing. The Docker image runs outside Airflow on a developer's machine, so integration tests don't need an Airflow installation.
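
A hedged sketch of per-pod resource requests; container_resources is the cncf.kubernetes provider's parameter, and the numbers are invented to contrast a heavy LLM-judge pod with a light retrieval pod:

from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Heavy pod: concurrent LLM calls, image fetching, caching.
evaluate = KubernetesPodOperator(
    task_id="evaluate", name="sq-evaluate",
    image="zalando/sq-evaluate:v42",
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "4Gi"},
        limits={"cpu": "4", "memory": "8Gi"},
    ),
)
# Light pod: a thin HTTP client against the search microservice.
retrieve = KubernetesPodOperator(
    task_id="retrieve", name="sq-retrieve",
    image="zalando/sq-retrieve:v42",
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "250m", "memory": "512Mi"},
    ),
)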

Stage-specific considerations

  • Test-query generation — needs access to the Data Lake and the NER engine's lemmatiser; image bundles pyspark + NER client.
  • Search-result retrieval — HTTP client against the search microservice; lightweight image.
  • LLM evaluation — OpenAI SDK + image-fetching client + cache client; most complex image, most expensive pod.
  • NER analyser — NER engine client; lighter than evaluation pod.

Handoff between pods

PodOperator-to-PodOperator data handoff typically uses:

  • Shared object storage (S3 / GCS) for large artefacts — test-query sets, result sets, scores. The DAG passes object-store URIs as XCom.
  • Airflow XCom for small control metadata — run IDs, counts, segment boundaries.
  • Per-run namespace (run ID in object-store prefix) to isolate parallel runs.

The source doesn't explicitly describe Zalando's handoff mechanism; passing object-store URIs through XCom is the usual shape, sketched below.
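
A sketch of that usual shape, assumed rather than sourced; bucket, prefix, env-var, and flag names are hypothetical. The upstream container writes its artefact to object storage and drops the URI where the PodOperator's XCom sidecar picks it up; the DAG then templates the URI into the next pod's arguments:

# Inside the generate image's entrypoint:
import json
import os
import pathlib

run_id = os.environ["RUN_ID"]                   # injected by the DAG; isolates parallel runs
uri = f"s3://sq-eval/{run_id}/queries.parquet"  # hypothetical bucket/prefix
# ... write the test-query set to `uri`, e.g. via boto3 ...

# With do_xcom_push=True, the operator's sidecar reads this file as the task's XCom.
pathlib.Path("/airflow/xcom").mkdir(parents=True, exist_ok=True)
pathlib.Path("/airflow/xcom/return.json").write_text(json.dumps(uri))

# In the DAG: only the small URI travels via XCom; the large artefact stays in S3.
generate = KubernetesPodOperator(
    task_id="generate", name="sq-generate",
    image="zalando/sq-generate:v42",
    env_vars={"RUN_ID": "{{ run_id }}"},
    do_xcom_push=True,
)
retrieve = KubernetesPodOperator(
    task_id="retrieve", name="sq-retrieve",
    image="zalando/sq-retrieve:v42",
    arguments=["--queries-uri", "{{ ti.xcom_pull(task_ids='generate') }}"],
)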

Variations

  • Per-stage image (Zalando's shape).
  • Monolithic image + entrypoint-selector. Single image, different --mode=generate | retrieve | evaluate entry points (sketched after this list). Trades dependency isolation for single-image discipline.
  • Language-polyglot. Different stages in different languages — e.g. the retrieval stage in Go, the evaluation stage in Python.
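
A sketch of the entrypoint-selector variation; module layout and stage function names are hypothetical:

import argparse

def generate(): ...   # each stage's real logic would live here
def retrieve(): ...
def evaluate(): ...
def ner_diff(): ...

STAGES = {"generate": generate, "retrieve": retrieve,
          "evaluate": evaluate, "ner-diff": ner_diff}

def main():
    # One image, one entrypoint; each Airflow task selects its stage, e.g.
    #   arguments=["--mode", "retrieve"]
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=sorted(STAGES), required=True)
    STAGES[parser.parse_args().mode]()

if __name__ == "__main__":
    main()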

Seen in

  • sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge