PATTERN Cited by 1 source
Python DSL Wrapping CloudFormation¶
Intent¶
When a large engineering org has a recurring domain-shaped
infrastructure need (ML pipelines, data pipelines, standard
microservices, event-driven workers) whose expression in raw
CloudFormation YAML or
AWS CDK TypeScript would be verbose and
error-prone, build a narrower Python DSL whose primitives
match the domain and whose .generate() call emits a
CloudFormation template — which then flows through the normal
continuous-delivery pipeline to AWS. Users of the DSL never see
CloudFormation, Step Functions ASL, or CDK.
Context¶
Drawn from the Zalando ML Platform team's 2022 post (sources/2022-04-18-zalando-zalandos-machine-learning-platform), but the pattern is general.
Preconditions for the pattern to pay off:
- Enough consuming teams. Zalando has over a hundred product teams using zflow; since 2019 they have authored hundreds of pipelines.
- Enough recurring shape. ML pipelines at Zalando are all the same rough shape: preprocessing → training → batch predictions → reporting → endpoint deployment. That shape justifies a narrow DSL.
- A central platform team that can build + maintain the DSL. The 2022 post names "two teams actively develop zflow and monitoring tools for pipelines."
- Consumers are not infrastructure engineers. Zalando's zflow users are applied scientists and ML engineers — the DSL's value is protecting them from YAML / CDK internals.
Motivation¶
From Zalando's 2022 disclosure, verbatim:
"CloudFormation templates are highly expressive and allow developers to describe even minute details. Unfortunately, CF files can become verbose and are tedious to edit manually. We addressed this problem by creating zflow, a Python tool for building machine learning pipelines."
See concepts/cloudformation-verbosity-problem for the detailed catalogue of YAML-authoring pain.
Solution¶
Build a Python library exposing a domain-specific builder API:
# Zalando zflow example, verbatim from the 2022 post:
data_processing = databricks_job("data_processing_job")
training = training_job("training_job")
batch_inference = batch_transform_job("batch_transform_job")
pipeline = PipelineBuilder("example-pipeline")
pipeline \
    .add_stage(data_processing) \
    .add_stage(training) \
    .add_stage(batch_inference)
stack = StackBuilder("example-stack")
stack.add_pipeline(pipeline)
stack.generate(output_location="zflow_pipeline.yaml")
Under the hood:
- The DSL's builder objects carry type hints, so IDE autocomplete and linters catch many errors before template synthesis.
- stack.generate() invokes AWS CDK to synthesise a CloudFormation template (JSON or YAML). CDK handles the tedious resource-graph details.
- The generated template is written to a file in the user's repo.
- User commits + pushes the template. A continuous delivery pipeline (Zalando calls theirs CDP) picks it up and deploys via CloudFormation.
- CloudFormation materialises the Step Functions state machine + Lambdas + IAM policies.
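The flow above can be sketched in a few dozen lines. This is a minimal illustration, not zflow's implementation: the post says zflow calls AWS CDK, whereas this sketch builds the template dictionary directly with the standard library so that it runs without dependencies. The class and function names mirror the example in this section, but their bodies, the `PipelineRoleArn` parameter, and the logical-ID scheme are assumptions. Real SageMaker task states would also need a `Parameters` block, which is omitted here.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Stage:
    """One pipeline step; `resource` is a Step Functions task resource ARN."""
    name: str
    resource: str


def training_job(name: str) -> Stage:
    # Domain primitive: hides the SageMaker service-integration ARN.
    return Stage(name, "arn:aws:states:::sagemaker:createTrainingJob.sync")


def batch_transform_job(name: str) -> Stage:
    return Stage(name, "arn:aws:states:::sagemaker:createTransformJob.sync")


@dataclass
class PipelineBuilder:
    name: str
    stages: list[Stage] = field(default_factory=list)

    def add_stage(self, stage: Stage) -> "PipelineBuilder":
        self.stages.append(stage)
        return self  # returning self allows chained calls

    def to_asl(self) -> dict[str, Any]:
        """Compile the stages into a linear Amazon States Language definition."""
        if not self.stages:
            raise ValueError("a pipeline needs at least one stage")
        states: dict[str, Any] = {}
        for i, stage in enumerate(self.stages):
            state: dict[str, Any] = {"Type": "Task", "Resource": stage.resource}
            if i + 1 < len(self.stages):
                state["Next"] = self.stages[i + 1].name
            else:
                state["End"] = True
            states[stage.name] = state
        return {"StartAt": self.stages[0].name, "States": states}


def logical_id(name: str) -> str:
    # CloudFormation logical IDs must be alphanumeric (hypothetical naming scheme).
    return "".join(w.capitalize() for w in re.split(r"[^A-Za-z0-9]+", name) if w)


@dataclass
class StackBuilder:
    name: str
    pipelines: list[PipelineBuilder] = field(default_factory=list)

    def add_pipeline(self, pipeline: PipelineBuilder) -> "StackBuilder":
        self.pipelines.append(pipeline)
        return self

    def generate(self) -> dict[str, Any]:
        """Return a CloudFormation template with one state machine per pipeline."""
        resources = {
            logical_id(p.name): {
                "Type": "AWS::StepFunctions::StateMachine",
                "Properties": {
                    "StateMachineName": p.name,
                    "DefinitionString": json.dumps(p.to_asl()),
                    "RoleArn": {"Ref": "PipelineRoleArn"},  # assumed parameter
                },
            }
            for p in self.pipelines
        }
        return {
            "AWSTemplateFormatVersion": "2010-09-09",
            "Parameters": {"PipelineRoleArn": {"Type": "String"}},
            "Resources": resources,
        }


pipeline = PipelineBuilder("example-pipeline")
pipeline.add_stage(training_job("training_job")).add_stage(
    batch_transform_job("batch_transform_job")
)
template = StackBuilder("example-stack").add_pipeline(pipeline).generate()
print(json.dumps(template, indent=2))
```

The point of the sketch is the division of labour: the user touches only `training_job`, `PipelineBuilder`, and `StackBuilder`, while ASL state wiring and the CloudFormation resource shape stay inside the library.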
Six-step Zalando flow (verbatim labels)¶
From the post:
- We describe our ML pipelines in Python scripts with zflow DSL.
- When we run the pipeline script, zflow will internally call AWS CDK to generate a CloudFormation template.
- We commit and push the template to a git repository, and Zalando Continuous Delivery Platform will then upload it to AWS CloudFormation.
- CloudFormation will create all the resources specified in the template, most notably: a Step Functions workflow. Our pipeline is now ready to run.
- A web portal built with Backstage provides a visual overview of running pipelines.
(The post lists six steps overall; the enumeration above condenses the related ones.)
Consequences¶
Pros:
- Domain concepts become first-class. A zflow user writes training_job(...), not a raw Step Functions ASL state with a SageMaker service integration and a bespoke IAM role.
- Type-checked authoring. Python's type hints catch errors that YAML templates only catch at deploy time.
- Code first, template second. The canonical artefact checked into git is the Python DSL script, not the generated YAML. The YAML is a compiled intermediate (though it also gets committed for CD to deploy).
- Delivery pipeline reuse. The DSL-generated template rides the normal CD pipeline for infrastructure, not a special ML pipeline. This is by design: "All that is needed now is to commit and push the generated template to the git repository and let Zalando Continuous Delivery Platform (CDP) deploy it to AWS."
- Platform team can enforce conventions. IAM least-privilege, observability, tagging, retry policies, and so on can be baked into the DSL's default code-gen and become impossible to forget.
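As an illustration of the last two points, here is a hedged sketch of how typed signatures and baked-in defaults might look. None of these helpers come from zflow: `task_state`, `state_machine_resource`, the retry values, and the tag keys are hypothetical. The `Retry` fields and the `Tags` shape follow the Amazon States Language and the `AWS::StepFunctions::StateMachine` resource respectively.

```python
import json
from typing import Any


def task_state(resource: str, *, max_attempts: int = 3) -> dict[str, Any]:
    """Build a terminal Task state that always carries a retry policy."""
    if max_attempts < 0:
        raise ValueError("max_attempts must be non-negative")
    return {
        "Type": "Task",
        "Resource": resource,
        # Convention baked in by the platform team: callers cannot forget it.
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": max_attempts,
                "BackoffRate": 2.0,
            }
        ],
        "End": True,
    }


def state_machine_resource(
    name: str, definition: dict[str, Any], *, owner_team: str, role_arn: str
) -> dict[str, Any]:
    """Wrap an ASL definition in a CloudFormation resource with mandatory tags."""
    if not owner_team:
        raise ValueError("owner_team is required so every state machine is tagged")
    return {
        "Type": "AWS::StepFunctions::StateMachine",
        "Properties": {
            "StateMachineName": name,
            "DefinitionString": json.dumps(definition),
            "RoleArn": role_arn,
            "Tags": [
                {"Key": "team", "Value": owner_team},
                {"Key": "managed-by", "Value": "pipeline-dsl"},
            ],
        },
    }


train = task_state("arn:aws:states:::sagemaker:createTrainingJob.sync")
resource = state_machine_resource(
    "example-pipeline",
    {"StartAt": "train", "States": {"train": train}},
    owner_team="ml-platform",  # placeholder team name
    role_arn="arn:aws:iam::123456789012:role/example",  # placeholder ARN
)
```

Because `resource` and `owner_team` are annotated as `str`, a static checker such as mypy would report a call like `task_state(42)` before any template is generated, and the runtime check rejects an empty owner.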
Cons:
- Maintenance cost for the platform team. At least one team's worth of engineering investment. Zalando dedicates two teams.
- Abstraction leaks. When the DSL doesn't support a new AWS feature, users are blocked until the DSL is extended.
- DSL forks over time. The DSL accretes per-team special cases; versioning and deprecation become real operational work.
- Learning the DSL instead of CDK. For small orgs or infrastructure-literate users, a narrow DSL can be net slower to learn than vanilla CDK.
Canonical instance¶
- systems/zflow (Zalando ML Platform, ≥ 2019): Python DSL over CDK whose primitives are databricks_job, training_job, batch_transform_job, and hyperparameter_tuning, plus flow-control for conditional / parallel stages. Used to author hundreds of ML pipelines across 100+ product teams. The pipeline's compiled output is a Step Functions state machine invoking SageMaker, systems/databricks, and Lambda as stage steps. (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
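To make the flow-control point concrete, the following sketch shows one way a "parallel stages" primitive could compile to a Step Functions `Parallel` state. The post does not show zflow's flow-control API, so `chain` and `parallel` are hypothetical helpers. Only the output shape (`Type`, `Branches`, `StartAt`, `States`, `Next`, `End`) follows the Amazon States Language.

```python
from typing import Any, Optional

TRAINING = "arn:aws:states:::sagemaker:createTrainingJob.sync"


def chain(steps: list[tuple[str, str]]) -> dict[str, Any]:
    """Compile ordered (name, resource) pairs into a linear sub-workflow."""
    if not steps:
        raise ValueError("a branch needs at least one step")
    states: dict[str, Any] = {}
    for i, (name, resource) in enumerate(steps):
        state: dict[str, Any] = {"Type": "Task", "Resource": resource}
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1][0]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


def parallel(
    branches: list[list[tuple[str, str]]], next_state: Optional[str] = None
) -> dict[str, Any]:
    """Compile several branches into one ASL Parallel state."""
    state: dict[str, Any] = {
        "Type": "Parallel",
        "Branches": [chain(branch) for branch in branches],
    }
    # A state is either terminal or names its successor, never both.
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


# Two training jobs run side by side, then control passes to "report".
fan_out = parallel(
    [[("train_model_a", TRAINING)], [("train_model_b", TRAINING)]],
    next_state="report",
)
```

Each branch is itself a small state machine with its own `StartAt`, which is why the DSL user benefits from not writing this nesting by hand.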
Variants¶
- Narrow DSL for data pipelines — same pattern, different domain: Apache Airflow's Python DAG is a closely-related idea, though Airflow is a runtime scheduler rather than a CloudFormation template generator.
- AWS SAM — a domain-narrow YAML DSL for serverless applications that generates CloudFormation; lower-level than Zalando's zflow but same compile-to-CloudFormation idea.
- Terraform modules — similar compile-to-provider-API idea in a different IaC universe.
Related¶
- systems/zflow — canonical instance.
- systems/aws-cdk · systems/cloudformation — the substrates underneath.
- systems/aws-step-functions — the typical compiled target.
- concepts/cloudformation-verbosity-problem — the motivating pain.
- concepts/declarative-lifecycle-api — related pattern.
- patterns/managed-services-over-custom-ml-platform — zflow's umbrella pattern: the DSL is the Python-DSL realisation of that broader architectural choice.
- patterns/ml-platform-internal-consulting-team — who builds and maintains such a DSL at Zalando's scale.