AWS Managed Flink (Managed Service for Apache Flink)
AWS Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink) is AWS's managed runtime for Apache Flink jobs. It provisions capacity in KPU (Kinesis Processing Unit) bundles and manages checkpoints/savepoints to S3.
KPU unit economics
Per the service disclaimer quoted in the Zalando 2026-03 post (sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions):
"Managed Service for Apache Flink provisions capacity as KPUs. A single KPU provides you with 1 vCPU and 4GB of memory. For every KPU allocated, 50GB of running application storage is also provided. This means that the application resources are always configured in terms of KPUs, there's no way to allocate more storage without also allocating more CPU and memory, or more memory without also allocating more CPU and storage."
Canonicalised as concepts/kpu-aws-managed-flink.
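The bundling rule in the quote reduces to taking the largest of three per-dimension KPU counts. The following is a minimal sketch of that arithmetic, not an AWS API: the function name `kpus_required` and the example job are hypothetical, and only the per-KPU figures (1 vCPU, 4 GB memory, 50 GB storage) come from the disclaimer.

```python
import math

# Per-KPU resources as stated in the quoted service disclaimer.
VCPU_PER_KPU = 1
MEMORY_GB_PER_KPU = 4
STORAGE_GB_PER_KPU = 50


def kpus_required(vcpu: float, memory_gb: float, storage_gb: float) -> tuple[int, str]:
    """Return (KPU count, binding dimension) for a job's resource needs.

    Hypothetical helper: every dimension is rounded up to whole KPUs and
    the largest one decides the allocation, because the three resources
    cannot be bought separately.
    """
    per_dimension = {
        "vcpu": math.ceil(vcpu / VCPU_PER_KPU),
        "memory": math.ceil(memory_gb / MEMORY_GB_PER_KPU),
        "storage": math.ceil(storage_gb / STORAGE_GB_PER_KPU),
    }
    binding = max(per_dimension, key=per_dimension.get)
    return max(1, per_dimension[binding]), binding


# Hypothetical state-heavy job: modest compute, 1 TB of local state.
kpus, binding = kpus_required(vcpu=4, memory_gb=16, storage_gb=1000)
print(kpus, binding)  # 20 storage
```

In this made-up case the job needs only 4 vCPU and 16 GB of memory, but storage forces 20 KPUs, so it is allocated 20 vCPU and 80 GB of memory.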
Operational consequences:
- State-heavy jobs over-provision CPU and memory just to get enough local storage (50 GB per KPU).
- Every stop creates a savepoint by default ("this is a configurable setting in AWS Managed Flink that we had enabled"), so any scale-in/scale-out triggers a full snapshot. With large state, scaling windows become as long as the savepoint itself — Zalando saw 11–20 min per scaling operation.
- Operators carry steady overscale margin (Zalando kept 10–20 % higher parallelism than normally required) to absorb the lag/restart cycle, and that margin shows up on the bill.
- Available Flink version lags upstream. As of Feb 2026, the service only offered Flink 1.20, which does not include the MultiJoin operator (Flink 2.1, experimental). Teams that hit Table-API state amplification on managed Flink cannot wait for a version bump; they rewrite to DataStream API by hand.
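To put a rough number on the overscale margin mentioned in the list above, the sketch below converts a 10–20 % parallelism margin into extra KPUs. It assumes the KPU count is the job parallelism divided by a tasks-per-KPU setting, rounded up; the helper names and the base parallelism of 40 are hypothetical and do not come from the Zalando post.

```python
def with_margin(base_parallelism: int, margin_pct: int) -> int:
    """Parallelism after adding a percentage margin, rounded up.

    Integer arithmetic avoids float artefacts such as 40 * 1.1 > 44.
    """
    return -((-base_parallelism * (100 + margin_pct)) // 100)


def kpus_for_parallelism(parallelism: int, parallelism_per_kpu: int = 1) -> int:
    """Assumed mapping: KPUs = ceil(parallelism / tasks per KPU)."""
    return -(-parallelism // parallelism_per_kpu)


base = 40  # hypothetical parallelism that would normally be enough
for margin_pct in (10, 20):
    padded = with_margin(base, margin_pct)
    extra = kpus_for_parallelism(padded) - kpus_for_parallelism(base)
    print(f"{margin_pct}% margin: parallelism {padded}, {extra} extra KPUs")
```

Because each KPU carries the full vCPU, memory, and storage bundle, the extra KPUs are billed as complete bundles even when the margin exists only to absorb lag during restarts.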
Seen in
- sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions — Zalando's Product Offer Enrichment pipeline runs on AWS Managed Flink 1.20. The KPU bundling is the direct reason state reduction (−76 %) delivered only ~13 % AWS cost savings, not proportional: vCPU and memory requirements didn't drop as much as storage, and memory requirements are bound by the same KPU bundle that provides storage.
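A worked example can make the non-proportional saving concrete. The resource figures below are invented for illustration and are not Zalando's actual numbers; only the per-KPU bundle (1 vCPU, 4 GB memory, 50 GB storage) and the 76 % storage reduction mirror the source.

```python
import math


def kpus(vcpu: float, memory_gb: float, storage_gb: float) -> int:
    """KPUs needed when the largest per-dimension requirement wins."""
    return max(
        math.ceil(vcpu / 1),          # 1 vCPU per KPU
        math.ceil(memory_gb / 4),     # 4 GB memory per KPU
        math.ceil(storage_gb / 50),   # 50 GB storage per KPU
    )


# Hypothetical job before the rewrite: storage is the binding dimension.
before = kpus(vcpu=40, memory_gb=180, storage_gb=2500)

# Hypothetical job after: storage drops 76 % (2500 -> 600 GB), while CPU
# and memory barely move, so memory becomes the binding dimension.
after = kpus(vcpu=38, memory_gb=170, storage_gb=600)

saving = (before - after) / before
print(before, after, f"{saving:.0%}")  # 50 43 14%
```

Once storage stops being the largest requirement, further state reduction saves nothing; the bill then tracks whichever dimension binds next.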
Related
- systems/apache-flink — engine being managed.
- concepts/kpu-aws-managed-flink — the vCPU+RAM+storage bundling primitive.
- concepts/flink-snapshot-savepoint — the savepoint-on-stop behaviour that interacts with scale-in/out cost.
- companies/aws — provider.