
SYSTEM Cited by 1 source

AWS Managed Flink (Managed Service for Apache Flink)

AWS Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink) is AWS's managed runtime for Apache Flink jobs. It provisions capacity in KPU (Kinesis Processing Unit) bundles and manages checkpoints/savepoints to S3.

KPU unit economics

Per the service disclaimer quoted in the Zalando 2026-03 post (sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions):

"Managed Service for Apache Flink provisions capacity as KPUs. A single KPU provides you with 1 vCPU and 4GB of memory. For every KPU allocated, 50GB of running application storage is also provided. This means that the application resources are always configured in terms of KPUs, there's no way to allocate more storage without also allocating more CPU and memory, or more memory without also allocating more CPU and storage."

Canonicalised as concepts/kpu-aws-managed-flink.
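The fixed bundle in the quote can be read as a sizing rule: the KPU count is set by whichever of the three resources needs the most bundles. The sketch below is a minimal illustration of that arithmetic only. The helper names `kpus_required` and `stranded_capacity` are hypothetical, the workload figures are made up for the example, and any other billing or configuration detail of the service is deliberately ignored.

```python
import math

# Per-KPU bundle as stated in the quoted disclaimer:
# 1 vCPU, 4 GB memory, 50 GB running application storage.
KPU_VCPU = 1
KPU_MEMORY_GB = 4
KPU_STORAGE_GB = 50


def kpus_required(vcpu, memory_gb, storage_gb):
    """Smallest whole number of KPUs that covers all three needs (hypothetical helper)."""
    return max(
        1,  # assumption: an application never runs on fewer than one KPU
        math.ceil(vcpu / KPU_VCPU),
        math.ceil(memory_gb / KPU_MEMORY_GB),
        math.ceil(storage_gb / KPU_STORAGE_GB),
    )


def stranded_capacity(vcpu, memory_gb, storage_gb):
    """Capacity that is allocated but not needed, because resources only scale together."""
    kpus = kpus_required(vcpu, memory_gb, storage_gb)
    return {
        "kpus": kpus,
        "idle_vcpu": kpus * KPU_VCPU - vcpu,
        "idle_memory_gb": kpus * KPU_MEMORY_GB - memory_gb,
        "idle_storage_gb": kpus * KPU_STORAGE_GB - storage_gb,
    }


if __name__ == "__main__":
    # Illustrative state-heavy job: modest compute, 1000 GB of local state.
    print(stranded_capacity(vcpu=4, memory_gb=16, storage_gb=1000))
```

For this illustrative job, storage is the binding dimension: 1000 GB of state forces 20 KPUs, which leaves 16 vCPUs and 64 GB of memory allocated but not needed.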

Operational consequences:

  • State-heavy jobs over-provision CPU and memory just to get enough local storage (50 GB per KPU).
  • Every stop creates a savepoint by default ("this is a configurable setting in AWS Managed Flink that we had enabled"), so any scale-in or scale-out triggers a full snapshot. With large state, each scaling window lasts as long as the savepoint takes; Zalando saw 11–20 min per scaling operation.
  • Operators carry a steady overscale margin (Zalando kept parallelism 10–20% higher than normally required) to absorb the lag/restart cycle, and that margin shows up on the bill.
  • The available Flink version lags upstream. As of Feb 2026, the service only offered Flink 1.20, which does not include the MultiJoin operator (Flink 2.1, experimental). Teams that hit Table-API state amplification on managed Flink cannot wait for a version bump; they rewrite to the DataStream API by hand.
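The overscale margin and the scaling windows above are both simple arithmetic, sketched here as a rough illustration. The function names are hypothetical, the 10–20% margin and the 11–20 min window are the Zalando figures cited above, and the base parallelism of 40 and the count of 4 scaling operations are made-up inputs, not numbers from the source.

```python
from typing import Tuple


def overscaled_parallelism(required: int, margin_pct: int) -> int:
    """Parallelism after adding a percentage margin, rounded up (hypothetical helper).

    Integer arithmetic avoids floating-point surprises when rounding up.
    """
    return -(-required * (100 + margin_pct) // 100)


def scaling_window_minutes(
    operations: int, min_per_op: int = 11, max_per_op: int = 20
) -> Tuple[int, int]:
    """Best-case and worst-case minutes spent in scaling operations.

    Defaults are the 11-20 min per operation range reported by Zalando.
    """
    return operations * min_per_op, operations * max_per_op


if __name__ == "__main__":
    base = 40  # illustrative parallelism that the workload normally requires
    low = overscaled_parallelism(base, 10)
    high = overscaled_parallelism(base, 20)
    print(f"parallelism with 10-20% margin: {low} to {high}")

    best, worst = scaling_window_minutes(operations=4)
    print(f"4 scaling operations: {best} to {worst} minutes")
```

If each unit of parallelism maps to one KPU (an assumption about the application's configuration, not something stated in the source), the extra 4 to 8 parallel slots in this example are 4 to 8 additional KPUs that are paid for continuously.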

Seen in
