Flink Snapshot / Savepoint¶
A Flink savepoint (and its close sibling, the checkpoint) is a consistent, durable snapshot of all operator state in a running Flink job, written to object storage (S3 / GCS / HDFS). It is the unit of recovery after failure and of restart during scale-in/out or version changes; on AWS Managed Flink, snapshots are triggered both on a schedule and by every stop operation.
Savepoint cost model¶
Creating a savepoint requires the cluster to:
- Iterate the full RocksDB state of every stateful operator.
- Serialize the state.
- Move the serialized bytes to the configured object store.
This work is CPU-, I/O-, and network-bound and runs concurrently with normal event processing. At small state sizes (a few GB or less) it is invisible; past a threshold that depends on cluster sizing and the RocksDB working set, it becomes the dominant workload.
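For orientation, a minimal sketch of where these costs are configured in a self-managed job. On AWS Managed Flink the service owns snapshot cadence and storage, so application code does not set these; the class name, bucket path, and intervals below are illustrative, and the S3 scheme assumes the Flink S3 filesystem plugin is installed.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SnapshotCostKnobs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend; `true` enables incremental checkpoints, which
        // upload only changed SST files. Note this helps periodic checkpoints
        // only: savepoints (what Managed Flink takes on stop) are full snapshots.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Hourly snapshots, matching the cadence in the Zalando setup.
        env.enableCheckpointing(60 * 60 * 1000L);

        CheckpointConfig cfg = env.getCheckpointConfig();
        cfg.setCheckpointStorage("s3://my-bucket/checkpoints"); // hypothetical bucket
        cfg.setMinPauseBetweenCheckpoints(10 * 60 * 1000L);     // floor between snapshots
        cfg.setCheckpointTimeout(20 * 60 * 1000L);              // bound a slow, large upload

        env.fromElements(1, 2, 3).print(); // placeholder pipeline so the job is runnable
        env.execute("snapshot-cost-knobs");
    }
}
```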
Pathology at 235 GB state¶
From the Zalando post (sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions):
- CPU exhaustion. "To snapshot 235GB, Flink must iterate over the RocksDB state, serialize it, and move it to S3. This would keep the cluster's CPU at 100% for nearly 12 minutes."
- Backpressure. "Because the application was running close to the CPU limit, it couldn't process records. The lag would start getting higher and higher."
- Crash-restart loop. "Often, the Flink application would simply give up and restart. Because Flink restarts involve reloading the state from S3, we would sometimes fall behind our 1-hour SLA. By the time the app was back up, it would be almost time for the next snapshot."
- Snapshot failures. "Due to forced restarts, many snapshots just couldn't be taken. This was again making us vulnerable because of unreliable data backups."
- Scaling windows. "Every scaling operation on a Flink application involves a full job restart … time proportional to the snapshot creation — that is, 11–12, sometimes up to 20 minutes. Because of that, the parallelism for the application was constantly kept at 10–20 % higher than normally required."
This is the canonical snapshot storm: snapshot cost ≳ processing budget → lag grows → restart → state reload → another snapshot nearly due. Stretching the snapshot interval only trades one risk for another: a rarer snapshot means an older restore point and a longer catch-up after each restart, which threatens the restore-time SLO.
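Back-of-envelope, from the post's own numbers: moving 235 GB in ~12 minutes means roughly 235 GB / 720 s ≈ 330 MB/s of sustained iterate-serialize-upload work on top of normal processing, and at an hourly cadence those 12 minutes are 12/60 ≈ 20 % of wall-clock time at pinned CPU, before counting the 12–20 min restarts that push the job toward the next snapshot window.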
Direct interaction with state amplification¶
The pathology is load-bearing because state size is the lever. In Zalando's case, state was inflated by stateful-join state amplification across four chained Table-API joins. Moving to a single KeyedProcessFunction with one ValueState per SKU cut state from 235 GB to 56 GB; snapshot duration dropped from 11 min to 2.5 min, CPU from spiky 100 % to a stable ~30 %, and restart time from 12–20 min to 4–5 min.
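The post does not publish the operator code, so the following is a minimal sketch of that shape under stated assumptions: OfferUpdate, EnrichedOffer, the field names, and the merge rule are hypothetical stand-ins; only the structure (map each feed to a common envelope, union, keyBy on SKU, one ValueState per key) follows the post.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical common envelope: each source feed is mapped into this type
// before the union, so all feeds flow through one keyed operator.
class OfferUpdate implements Serializable {
    String sku;      // join key
    String feed;     // which upstream feed produced this partial update
    String payload;  // that feed's attributes, opaque here
}

// Hypothetical merged record: one ValueState entry per SKU instead of
// four chained join states, each buffering both of its inputs.
class EnrichedOffer implements Serializable {
    String sku;
    Map<String, String> parts = new HashMap<>(); // feed name -> latest payload
}

class EnrichOffers extends KeyedProcessFunction<String, OfferUpdate, EnrichedOffer> {
    private transient ValueState<EnrichedOffer> offer;

    @Override
    public void open(Configuration parameters) {
        offer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("offer", EnrichedOffer.class));
    }

    @Override
    public void processElement(OfferUpdate update, Context ctx,
                               Collector<EnrichedOffer> out) throws Exception {
        EnrichedOffer current = offer.value();
        if (current == null) {
            current = new EnrichedOffer();
            current.sku = update.sku;
        }
        // Fold the partial update in; only the merged record is kept,
        // not the raw inputs of every join stage.
        current.parts.put(update.feed, update.payload);
        offer.update(current);
        out.collect(current); // emit the latest enriched view per SKU
    }
}

// Wiring inside the job, assuming the feeds are already mapped to OfferUpdate:
//   DataStream<OfferUpdate> all = feedA.union(feedB, feedC, feedD);
//   all.keyBy(u -> u.sku).process(new EnrichOffers());
```

The state win comes from keeping only the folded record per SKU: chained joins must buffer both inputs of every stage, so the same data can be held several times over, which is exactly the amplification the savepoint then has to iterate, serialize, and upload.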
Seen in¶
- sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions — canonical instance; hourly savepoint on AWS Managed Flink 1.20 became the binding constraint on Product Offer Enrichment pipeline availability and cost.
Related¶
- systems/apache-flink — engine that implements checkpoints / savepoints.
- systems/rocksdb — default state backend; what gets iterated during a snapshot.
- systems/aws-managed-flink — the managed runtime that triggers savepoints on every stop (scale-in/out, deploy).
- systems/aws-s3 — typical target for snapshot writes.
- concepts/flink-stateful-join-state-amplification — the upstream cause when savepoint cost grows pathologically.
- concepts/kpu-aws-managed-flink — why savepoint cost directly translates to cluster sizing and cost.
- concepts/backpressure — what happens to ingestion while CPU is pinned by the snapshot.