Flink Snapshot / Savepoint

A Flink savepoint (and its close sibling, the checkpoint) is a consistent, durable snapshot of all operator state in a running Flink job, written to object storage (S3 / GCS / HDFS). It is the unit of recovery after a failure and of restart during scale-in/out or version upgrades; on AWS Managed Flink it is triggered both on a fixed schedule and by every stop operation.

Savepoint cost model

Creating a savepoint requires the cluster to:

  1. Iterate the full RocksDB state of every stateful operator.
  2. Serialize the state.
  3. Move the serialized bytes to the configured object store.

This work is CPU-, I/O-, and network-bound, and it runs concurrently with normal event processing. At small state sizes (a few GB or less) the overhead is invisible. Past a threshold that depends on cluster sizing and the RocksDB working set, snapshotting becomes the dominant workload.
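The snapshot cadence and the object-store target are ordinary job configuration. A minimal sketch in Flink's Java API, assuming the RocksDB state backend; the bucket path, interval, and pause values are placeholders, not from the source (and on AWS Managed Flink these knobs are set by the service rather than in code):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SnapshotConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps operator state on local disk; a snapshot must still
        // iterate it, serialize it, and upload the bytes (steps 1-3 above).
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints

        // Periodic snapshots every 10 minutes (placeholder interval).
        env.enableCheckpointing(10 * 60 * 1000, CheckpointingMode.EXACTLY_ONCE);

        // The "configured object store" of step 3 (placeholder bucket).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");

        // Force breathing room between snapshots so normal processing is
        // not permanently competing with snapshot work.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2 * 60 * 1000);

        // ... define sources, operators, sinks, then env.execute(...) ...
    }
}
```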

Pathology at 235 GB state

From the Zalando post (sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions):

  • CPU exhaustion. "To snapshot 235GB, Flink must iterate over the RocksDB state, serialize it, and move it to S3. This would keep the cluster's CPU at 100% for nearly 12 minutes."
  • Backpressure. "Because the application was running close to the CPU limit, it couldn't process records. The lag would start getting higher and higher."
  • Crash-restart loop. "Often, the Flink application would simply give up and restart. Because Flink restarts involve reloading the state from S3, we would sometimes fall behind our 1-hour SLA. By the time the app was back up, it would be almost time for the next snapshot."
  • Snapshot failures. "Due to forced restarts, many snapshots just couldn't be taken. This was again making us vulnerable because of unreliable data backups."
  • Scaling windows. "Every scaling operation on a Flink application involves a full job restart … time proportional to the snapshot creation — that is, 11–12, sometimes up to 20 minutes. Because of that, the parallelism for the application was constantly kept at 10–20 % higher than normally required."

This is the canonical snapshot storm: snapshot cost ≳ processing budget → lag grows → restart → state reload → another snapshot nearly due. Extending the snapshot interval does not escape the loop; it trades snapshot frequency for a staler restore point and a longer replay after failure, i.e. restore-time SLO risk.
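Back-of-envelope from the quoted figures (the two numbers are Zalando's; the division is the only thing added here):

\[
\frac{235\ \text{GB}}{12 \times 60\ \text{s}} \approx 0.33\ \text{GB/s}
\]

of sustained iterate-serialize-upload throughput, demanded on top of normal event processing. The burst size is fixed by state volume, so spacing snapshots further apart only spaces the bursts out while widening the window of data to replay after a failure.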

Direct interaction with state amplification

The pathology is load-bearing because state size is the lever: every cost above scales with the bytes that must be iterated, serialized, and uploaded. In Zalando's case, state was inflated by join-state amplification across four chained Table-API joins, each of which buffers both of its input sides. Moving to a single KeyedProcessFunction with one ValueState per SKU cut state from 235 GB to 56 GB; snapshot duration dropped from 11 min to 2.5 min, CPU from spiky 100 % to a stable ~30 %, and restart time from 12–20 min to 4–5 min.
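The post describes the replacement's shape but not its code: union the source streams, key by SKU, and fold every event into one per-key record. A minimal sketch under that description, with hypothetical record types (a (sku, source, payload) tuple in, a per-SKU map out):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: events from several unioned sources arrive as
// (sku, sourceName, payload) tuples; each one is folded into a single
// per-SKU map held in one ValueState, instead of letting chained joins
// materialize intermediate state.
public class MergeBySku
        extends KeyedProcessFunction<String, Tuple3<String, String, String>, Map<String, String>> {

    private transient ValueState<Map<String, String>> merged;

    @Override
    public void open(Configuration parameters) {
        merged = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "merged-sku",
                TypeInformation.of(new TypeHint<Map<String, String>>() {})));
    }

    @Override
    public void processElement(Tuple3<String, String, String> event,
                               Context ctx,
                               Collector<Map<String, String>> out) throws Exception {
        Map<String, String> record = merged.value();
        if (record == null) {
            record = new HashMap<>();
        }
        record.put(event.f1, event.f2); // latest payload from each source wins
        merged.update(record);          // exactly one state entry per SKU
        out.collect(record);
    }
}
```

One keyed fold holds exactly one copy of each SKU's data; chained streaming joins each keep both input sides in state, so the same SKU's data was materialized in several intermediate stores at once.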
