Flink Snapshot / Savepoint

A Flink savepoint (and its close sibling, the checkpoint) is a consistent, durable snapshot of all operator state in a running Flink job, written to object storage (S3 / GCS / HDFS). It is the unit of recovery after a failure and of restart during scale-in/out or version upgrades; on AWS Managed Flink it is triggered both on a fixed schedule and by every stop operation.

Savepoint cost model

Creating a savepoint requires the cluster to:

  1. Iterate the full RocksDB state of every stateful operator.
  2. Serialize the state.
  3. Move the serialized bytes to the configured object store.

This work is CPU-, I/O-, and network-bound, and it runs concurrently with normal event processing. At small state sizes (a few GB or less) the overhead is invisible. Past a threshold that depends on cluster sizing and the RocksDB working set, snapshotting becomes the dominant workload.
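The snapshot cadence and the object-store target are ordinary job configuration. A minimal sketch in Flink's Java API, assuming the RocksDB state backend; the bucket path, interval, and pause values are placeholders, not from the source (and on AWS Managed Flink these knobs are set by the service rather than in code):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SnapshotConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps operator state on local disk; a snapshot must still
        // iterate it, serialize it, and upload the bytes (steps 1-3 above).
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints

        // Periodic snapshots every 10 minutes (placeholder interval).
        env.enableCheckpointing(10 * 60 * 1000, CheckpointingMode.EXACTLY_ONCE);

        // The "configured object store" of step 3 (placeholder bucket).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");

        // Force breathing room between snapshots so normal processing is
        // not permanently competing with snapshot work.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2 * 60 * 1000);

        // ... define sources, operators, sinks, then env.execute(...) ...
    }
}
```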

Pathology at 235 GB state

From the Zalando post (sources/2026-03-03-zalando-why-we-ditched-flink-table-api-joins-cutting-state-by-75-with-datastream-unions):

  • CPU exhaustion. "To snapshot 235GB, Flink must iterate over the RocksDB state, serialize it, and move it to S3. This would keep the cluster's CPU at 100% for nearly 12 minutes."
  • Backpressure. "Because the application was running close to the CPU limit, it couldn't process records. The lag would start getting higher and higher."
  • Crash-restart loop. "Often, the Flink application would simply give up and restart. Because Flink restarts involve reloading the state from S3, we would sometimes fall behind our 1-hour SLA. By the time the app was back up, it would be almost time for the next snapshot."
  • Snapshot failures. "Due to forced restarts, many snapshots just couldn't be taken. This was again making us vulnerable because of unreliable data backups."
  • Scaling windows. "Every scaling operation on a Flink application involves a full job restart … time proportional to the snapshot creation — that is, 11–12, sometimes up to 20 minutes. Because of that, the parallelism for the application was constantly kept at 10–20 % higher than normally required."

This is the canonical snapshot storm: snapshot cost ≳ processing budget → lag grows → restart → state reload → another snapshot nearly due. Extending the snapshot interval does not escape the loop; it trades snapshot frequency for a staler restore point and a longer replay after failure, i.e. restore-time SLO risk.
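Back-of-envelope from the quoted figures (the two numbers are Zalando's; the division is the only thing added here):

\[
\frac{235\ \text{GB}}{12 \times 60\ \text{s}} \approx 0.33\ \text{GB/s}
\]

of sustained iterate-serialize-upload throughput, demanded on top of normal event processing. The burst size is fixed by state volume, so spacing snapshots further apart only spaces the bursts out while widening the window of data to replay after a failure.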

Direct interaction with state amplification

The pathology is load-bearing because state size is the lever: every cost above scales with the bytes that must be iterated, serialized, and uploaded. In Zalando's case, state was inflated by join-state amplification across four chained Table-API joins, each of which buffers both of its input sides. Moving to a single KeyedProcessFunction with one ValueState per SKU cut state from 235 GB to 56 GB; snapshot duration dropped from 11 min to 2.5 min, CPU from spiky 100 % to a stable ~30 %, and restart time from 12–20 min to 4–5 min.
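The post describes the replacement's shape but not its code: union the source streams, key by SKU, and fold every event into one per-key record. A minimal sketch under that description, with hypothetical record types (a (sku, source, payload) tuple in, a per-SKU map out):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: events from several unioned sources arrive as
// (sku, sourceName, payload) tuples; each one is folded into a single
// per-SKU map held in one ValueState, instead of letting chained joins
// materialize intermediate state.
public class MergeBySku
        extends KeyedProcessFunction<String, Tuple3<String, String, String>, Map<String, String>> {

    private transient ValueState<Map<String, String>> merged;

    @Override
    public void open(Configuration parameters) {
        merged = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "merged-sku",
                TypeInformation.of(new TypeHint<Map<String, String>>() {})));
    }

    @Override
    public void processElement(Tuple3<String, String, String> event,
                               Context ctx,
                               Collector<Map<String, String>> out) throws Exception {
        Map<String, String> record = merged.value();
        if (record == null) {
            record = new HashMap<>();
        }
        record.put(event.f1, event.f2); // latest payload from each source wins
        merged.update(record);          // exactly one state entry per SKU
        out.collect(record);
    }
}
```

One keyed fold holds exactly one copy of each SKU's data; chained streaming joins each keep both input sides in state, so the same SKU's data was materialized in several intermediate stores at once.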
