CONCEPT Cited by 1 source
Checkpointed automation script¶
Definition¶
A checkpointed automation script is an upgrade / migration driver that runs as a sequence of discrete steps, each of which can either:
- Auto-proceed to the next step, or
- Pause and wait for explicit human confirmation before proceeding.
The mode is per-cluster (or per-run), not hard-coded — the same script drives a low-risk cluster end-to-end automatically and a high-risk cluster one-step-at-a-time with engineer approval gates.
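A minimal sketch of the definition above, not Yelp's actual script: the `Step` type, `run_steps`, and the `ask_operator` prompt are hypothetical names introduced for illustration. It shows one step list driven in either mode, with the mode passed in per run rather than baked into the steps.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Step:
    name: str                   # label shown to the operator at the checkpoint
    action: Callable[[], None]  # the work this step performs


def ask_operator(prompt: str) -> bool:
    """Hypothetical confirmation prompt: only an explicit 'y' proceeds."""
    return input(prompt).strip().lower() == "y"


def run_steps(
    steps: Iterable[Step],
    auto_proceed: bool,
    confirm: Callable[[str], bool] = ask_operator,
) -> List[str]:
    """Run steps in order and return the names of the steps that completed.

    With auto_proceed=False the runner pauses after every step (except the
    last) and stops at that checkpoint unless the operator confirms.
    """
    steps = list(steps)
    completed: List[str] = []
    for index, step in enumerate(steps):
        step.action()
        completed.append(step.name)
        is_last = index == len(steps) - 1
        if is_last or auto_proceed:
            continue
        if not confirm(f"Step '{step.name}' finished. Proceed? [y/N] "):
            break  # operator declined: stop at this checkpoint
    return completed
```

Calling `run_steps(steps, auto_proceed=True)` would suit a low-risk cluster, and `auto_proceed=False` a critical one. The step definitions are identical in both cases.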
Why this shape¶
The first few runs of a new migration script are where confidence is lowest and the consequences of a bug are highest. Full automation from day one is premature optimisation; fully manual forever is wasted engineering time. Checkpointed automation lets the operator match the degree of automation to their confidence in the script: confirmation mode early, auto-proceed once the same steps have succeeded repeatedly.
Secondary benefit: the checkpoints themselves document the step boundaries — where recovery starts from if something fails — in a form the operator can read without reconstructing the script's state machine.
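One way to make "where recovery starts from" concrete is to record each completed step and skip it on the next run. The source does not say the Yelp script persists state this way; the JSON state file and the `run_with_resume` helper below are assumptions made purely for illustration.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple


def run_with_resume(
    steps: List[Tuple[str, Callable[[], None]]],
    state_file: Path,
) -> List[str]:
    """Run steps in order, skipping any already recorded in state_file.

    Returns the names of the steps executed during this call.
    """
    done: List[str] = json.loads(state_file.read_text()) if state_file.exists() else []
    executed: List[str] = []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier run
        action()  # an exception here leaves state_file at the last good step
        done.append(name)
        state_file.write_text(json.dumps(done))  # record the checkpoint
        executed.append(name)
    return executed
```

If a step raises, the state file still names every step that finished, so a rerun resumes at the failed step instead of repeating the earlier ones.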
What the steps typically do¶
In Yelp's Cassandra upgrade script, the steps span multiple external systems:
- Run kubectl commands to cordon / uncordon nodes and patch the pod image environment variables.
- Run CLI commands against the Cassandra cluster (nodetool drain, schema-version checks, backup verification).
- Open pull requests to change manifests / configs.
- Probe dashboards for per-keyspace p99 latency / errors before proceeding to the next step.
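The step kinds listed above can be sketched as small functions that a step runner would call. This is an illustrative sketch only: the helper names, the node name, the latency threshold, and the dry-run default are all hypothetical. How the real script invokes `nodetool` (for example, from inside the Cassandra pod) and how it queries dashboards is not described in the source.

```python
import shlex
import subprocess
from typing import Dict, List, Sequence


def cordon_cmd(node: str) -> List[str]:
    # `kubectl cordon <node>` marks the node unschedulable.
    return ["kubectl", "cordon", node]


def uncordon_cmd(node: str) -> List[str]:
    # `kubectl uncordon <node>` makes the node schedulable again.
    return ["kubectl", "uncordon", node]


def drain_cmd() -> List[str]:
    # `nodetool drain` flushes memtables and stops the node accepting writes.
    return ["nodetool", "drain"]


def latency_ok(p99_ms_by_keyspace: Dict[str, float], limit_ms: float) -> bool:
    """Gate: True only if every keyspace's p99 is at or under the limit.

    The limit and the source of the measurements are assumptions.
    """
    return all(p99 <= limit_ms for p99 in p99_ms_by_keyspace.values())


def run(cmd: Sequence[str], dry_run: bool = True) -> str:
    """Execute cmd, or only describe it when dry_run is set (the default)."""
    line = shlex.join(cmd)
    if dry_run:
        return f"DRY-RUN {line}"
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return f"RAN {line}"
```

Keeping command construction separate from execution means each step can be previewed with `run(cordon_cmd("node-1"))` before anything touches the cluster.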
Seen in¶
- sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. "[We] implemented it as a script that executes various kubectl and CLI commands, creates pull requests, and performs other workflow steps. The script can run in auto-proceed mode or pause for confirmation from an engineer after each step, which was particularly useful when upgrading critical clusters."
Related¶
- patterns/pre-flight-flight-post-flight-upgrade-stages — the upgrade shape the script drives.
- concepts/rolling-upgrade — the upgrade idiom.
- systems/apache-cassandra — canonical context.