
CONCEPT Cited by 1 source

Checkpointed automation script

Definition

A checkpointed automation script is an upgrade / migration driver that runs as a sequence of discrete steps, each of which can either:

  • Auto-proceed to the next step, or
  • Pause and wait for explicit human confirmation before proceeding.

The mode is chosen per cluster (or per run) rather than hard-coded: the same script can drive a low-risk cluster end to end automatically and a high-risk cluster one step at a time, with an engineer approving each step before the next one starts.
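As a minimal sketch of this shape (not Yelp's actual script), the runner below takes the mode as a per-run argument. The names `Step`, `run_steps`, and `ask_on_terminal` are hypothetical, and the confirmation prompt is injectable so the same loop can be exercised without a terminal.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One discrete unit of work in the script (hypothetical type)."""
    name: str
    action: Callable[[], None]


def ask_on_terminal(step_name: str) -> bool:
    """Ask the engineer whether to continue after the named step."""
    answer = input(f"Step '{step_name}' finished. Proceed? [y/N] ")
    return answer.strip().lower() == "y"


def run_steps(
    steps: Sequence[Step],
    auto_proceed: bool,
    confirm: Callable[[str], bool] = ask_on_terminal,
) -> List[str]:
    """Run steps in order and return the names of the steps that ran.

    With auto_proceed=True every step runs back to back. Otherwise the
    runner pauses after each step (except the last) and stops if the
    engineer does not confirm.
    """
    completed: List[str] = []
    for index, step in enumerate(steps):
        step.action()
        completed.append(step.name)
        is_last = index == len(steps) - 1
        if not auto_proceed and not is_last and not confirm(step.name):
            break
    return completed
```

Under these assumptions, a low-risk cluster would be run with `run_steps(steps, auto_proceed=True)` and a critical one with `run_steps(steps, auto_proceed=False)`; the list of steps is identical in both cases.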

Why this shape

The first few runs of a new migration script are when confidence in it is lowest and an undetected bug is most costly. Full automation from day one is premature optimisation; staying fully manual forever wastes engineering time. Checkpointed automation lets the operator match the level of risk to the level of confidence: confirmation mode for early runs, auto-proceed once the same steps have succeeded repeatedly.

Secondary benefit: the checkpoints themselves document the step boundaries — where recovery starts from if something fails — in a form the operator can read without reconstructing the script's state machine.
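One way to make the step boundaries double as recovery points is to record each completed step, so that a rerun after a failure starts at the first step that has not finished. The source does not say whether Yelp's script persists state this way; the sketch below is an assumption, and `run_with_checkpoints` and the JSON state file are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

NamedStep = Tuple[str, Callable[[], None]]


def run_with_checkpoints(steps: Sequence[NamedStep], state_file: Path) -> List[str]:
    """Run the steps not yet recorded in state_file.

    Returns the names of the steps executed during this call. If a step
    raises, the exception propagates and state_file still lists only the
    steps that completed, so the next call resumes at the failed step.
    """
    done: List[str] = []
    if state_file.exists():
        done = json.loads(state_file.read_text())

    executed: List[str] = []
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run; skip it
        action()
        done.append(name)
        # Record the checkpoint only after the step has succeeded.
        state_file.write_text(json.dumps(done))
        executed.append(name)
    return executed
```

The state file is also readable by the operator: it lists, by name, exactly which steps have completed.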

What the steps typically do

In Yelp's Cassandra upgrade script the steps span multiple external systems:

  • Run kubectl commands to cordon / uncordon nodes and patch the pod image environment variables.
  • Run CLI commands against the Cassandra cluster (nodetool drain, schema-version checks, backup verification).
  • Open pull requests to change manifests / configs.
  • Probe dashboards for per-keyspace p99 latency / errors before proceeding to the next step.
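The last bullet describes a health gate between steps. The sketch below shows one way such a gate could wrap external commands. It is illustrative only: the command order, the node name, the 20 ms limit, and the `fetch_p99_ms` callback (standing in for whatever the dashboard probe returns) are all assumptions, not details from the source.

```python
import subprocess
from typing import Callable, Dict, List, Sequence, Tuple

Command = Tuple[str, Sequence[str]]  # (step name, argv)


def run_command(argv: Sequence[str]) -> None:
    """Run an external command; raises CalledProcessError on non-zero exit."""
    subprocess.run(list(argv), check=True)


def slow_keyspaces(p99_ms: Dict[str, float], limit_ms: float) -> List[str]:
    """Return the keyspaces whose p99 latency exceeds limit_ms, sorted."""
    return sorted(ks for ks, value in p99_ms.items() if value > limit_ms)


def run_gated(
    commands: Sequence[Command],
    fetch_p99_ms: Callable[[], Dict[str, float]],
    limit_ms: float,
    run: Callable[[Sequence[str]], None] = run_command,
) -> List[str]:
    """Run each command, then probe latency before moving to the next one.

    Raises RuntimeError at the first unhealthy probe; otherwise returns
    the names of all steps, in order.
    """
    completed: List[str] = []
    for name, argv in commands:
        run(argv)
        bad = slow_keyspaces(fetch_p99_ms(), limit_ms)
        if bad:
            raise RuntimeError(f"after {name!r}: p99 above {limit_ms} ms in {bad}")
        completed.append(name)
    return completed


# Hypothetical step list; "node-1" is a placeholder node name.
EXAMPLE_COMMANDS: List[Command] = [
    ("cordon", ["kubectl", "cordon", "node-1"]),
    ("drain", ["nodetool", "drain"]),
    ("uncordon", ["kubectl", "uncordon", "node-1"]),
]
```

Passing the command runner and the metrics probe as parameters keeps the gate logic testable without a live cluster; a real script would supply a probe that queries its own monitoring system.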

Seen in

  • sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. "[We] implemented it as a script that executes various kubectl and CLI commands, creates pull requests, and performs other workflow steps. The script can run in auto-proceed mode or pause for confirmation from an engineer after each step, which was particularly useful when upgrading critical clusters."