Skip to content

CONCEPT Cited by 1 source

Fault-tolerant long-running workflow

Definition

A fault-tolerant long-running workflow is any workflow whose expected execution duration is long enough that the probability of encountering some component failure over its run is non-negligible — and whose design therefore persists durable checkpoints at every decision point so that restart from the last successful checkpoint is the recovery path for any failure.

The definition is not about the specific failure rate of the infrastructure; it's about the ratio between expected run-time and the mean-time-between-failures of the substrate:

P(at least one failure during run) ≈ 1 - exp(-run_time / MTBF)

Once that probability is no longer negligible, fault tolerance stops being a nice-to-have and becomes a correctness requirement. A workflow that can't restart from a checkpoint in the middle of its run will fail overall with near certainty over enough runs — the alternative is restarting from scratch, which at scale is equivalent to never completing.

Architectural properties

  • State at every decision point is durable. Every logical progress point (last-copied key, last-applied GTID, VDiff progress, workflow lock state) is persisted in durable storage before the next step begins.
  • Restart is resumption, not retry. On failure, the workflow resumes from the persisted state — it does not repeat already-completed work.
  • Idempotency at step level. Individual steps can be run multiple times without corruption if the workflow restarts partway through one.
  • Bounded progress per step. Steps are designed small enough that re-running one is cheap.
  • Observability of state. An operator can inspect the workflow's current state at any moment — crucial for debugging a workflow mid-run.

Why this matters at petabyte scale

At petabyte scale, even a "reliable" fleet will experience some failure during any multi-day workflow — a tablet rebooted for kernel patching, a network hiccup across a region boundary, a disk-full incident, a schema-change contention. The 2026-02-16 PlanetScale post states this explicitly as the reason VReplication, VDiff, and MoveTables are all designed the way they are:

All of this work is done in a fault-tolerant way. This means that anything can fail throughout this process and the system will be able to recover and continue where it left off. This is critical for data imports at a certain scale where things can take many hours, days, or even weeks to complete and the likelihood of encountering some type of error — even an ephemeral network or connection related error across the fleet of processes involved in the migration — becomes increasingly likely.

(Source: .)

Concrete checkpoints in Vitess migrations

Every state-bearing decision point in a Vitess VReplication workflow persists state in sidecar tables:

  • copy_state — per-stream per-table last-copied key position. On restart, the copy resumes from exactly this key.
  • vreplication — per-stream workflow metadata + advancing GTID position. On restart, replication resumes from the exact GTID.
  • VDiff sidecar tables — per-table diff progress. A failed VDiff resumes from where it left off, not from scratch.
  • Topology-server workflow state — workflow locks, freeze/active markers, routing rules. Survives tablet restart.

Seen in

  • sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-enginethird canonical substrate for the fault-tolerant-long- running-workflow shape, at the embedded-library-in- service altitude: Airbnb's Skipper handles multi-hour and multi-day workflows (insurance claim processing, policy lifecycle, Flink-job lifecycle, video processing pipelines, scheduled financial operations spanning days or weeks) by persisting state fields + checkpointed action results directly in the host service's database (MySQL / UDS / DynamoDB) and replaying the workflow method on recovery. Same architectural discipline as Vitess VReplication and Temporal event history, at a third substrate altitude. At peak, 10 000 workflows / second on DynamoDB. "Across all domains, Skipper guarantees that every workflow reaches a terminal state, even through infrastructure failures, deployments, and infrastructure disruptions."

  • second canonical substrate for the fault-tolerant-long-running- workflow shape: Temporal's event history (append-only log) is the durable checkpoint substrate; workflow rehydration on crash = event-history replay + activity dedup + catch-up. Longoria 2022-07-22: "Temporal captures the progress of a workflow execution (or workflow steps) in a log called the history. In case of a crash, Temporal rehydrates the workflow." Different substrate from Lord's Vitess VReplication framing (per-stream copy-state + vreplication + VDiff sidecar tables), but identical architectural discipline: every decision point is persisted in durable storage before the next step begins, and restart is resumption not retry. The two sources bracket the data-plane (Vitess zero-downtime migrations) and control- plane (Temporal workflow orchestration) altitudes of the same scale-derived correctness framing.

  • — canonical wiki statement that fault tolerance is a load-bearing property for workflows whose duration approaches fleet MTBF. Matt Lord's framing promotes fault-tolerance from "an availability feature" to "a correctness requirement at this scale." Vitess's VReplication, VDiff, and MoveTables workflows are all designed with durable checkpoints at every decision point so restart-from-checkpoint is always the recovery path. The post is the first canonical wiki statement of this scale-derived correctness framing — concurrent with, and consistent with, the already-canonicalised Netflix Maestro framing of long-running workflow orchestration but at a different substrate (database data motion vs scheduled-job DAG).

Last updated · 542 distilled / 1,571 read