
PATTERN Cited by 1 source

Workflow step breakpoint

Summary

Add an IDE-style pause-at-step primitive to a workflow orchestrator. Users set breakpoints on specific steps; when a workflow instance reaches a breakpoint, the step enters a paused state, halting the workflow's progression. An operator can inspect step state, optionally mutate it, and resume on a per-instance basis.

Canonical wiki instance: Netflix Maestro's breakpoint primitive (Source: sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator).

Problem

Debugging workflows historically falls back to teardown + re-run:

  • A bug is discovered mid-run in a long ETL → stop workflow, clear state, fix, re-run from scratch.
  • A foreach iteration produces wrong output for specific inputs → no way to pause just the bad iterations for inspection.
  • In-flight state has drifted + needs manual correction → no supported mutation path; operators resort to database surgery.

All three reduce operator productivity and put workflow correctness at risk.

Solution

Treat the workflow orchestrator like a programmable IDE debugger with three properties:

1. Per-step breakpoints

"Maestro allows users to set breakpoints on workflow steps, functioning similarly to code-level breakpoints in an IDE." (Source: sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator)

Breakpoints are attached to step definitions. When any instance reaches that step, execution pauses.
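
A minimal sketch of the idea that a breakpoint belongs to a step definition, so every instance reaching that step pauses. The names `BreakpointRegistry`, `set_breakpoint`, and `on_step_reached` are hypothetical and are not Maestro's API; the source describes the behaviour, not the interface.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    RUNNING = "running"
    PAUSED = "paused"


@dataclass
class BreakpointRegistry:
    """Hypothetical registry: breakpoints are keyed by step definition,
    not by an individual step instance."""
    _breakpoints: set = field(default_factory=set)

    def set_breakpoint(self, workflow_id: str, step_id: str) -> None:
        self._breakpoints.add((workflow_id, step_id))

    def on_step_reached(self, workflow_id: str, step_id: str) -> StepStatus:
        # Assumed hook: the engine calls this just before starting a step instance.
        if (workflow_id, step_id) in self._breakpoints:
            return StepStatus.PAUSED
        return StepStatus.RUNNING


registry = BreakpointRegistry()
registry.set_breakpoint("daily_etl", "transform")

# Steps without a breakpoint run normally; the marked step pauses.
extract_status = registry.on_step_reached("daily_etl", "extract")
transform_status = registry.on_step_reached("daily_etl", "transform")
```

Keying on the definition rather than the instance is what lets one breakpoint catch every future run of the workflow without the operator knowing instance IDs in advance.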

2. Per-instance resume

"If multiple instances of a workflow step are paused at a breakpoint, resuming one instance will only affect that specific instance, leaving the others in a paused state. Deleting the breakpoint will cause all paused step instances to resume."

Fine-grained resume lets an operator inspect and release paused instances one at a time; deleting the breakpoint releases everything still paused once the underlying issue has been triaged.
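
The two release paths in the quote can be sketched as follows. This is an illustrative model only: `PausedSteps`, `resume`, and `delete_breakpoint` are hypothetical names, and the instance IDs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class PausedSteps:
    """Hypothetical model of the step instances paused at one breakpoint."""
    paused: set = field(default_factory=set)
    resumed: list = field(default_factory=list)

    def pause(self, instance_id: str) -> None:
        self.paused.add(instance_id)

    def resume(self, instance_id: str) -> None:
        # Resuming affects only the named instance; the rest stay paused.
        self.paused.remove(instance_id)
        self.resumed.append(instance_id)

    def delete_breakpoint(self) -> None:
        # Deleting the breakpoint releases every instance still paused at it.
        for instance_id in sorted(self.paused):
            self.resumed.append(instance_id)
        self.paused.clear()


bp = PausedSteps()
for instance_id in ("run-1", "run-2", "run-3"):
    bp.pause(instance_id)

bp.resume("run-2")                 # only run-2 continues
still_paused = sorted(bp.paused)   # run-1 and run-3 remain paused

bp.delete_breakpoint()             # everything left is released
```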

3. Foreach-aware + state-mutable

"Setting a single breakpoint on a step will cause all iterations of the foreach loop to pause at that step for debugging purposes. Additionally, the breakpoint feature allows human intervention during the workflow execution and can also be used for other purposes, e.g. supporting mutating step states while the workflow is running."

  • One breakpoint fans out across a foreach's parallel instances.
  • The paused state allows in-flight state mutation — an operator can fix a bad parameter / correct drift / adjust an intermediate output, then resume.
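
A rough sketch of both bullets together: one breakpoint pauses every foreach iteration, and a paused iteration can have its parameters corrected before it resumes. `StepInstance`, `run_foreach`, and `mutate_and_resume` are hypothetical names, and the status strings are assumptions rather than Maestro's actual step states.

```python
from dataclasses import dataclass


@dataclass
class StepInstance:
    iteration: int
    params: dict
    status: str = "PENDING"


def run_foreach(items, breakpoint_set: bool):
    """Create one step instance per item; pause each one if the breakpoint is set."""
    instances = []
    for i, item in enumerate(items):
        inst = StepInstance(iteration=i, params={"value": item})
        inst.status = "PAUSED" if breakpoint_set else "RUNNING"
        instances.append(inst)
    return instances


def mutate_and_resume(inst: StepInstance, **changes) -> None:
    """Apply operator edits to a paused instance, then let it continue."""
    if inst.status != "PAUSED":
        raise ValueError("only paused step instances can be mutated")
    inst.params.update(changes)
    inst.status = "RUNNING"


# A single breakpoint pauses all three iterations.
instances = run_foreach([10, -1, 30], breakpoint_set=True)

# The operator corrects the bad parameter on iteration 1 and resumes only that one.
mutate_and_resume(instances[1], value=20)
```

Refusing mutation on anything that is not paused keeps the edit from racing with a running step, which is one reason the paused state is a natural place to hang a mutation path.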

Canonical uses

Workflow development

Pause at each step during initial implementation, inspect parameters + outputs, iterate on the step logic without re-running the whole workflow every time.

Foreach iteration debugging

Consider a foreach with 1000 iterations where 17 fail for specific inputs. Set a breakpoint on the step inside the foreach loop; all 1000 iterations pause at it; the operator reviews the failing inputs + decides per iteration what to do (resume, skip, mutate-and-resume).

Production state correction

An ETL workflow has a bad intermediate value partway through a multi-hour run. Set a breakpoint on the next step; pause; correct the intermediate; resume. Avoids teardown-and-restart.

Manual-approval gates

Although not the primary framing in the post, breakpoints can double as a lightweight approval gate — pause before a critical step, wait for a human to confirm, resume.

Trade-offs

| Axis | Win | Cost |
|---|---|---|
| Debugging velocity | IDE-like iteration on running workflows | Requires persistent resumable step state in the engine |
| Production safety | Avoids teardown-and-restart for state correction | State mutation is a very sharp tool; requires audit |
| Foreach ergonomics | One breakpoint → all iterations pause | Could inadvertently stall large fan-outs |
| Tenant isolation | Per-instance resume | More complex resume state management in the engine |

Prerequisites in the orchestrator

  • Persistent + resumable step-runtime state — which Maestro already has for retry / restart support.
  • Cooperative step runtime — steps check-in with the engine and honour pause signals.
  • Per-instance granularity — pausing one tenant's instance doesn't stall unrelated tenants.
  • State-mutation API — gated + audited path to modify paused-step state.
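
The cooperative-runtime and audited-mutation prerequisites can be illustrated with a toy engine, assuming the step is written as a Python generator whose `yield` points act as check-ins. Everything here (`Engine`, `advance`, `mutate`, the audit tuple layout) is a hypothetical sketch and not how Maestro implements it.

```python
def step_body(state):
    """A step that checks in with the engine at each yield."""
    state["rows"] = 100
    yield "loaded"                      # check-in point
    state["rows_out"] = state["rows"] * 2
    yield "transformed"                 # check-in point


class Engine:
    def __init__(self):
        self.paused = False
        self.audit_log = []

    def advance(self, step):
        """Run the step to its next check-in unless a pause is requested."""
        if self.paused:
            return None                 # pause honoured: the step does not progress
        return next(step, "done")

    def mutate(self, state, key, value, operator):
        """Gated mutation: allowed only while paused, and always recorded."""
        if not self.paused:
            raise RuntimeError("state can only be mutated while paused")
        self.audit_log.append((operator, key, state.get(key), value))
        state[key] = value


state = {}
engine = Engine()
step = step_body(state)

first = engine.advance(step)            # runs up to the first check-in
engine.paused = True
blocked = engine.advance(step)          # no progress while paused
engine.mutate(state, "rows", 50, operator="alice")
engine.paused = False
second = engine.advance(step)           # continues with the corrected value
```

Because the step body only advances when the engine asks it to, the pause never interrupts work mid-statement, and the audit log records who changed which key along with the old and new values.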

Industry positioning

This pattern is rare. Most workflow orchestrators treat debugging as a read-only, after-the-fact activity:

  • Airflow — clear + re-run tasks; no pause-and-inspect.
  • Step Functions — no breakpoints; workflow re-run from specific state via API.
  • Argo — pause at workflow level, not step level.
  • Temporal — replay debugging via event history (different shape — replay, not live pause).

Maestro's per-step live-pause-and-mutate primitive is what sets it apart from these.

Seen in
