PATTERN Cited by 1 source
Workflow step breakpoint¶
Summary¶
Add an IDE-style pause-at-step primitive to a workflow orchestrator. Users set breakpoints on specific steps; when a workflow instance reaches a breakpoint, the step enters a paused state, halting the workflow's progression. An operator can inspect step state, optionally mutate it, and resume on a per-instance basis.
Canonical wiki instance: Netflix Maestro's breakpoint primitive (Source: sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator).
Problem¶
Debugging workflows historically falls back to teardown + re-run:
- A bug is discovered mid-run in a long ETL → stop workflow, clear state, fix, re-run from scratch.
- A foreach iteration produces wrong output for specific inputs → no way to pause just the bad iterations for inspection.
- In-flight state has drifted + needs manual correction → no supported mutation path; operators resort to database surgery.
All three reduce operator productivity and put workflow correctness at risk.
Solution¶
Treat the workflow orchestrator like a programmable IDE debugger with three properties:
1. Per-step breakpoints¶
"Maestro allows users to set breakpoints on workflow steps, functioning similarly to code-level breakpoints in an IDE." (Source: sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator)
Breakpoints are attached to step definitions. When any instance reaches that step, execution pauses.
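To make "attached to step definitions" concrete, here is a minimal in-memory sketch. It is not Maestro's API: the `Engine` class, its methods, and the `daily_etl` / `transform` identifiers are hypothetical names chosen for illustration. The point is only that the breakpoint is keyed by workflow and step, so every instance that reaches the step is affected.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    RUNNING = "running"
    PAUSED = "paused"


@dataclass
class Engine:
    # Hypothetical registry: breakpoints are keyed by (workflow_id, step_id),
    # i.e. attached to the step definition rather than to any single run.
    breakpoints: set[tuple[str, str]] = field(default_factory=set)

    def set_breakpoint(self, workflow_id: str, step_id: str) -> None:
        self.breakpoints.add((workflow_id, step_id))

    def enter_step(self, workflow_id: str, step_id: str) -> StepStatus:
        """Called whenever any workflow instance reaches a step."""
        if (workflow_id, step_id) in self.breakpoints:
            return StepStatus.PAUSED
        return StepStatus.RUNNING


engine = Engine()
engine.set_breakpoint("daily_etl", "transform")

# Two separate runs of the same workflow both pause at the marked step,
# while a step without a breakpoint runs straight through.
statuses = [engine.enter_step("daily_etl", "transform") for _run in range(2)]
other = engine.enter_step("daily_etl", "load")
```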
2. Per-instance resume¶
"If multiple instances of a workflow step are paused at a breakpoint, resuming one instance will only affect that specific instance, leaving the others in a paused state. Deleting the breakpoint will cause all paused step instances to resume."
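Per-instance resume lets an operator release one instance after inspecting it while the others stay paused for further triage. Deleting the breakpoint is the bulk release for everything still waiting.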
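The two release paths described in the quote can be sketched as follows. This is a toy model under stated assumptions: `BreakpointEngine`, its method names, and the `run-N` instance ids are hypothetical, and a real engine would persist this state rather than hold it in a dictionary.

```python
class BreakpointEngine:
    """Toy model: tracks which step instances are paused at which breakpoint."""

    def __init__(self) -> None:
        # step_id -> ids of instances currently paused there (hypothetical layout)
        self.paused: dict[str, set[str]] = {}

    def set_breakpoint(self, step_id: str) -> None:
        self.paused.setdefault(step_id, set())

    def reach(self, step_id: str, instance_id: str) -> bool:
        """Return True if the instance paused, False if it ran straight through."""
        if step_id in self.paused:
            self.paused[step_id].add(instance_id)
            return True
        return False

    def resume(self, step_id: str, instance_id: str) -> None:
        """Resume exactly one paused instance; the rest stay paused.

        Raises KeyError if that instance is not paused at the step.
        """
        self.paused[step_id].remove(instance_id)

    def delete_breakpoint(self, step_id: str) -> set[str]:
        """Remove the breakpoint and return every instance it releases."""
        return self.paused.pop(step_id, set())


engine = BreakpointEngine()
engine.set_breakpoint("transform")
for run in ("run-1", "run-2", "run-3"):
    engine.reach("transform", run)

engine.resume("transform", "run-2")               # only run-2 continues
still_paused = set(engine.paused["transform"])    # snapshot of the remainder
released = engine.delete_breakpoint("transform")  # bulk release of the rest
later = engine.reach("transform", "run-4")        # breakpoint is gone
```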
3. Foreach-aware + state-mutable¶
"Setting a single breakpoint on a step will cause all iterations of the foreach loop to pause at that step for debugging purposes. Additionally, the breakpoint feature allows human intervention during the workflow execution and can also be used for other purposes, e.g. supporting mutating step states while the workflow is running."
- One breakpoint fans out across a foreach's parallel instances.
- The paused state allows in-flight state mutation — an operator can fix a bad parameter / correct drift / adjust an intermediate output, then resume.
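A hedged sketch of both bullets together: one breakpoint pauses every iteration, and a paused iteration can have a parameter corrected before it resumes. The `ForeachRun` and `StepInstance` classes, the status strings, and the audit-log tuple layout are all assumptions made for this example; restricting mutation to paused steps is likewise a design choice of the sketch, not something the source specifies.

```python
from dataclasses import dataclass, field


@dataclass
class StepInstance:
    iteration: int
    params: dict
    status: str = "PENDING"  # PENDING -> PAUSED -> DONE


@dataclass
class ForeachRun:
    breakpoint_set: bool
    instances: list[StepInstance]
    audit_log: list[tuple] = field(default_factory=list)

    def start(self) -> None:
        # One breakpoint on the step pauses every iteration of the loop.
        for inst in self.instances:
            inst.status = "PAUSED" if self.breakpoint_set else "DONE"

    def mutate(self, iteration: int, key: str, value, operator: str) -> None:
        inst = self.instances[iteration]
        if inst.status != "PAUSED":
            raise RuntimeError("state can only be mutated while the step is paused")
        # Record who changed what, including the old value, before applying it.
        self.audit_log.append((operator, iteration, key, inst.params.get(key), value))
        inst.params[key] = value

    def resume(self, iteration: int) -> None:
        inst = self.instances[iteration]
        if inst.status == "PAUSED":
            inst.status = "DONE"


run = ForeachRun(
    breakpoint_set=True,
    instances=[StepInstance(i, {"date": f"2024-07-0{i + 1}"}) for i in range(3)],
)
run.start()                                           # all three iterations pause
run.mutate(1, "date", "2024-07-22", operator="alice") # fix one bad parameter
run.resume(1)                                         # release only that iteration
```

Writing the audit entry inside `mutate` keeps the mutation path and its audit trail in one place, which is the kind of gating the trade-off table calls for.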
Canonical uses¶
Workflow development¶
Pause at each step during initial implementation, inspect parameters + outputs, iterate on the step logic without re-running the whole workflow every time.
Foreach iteration debugging¶
A foreach with 1000 iterations where 17 fail for specific inputs: set a breakpoint on the failing step inside the loop; all 1000 iterations pause at that step; the operator reviews the failing inputs and decides per iteration what to do (resume, skip, mutate-and-resume).
Production state correction¶
An ETL workflow has a bad intermediate value partway through a multi-hour run. Set a breakpoint on the next step, let the run pause there, correct the intermediate value, and resume. This avoids a teardown and restart.
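The sequence above can be sketched as a tiny sequential runner. Everything here is illustrative: the `Run` class, the `ctx` dictionary standing in for persisted intermediate state, and the `extract` / `load` steps are hypothetical. What the sketch shows is that the upstream step executes once, and only the corrected value changes the downstream result.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], None]]


class Run:
    def __init__(self, steps: list[Step], breakpoints: set[str]) -> None:
        self.steps = steps
        self.breakpoints = breakpoints
        self.ctx: dict = {}            # stand-in for persisted intermediate state
        self.cursor = 0                # index of the next step to execute
        self.executed: list[str] = []

    def advance(self, resuming: bool = False) -> str:
        """Run until the next breakpoint or the end of the workflow."""
        while self.cursor < len(self.steps):
            name, fn = self.steps[self.cursor]
            if name in self.breakpoints and not resuming:
                return f"PAUSED at {name}"
            resuming = False  # only pass through the breakpoint being resumed
            fn(self.ctx)
            self.executed.append(name)
            self.cursor += 1
        return "SUCCEEDED"


def extract(ctx: dict) -> None:
    ctx["rows"] = [5, 12, 40]
    ctx["threshold"] = 100  # bad intermediate value: would filter out every row


def load(ctx: dict) -> None:
    ctx["loaded"] = [r for r in ctx["rows"] if r >= ctx["threshold"]]


run = Run([("extract", extract), ("load", load)], breakpoints={"load"})
first = run.advance()               # stops before "load"
run.ctx["threshold"] = 10           # operator corrects the intermediate value
final = run.advance(resuming=True)  # continues without re-running "extract"
```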
Manual-approval gates¶
Although not the primary framing in the post, breakpoints can double as a lightweight approval gate — pause before a critical step, wait for a human to confirm, resume.
Trade-offs¶
| Axis | Win | Cost |
|---|---|---|
| Debugging velocity | IDE-like iteration on running workflows | Requires persistent resumable step state in the engine |
| Production safety | Avoids teardown-and-restart for state correction | State mutation is a very sharp tool — requires audit |
| Foreach ergonomics | One breakpoint → all iterations pause | Could inadvertently stall large fan-outs |
| Tenant isolation | Per-instance resume | More complex resume state management in the engine |
Prerequisites in the orchestrator¶
- Persistent + resumable step-runtime state — which Maestro already has for retry / restart support.
- Cooperative step runtime: steps check in with the engine and honour pause signals.
- Per-instance granularity — pausing one tenant's instance doesn't stall unrelated tenants.
- State-mutation API — gated + audited path to modify paused-step state.
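The "cooperative step runtime" prerequisite can be illustrated with a generator: the step body yields after each unit of work, and the engine inspects a pause flag at every check-in. This is a sketch under stated assumptions, with hypothetical names (`copy_partitions`, `StepRunner`, `pause_requested`); it keeps the step's position in memory, whereas a real orchestrator would need the persisted, resumable step state listed in the first prerequisite.

```python
from typing import Iterator


def copy_partitions(partitions: list[str], done: list[str]) -> Iterator[str]:
    """A step body that checks in with the engine after each unit of work."""
    for p in partitions:
        done.append(p)  # the actual work
        yield p         # check-in: the engine may decide to pause here


class StepRunner:
    def __init__(self, body: Iterator[str]) -> None:
        self.body = body
        self.pause_requested = False
        self.status = "RUNNING"

    def drive(self) -> str:
        """Advance the step until it finishes or a pause signal is seen."""
        self.status = "RUNNING"
        for _checkin in self.body:
            if self.pause_requested:
                self.status = "PAUSED"
                return self.status
        self.status = "DONE"
        return self.status


done: list[str] = []
runner = StepRunner(copy_partitions(["p0", "p1", "p2"], done))

runner.pause_requested = True   # pause signal arrives before the first check-in
first = runner.drive()          # one unit of work completes, then the step pauses
snapshot = list(done)

runner.pause_requested = False  # operator resumes the step
second = runner.drive()         # continues from where the body left off
```

Because the pause is honoured only at check-ins, a step never stops halfway through a unit of work; the cost is that a step which checks in rarely will be slow to react to a breakpoint.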
Industry positioning¶
This pattern appears to be rare. Most workflow orchestrators treat debugging as a read-only, after-the-fact activity:
- Airflow — clear + re-run tasks; no pause-and-inspect.
- Step Functions — no breakpoints; workflow re-run from specific state via API.
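- Argo Workflows: `argo suspend` pauses the whole workflow; a step-level pause needs a `suspend` template written into the workflow spec ahead of time, rather than a breakpoint set on an existing step.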
- Temporal — replay debugging via event history (different shape — replay, not live pause).
Maestro's per-step live-pause-and-mutate primitive is what sets it apart from these.
Seen in¶
- sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator — the canonical breakpoint primitive