
NETFLIX 2024-07-22

Netflix — Maestro: Netflix's Workflow Orchestrator

Summary

Netflix open-sources Maestro, the horizontally scalable workflow orchestrator that runs hundreds of thousands of workflows, launches thousands of workflow instances daily, executes ~500,000 jobs/day on average and up to ~2,000,000 jobs on peak days (87.5% YoY growth in executed jobs). Maestro is the orchestration backbone under Netflix's Metaflow deployments and more broadly the Workflow-as-a-Service layer for ETL, ML training, A/B-test pipelines, and cross-storage data movement. The article distils five load-bearing design choices: (1) support for both acyclic AND cyclic workflows, not just DAGs, plus three reusable composite patterns — foreach, subworkflow, conditional branch; (2) five named workflow run strategies — Sequential (FIFO default), Strict-Sequential (FIFO but block on failure until manual unblock), First-only (drop queued if one running), Last-only (cancel running and keep only latest), Parallel-with-Concurrency-Limit (for backfills); (3) parameterized workflows enabled by a homemade Simple Expression Language (SEL) — a subset of the Java Language Specification with runtime loop-iteration limits, array-size checks, object-memory limits, and a Java Security Manager sandbox, exactly to close the "user writes an infinite loop and OOMs the server" code-injection threat; (4) a detailed seven-layer step parameter merging pipeline (defaults → injected → typed-defaults → workflow/step-info → new undefined → step-definition → run/restart overrides); (5) signal-based step dependencies supporting both publish-subscribe and trigger semantics, with "signal lineage" queries over historical publish/consume pairs and an exactly-once execution guarantee for workflows subscribed to one signal or a set of joined signals.
Additional primitives covered: per-step breakpoints (pause-before-step for interactive debugging, shared across foreach iterations), timeline (audit trail of state-machine events per step), retry policies distinguishing platform vs user retries with exponential backoff / fixed interval / zero-retry for non-idempotent steps, aggregated view (merge base aggregated run with current run statuses across multi-run restarts), rollup (recursive leaf-step status counts across nested subworkflows + foreach iterations, eventually consistent), and a two-queue event publishing pipeline (internal Maestro event → internal queue → event processor → external event → external queue such as SNS or Kafka) that bridges Maestro's workflow/instance/step-instance lifecycle changes to downstream event-driven services. Open-sourced at github.com/Netflix/maestro.
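
The rollup primitive — recursive leaf-step status counts across nested subworkflows and foreach iterations — can be sketched as a recursive fold over the workflow tree. The tree shape below is hypothetical; the post does not spell out Maestro's internal data model:

```python
from collections import Counter

def rollup(step):
    """Recursively count leaf-step statuses.

    A step is either a leaf {"status": ...} or a container
    {"children": [...]} such as a subworkflow or a foreach with one
    child per iteration (an assumed shape for illustration only).
    """
    if "children" in step:
        total = Counter()
        for child in step["children"]:
            total += rollup(child)
        return total
    return Counter({step["status"]: 1})

wf = {"children": [
    {"status": "SUCCEEDED"},
    {"children": [                 # a foreach with three iterations
        {"status": "SUCCEEDED"},
        {"status": "RUNNING"},
        {"status": "FAILED"},
    ]},
]}
print(dict(rollup(wf)))  # {'SUCCEEDED': 2, 'RUNNING': 1, 'FAILED': 1}
```

In production such counts would be maintained incrementally as steps change state, which is why the post describes the rollup model as eventually consistent.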

Key takeaways

  1. Netflix's Maestro runs ~500K jobs/day on average, up to ~2M on peak days, with 87.5% YoY growth in executed jobs. "Maestro now launches thousands of workflow instances and runs half a million jobs daily on average, and has completed around 2 million jobs on particularly busy days." Maestro scales horizontally within a single cluster — Netflix's architectural stance is that splitting workflows across clusters "adds unnecessary complexity," because all workflows operate against a single data warehouse. "Since Netflix's data tables are housed in a single data warehouse, we believe a single orchestrator should handle all workflows accessing it." (Source: sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator)

  2. Maestro supports both acyclic and cyclic workflows — not just DAGs — and ships three composite primitives built into the engine. "Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflow, and conditional branch, etc." Direct engine support — rather than encoding these patterns in user code — enables optimisation and a consistent implementation across all workflows. See concepts/dag-vs-cyclic-workflow and patterns/composite-workflow-pattern.

  3. Five predefined run strategies, each matching a distinct operational shape. Sequential (default, FIFO, no state coupling between runs); Strict Sequential (FIFO but block subsequent instances while a prior failure is unresolved — manual mark-unblocked or restart required; "useful for time insensitive but business critical workflows"); First-only (ensure running workflow completes before queueing a new one — "helps to avoid idempotency issues by not queuing new workflow instances"); Last-only (stop running instance when new one arrives — "if a workflow is designed to always process the latest data, such as processing the latest snapshot of an entire table each time"); Parallel with Concurrency Limit (canonical use: backfilling old data within a deadline). (Source: sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator)
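  The five strategies reduce to a small decision table applied when a new instance arrives. A minimal sketch, with hypothetical names and return values condensed from the descriptions above:

  ```python
  from enum import Enum

  class RunStrategy(Enum):
      SEQUENTIAL = "sequential"                # FIFO, default
      STRICT_SEQUENTIAL = "strict_sequential"  # FIFO, block on unresolved failure
      FIRST_ONLY = "first_only"                # keep the running one, drop the new
      LAST_ONLY = "last_only"                  # keep only the latest
      PARALLEL = "parallel"                    # bounded concurrency (backfills)

  def on_new_instance(strategy, running, failed_unresolved, limit=1):
      """Decide what happens to a newly arriving workflow instance.

      `running` is the count of currently running instances;
      `failed_unresolved` is True while a prior failure awaits a
      manual unblock or restart (Strict Sequential only).
      """
      if strategy is RunStrategy.STRICT_SEQUENTIAL and failed_unresolved:
          return "queue"   # blocked until manual unblock or restart
      if strategy is RunStrategy.FIRST_ONLY and running:
          return "drop"    # avoid idempotency issues: never queue a second
      if strategy is RunStrategy.LAST_ONLY and running:
          return "cancel_running_then_start"   # always process the latest data
      if strategy is RunStrategy.PARALLEL:
          return "start" if running < limit else "queue"
      return "start" if not running else "queue"   # plain FIFO
  ```

  Note that Strict Sequential without an unresolved failure behaves like plain Sequential — the blocking kicks in only after a failure.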

  4. Parameterized workflows with code injection — powerful, and explicitly threat-modelled. "However, code injection introduces significant security and safety concerns. For example, users might unintentionally write an infinite loop that creates an array and appends items to it, eventually crashing the server with out-of-memory (OOM) issues." Netflix's rejection of the alternative — pushing code into user business logic — is explicit: "this would impose additional work on users and tightly couple their business logic with the workflow. In certain cases, this approach blocks users to design some complex parameterized workflows." First canonical wiki statement that safe code injection is a load-bearing feature of a workflow orchestrator, not an anti-pattern. See patterns/sel-sandboxed-expression-language.

  5. Simple Expression Language (SEL) — a home-grown safe subset of the Java Language Specification. "SEL is a homemade simple, secure, and safe expression language (SEL) to address the risks associated with code injection within Maestro parameterized workflows. It is a simple expression language and the grammar and syntax follow JLS (Java Language Specifications). SEL supports a subset of JLS, focusing on Maestro use cases." Supports datatypes for all Maestro parameter types, error-raising, datetime handling, predefined utility methods. Runtime enforcement of loop-iteration limits, array-size checks, object-memory limits, plus Java Security Manager sandbox. Canonical wiki instance of a domain-specific safe-expression language as an orchestrator primitive. (Source: sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator)
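  The core idea — accept only a whitelisted grammar, then enforce runtime limits — can be illustrated with a tiny allow-list evaluator. This is not SEL (SEL follows JLS and is implemented in Java); it is a minimal stand-in showing the two defence layers, with the limit constants chosen arbitrarily:

  ```python
  import ast

  MAX_ARRAY_SIZE = 1_000    # analogue of SEL's array-size checks

  class LimitExceeded(Exception):
      pass

  def checked_append(arr, item):
      """Guarded append: the runtime refuses unbounded growth, closing
      the 'infinite loop appending to an array' OOM threat."""
      if len(arr) >= MAX_ARRAY_SIZE:
          raise LimitExceeded("array-size limit reached")
      arr.append(item)
      return arr

  def safe_eval(expr, params):
      """Layer 1: parse and allow only literals, parameter names, and
      arithmetic — no calls, no attribute access, so injected code
      cannot reach the host runtime."""
      allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
                 ast.Name, ast.Add, ast.Sub, ast.Mult, ast.Div,
                 ast.USub, ast.Load)
      tree = ast.parse(expr, mode="eval")
      for node in ast.walk(tree):
          if not isinstance(node, allowed):
              raise ValueError(f"disallowed syntax: {type(node).__name__}")
      return eval(compile(tree, "<sel>", "eval"), {"__builtins__": {}}, params)
  ```

  `safe_eval("a * 2 + 1", {"a": 3})` evaluates normally, while an attempted `__import__('os')` is rejected at parse time because function calls are not in the allow-list.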

  6. Seven-layer step parameter merging pipeline. Parameters merge in deterministic order: (a) default general parameters (workflow_instance_id, step_instance_uuid, step_attempt_id, step_id — reserved, user cannot override); (b) injected parameters from step runtime, dynamically generated from step schema; (c) default typed parameters per step type (e.g. loop_params, loop_index for foreach steps); (d) workflow and step info parameters (identity — e.g. workflow_id); (e) new undefined parameters user supplies at start/restart time; (f) step-definition parameters from the workflow JSON; (g) run/restart parameters overriding everything else. "These two types of parameters are merged at the end so that step runtime can see the most recent and accurate parameter space." See concepts/step-parameter-merging.
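  Since later layers win, the merge is essentially an ordered sequence of dict updates, with the reserved identity parameters re-asserted so users cannot override them. A minimal sketch (how Maestro actually protects reserved keys is an assumption):

  ```python
  RESERVED = {"workflow_instance_id", "step_instance_uuid",
              "step_attempt_id", "step_id"}

  def merge_step_params(layers):
      """Merge parameter layers in the seven-layer order:
      defaults -> injected -> typed defaults -> workflow/step info ->
      new undefined -> step definition -> run/restart overrides.
      Later layers win, except reserved ids from the defaults layer."""
      merged = {}
      for layer in layers:
          merged.update(layer)
      for key in RESERVED & layers[0].keys():
          merged[key] = layers[0][key]   # reserved values always survive
      return merged

  layers = [
      {"workflow_instance_id": 12, "step_id": "s1"},    # (a) defaults, reserved
      {"injected_from_schema": True},                   # (b) injected
      {"loop_index": 0},                                # (c) typed defaults
      {"workflow_id": "demo.wf"},                       # (d) workflow/step info
      {"region": "us-east-1"},                          # (e) new undefined
      {"region": "eu-west-1", "step_id": "hacked"},     # (f) step definition
      {"region": "us-west-2"},                          # (g) run/restart overrides
  ]
  merged = merge_step_params(layers)
  ```

  Here `region` ends up `"us-west-2"` (layer g wins) while `step_id` stays `"s1"` despite the attempted override in layer f.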

  7. Signal-based step dependencies with both publish-subscribe and trigger semantics — and exactly-once guarantee. Signals are messages carrying parameter values, produced either by step outputs or by external systems (SNS / Kafka). A signal-subscribing step unblocks when all its subscribed signals are matched (per a subset of mapped parameter fields, with operators <, >, =, etc.). "Signal triggering guarantees exactly-once execution for the workflow subscribing a signal or a set of joined signals." "Maestro supports 'signal lineage,' which allows users to navigate all historical instances of signals and the workflow steps that match (i.e. publishing or consuming) those signals." Canonical ETL pattern: a table-updating flow publishes a signal containing the partition key; downstream workflows subscribed to that signal/partition combination run exactly once per commit. See patterns/signal-publish-subscribe-step-trigger.
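  The matching rule — a step unblocks when every subscribed signal condition is satisfied by some received signal, compared on a subset of mapped fields with operators like <, >, = — can be sketched as two small predicates. Field names and the condition structure here are illustrative:

  ```python
  import operator

  OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

  def signal_matches(signal, conditions):
      """Check one signal against one condition set:
      conditions maps field -> (op, value)."""
      return all(field in signal and OPS[op](signal[field], value)
                 for field, (op, value) in conditions.items())

  def step_unblocked(received_signals, subscriptions):
      """A subscribing step unblocks only when each subscribed
      condition set is matched by at least one received signal."""
      return all(any(signal_matches(s, cond) for s in received_signals)
                 for cond in subscriptions)

  # Canonical ETL shape: an upstream flow publishes a signal carrying
  # the partition key; a downstream workflow subscribes to it.
  sig = {"table": "playback_events", "partition_date": 20240722}
  cond = {"table": ("=", "playback_events"),
          "partition_date": (">", 20240101)}
  ```

  With `step_unblocked([sig], [cond])` true, the orchestrator would start the downstream workflow; the exactly-once guarantee (deduplicating starts per signal or joined signal set) is engine-side bookkeeping not shown here.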

  8. Per-step breakpoints — IDE-style debugging inside a production orchestrator. "Maestro allows users to set breakpoints on workflow steps, functioning similarly to code-level breakpoints in an IDE. When a workflow instance executes and reaches a step with a breakpoint, that step enters a 'paused' state. This halts the workflow graph's progression until a user manually resumes from the breakpoint." Resuming one instance doesn't resume others; deleting the breakpoint resumes all paused instances. Used during initial workflow development and during foreach runs with many input parameters ("Setting a single breakpoint on a step will cause all iterations of the foreach loop to pause"). Also used for "mutating step states while the workflow is running." See patterns/workflow-step-breakpoint.
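  The described semantics — pausing every instance (and foreach iteration) that reaches a step, per-instance resume, and delete-resumes-all — fit a small registry keyed by step id. A hypothetical sketch of those rules, not Maestro's implementation:

  ```python
  class BreakpointRegistry:
      """Per-step breakpoints shared across instances and foreach
      iterations: resuming one instance leaves the others paused;
      deleting the breakpoint resumes every paused instance."""

      def __init__(self):
          self.breakpoints = set()   # step ids carrying a breakpoint
          self.paused = {}           # step_id -> set of paused instance ids

      def on_step_start(self, step_id, instance_id):
          if step_id in self.breakpoints:
              self.paused.setdefault(step_id, set()).add(instance_id)
              return "paused"        # halts this instance's graph progression
          return "running"

      def resume_instance(self, step_id, instance_id):
          self.paused.get(step_id, set()).discard(instance_id)

      def delete_breakpoint(self, step_id):
          self.breakpoints.discard(step_id)
          return self.paused.pop(step_id, set())   # all of these continue
  ```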

  9. Retry policies distinguish platform vs user retries. "Maestro distinguishes between two types of retries: 'platform' and 'user.' Platform retries address platform-level errors unrelated to user logic, while user retries are for user-defined conditions. Each type can have its own set of retry policies." Delays support fixed intervals + exponential backoff. Retries default to zero for non-idempotent steps: "Maestro provides the flexibility to set retries to zero for non-idempotent steps to avoid retry."
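  The delay schedule reduces to a small function per policy; platform and user errors would each carry their own policy object. The policy shape below is hypothetical:

  ```python
  def retry_delay(attempt, policy):
      """Delay in seconds before retry number `attempt` (1-based),
      or None when retries are exhausted. policy is a hypothetical
      dict: {"type": "fixed"|"exponential", "base_seconds": ...,
      "max_retries": ...}. max_retries=0 models the non-idempotent
      case: never retry."""
      if attempt > policy["max_retries"]:
          return None
      if policy["type"] == "fixed":
          return policy["base_seconds"]
      return policy["base_seconds"] * 2 ** (attempt - 1)   # exponential backoff
  ```

  For example, an exponential policy with a 2-second base yields delays 2, 4, 8, … while a non-idempotent step with `max_retries=0` gets no retry at all.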

  10. Two-tier event publishing — internal queue processed within Maestro, external queue (SNS / Kafka) emitted to downstream services. "When workflow definition, workflow instance or step instance is changed, Maestro generates an event, processes it internally and publishes the processed event to external system(s)." Event processor converts internal job events into external events; external events are classified as workflow change events (definition / property change) or instance status change events (workflow-instance / step-instance state transitions). See patterns/internal-external-event-pipeline.
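  The pipeline shape — lifecycle change → internal queue → event processor → classified external event → external queue — can be sketched with two in-process queues standing in for Maestro's internal queue and the SNS/Kafka sink. Event field names here are illustrative:

  ```python
  import json
  import queue

  internal_q, external_q = queue.Queue(), queue.Queue()

  def emit_internal(kind, entity_id, old, new):
      """A workflow-definition, workflow-instance, or step-instance
      change lands on the internal queue first."""
      internal_q.put({"kind": kind, "id": entity_id, "old": old, "new": new})

  def process_events():
      """Event processor: drain internal events, convert each into an
      external event, and publish it (here to a stand-in queue; in
      production the sink would be SNS or Kafka)."""
      while not internal_q.empty():
          ev = internal_q.get()
          external_q.put({
              # classification per the post: definition/property changes
              # vs workflow/step instance status changes
              "type": ("WORKFLOW_CHANGE" if ev["kind"] == "definition"
                       else "INSTANCE_STATUS_CHANGE"),
              "payload": json.dumps(ev),
          })

  emit_internal("step_instance", "wf1.step2", "RUNNING", "SUCCEEDED")
  process_events()
  ```

  Decoupling the two queues lets Maestro finish its own processing before anything reaches downstream event-driven services.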

Architectural numbers / production scale

  • ~500,000 jobs/day on average.
  • ~2,000,000 jobs on peak days.
  • Thousands of workflow instances launched per day.
  • Hundreds of thousands of workflows migrated internally with minimal interruption.
  • 87.5% YoY increase in executed jobs over the year preceding this post (from the previous Maestro blog post baseline).
  • Single-cluster, horizontally scalable design rationale tied to Netflix's single data-warehouse architecture.

Systems extracted

Concepts extracted

  • concepts/dag-vs-cyclic-workflow — orchestrator support for cyclic workflows alongside DAGs.
  • concepts/step-parameter-merging — deterministic seven-layer merge order for step parameters.

Patterns extracted

  • patterns/sel-sandboxed-expression-language — safe code injection in an orchestrator via a homemade domain subset of a mainstream language, with runtime limits + platform sandbox.
  • patterns/signal-publish-subscribe-step-trigger — signals serve both publish-subscribe patterns (one producer → many consumers unblocked) and trigger patterns (external event → flow start) with exactly-once semantics.
  • patterns/internal-external-event-pipeline — two-tier event queue: internal events for workflow-engine state machine transitions, external events (after transformation) emitted to SNS / Kafka for downstream services.
  • patterns/workflow-step-breakpoint — per-step pause primitive with per-instance resume, foreach-aware propagation, and production-safe state mutation.
  • patterns/composite-workflow-pattern — foreach / subworkflow / conditional-branch composed into higher-order workflows (e.g. auto-recovery: subworkflow → status-check step → conditional branch → recovery subworkflow → re-run subworkflow).

Caveats

  • Announcement-voice, not a retrospective. The post is the open-sourcing announcement — no specific latency percentiles, storage backend for signal lineage, migration-cutover details, or incident retrospectives. The headline scale numbers (500K / 2M jobs/day, 87.5% YoY) are credible but not accompanied by cluster size, DB storage footprint, or throughput-per-worker data.
  • SEL internals limited to the post. The linked SEL documentation is the deeper source — post covers only the high-level stance.
  • "Cyclic workflow" support stated but not elaborated. Maestro explicitly distinguishes itself from DAG-only orchestrators, but the post does not walk through a cyclic-workflow example or its termination semantics.
  • Aggregated view / rollup consistency model loose. "Due to these processes, the rollup model is eventually consistent." No SLO given; calculation cadence unspecified.
  • Signal lineage storage backend undisclosed. The query "navigate all historical instances of signals and the workflow steps that match" implies a storage + index layer, not disclosed.
  • Workflow-as-JSON is user-written. Workflow definitions are written in JSON; no mention of visual/programmatic builders.
  • Step-runtime interface partly abstract. The runtime interface is described as "a set of basic APIs to control the behavior of a step instance at execution runtime" + "simple data structures to track step runtime state and execution result" — not enumerated.

Source

sources/2024-07-22-netflix-maestro-netflixs-workflow-orchestrator
