
CONCEPT

Embedded workflow engine

Definition

An embedded workflow engine is a durable-execution engine delivered as a library dependency that runs inside the host service's process, rather than as a separate cluster the host service calls over the network. Workflow state lives in the host service's existing database; execution happens on the host service's JVM (or equivalent runtime) on a dedicated thread pool. There is no orchestration cluster to deploy, no network round-trips per activity, and no Tier 0 critical dependency added to the host service's failure model.

Skipper (Airbnb, 2026-04-28) is the canonical wiki instance. (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

Structural properties

  • Library, not service. Added to the host service's build (Maven / Gradle artifact for JVM instances); no separate deployment artifact, no separate scaling tier.
  • Shares the host service's persistence. The workflow engine uses whichever database the host service already depends on (Skipper supports MySQL and Airbnb's internal UDS; the peak-throughput deployments run on DynamoDB). No operational burden from a new storage tier.
  • Shares the host service's lifecycle. Workflow execution starts + stops with the host service's process. Multi-host durability comes from the persistent scheduler + lease-period replay on any healthy host, not from dedicated worker pools.
  • Scoped to one service's JVM. Cross-language support and cross-service orchestration are explicit non-goals; those call for a separate Temporal-class cluster.
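The structural properties above can be sketched as code. This is a hypothetical shape, not Skipper's real API (the class and method names here are illustrative): the engine is an ordinary object constructed from the host service's existing persistence handle plus a dedicated, bounded thread pool, and starting a workflow is a local call plus a row write, not an RPC.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only -- EmbeddedEngine is not Skipper's actual API.
final class EmbeddedEngine {
    private final Map<String, String> stateTable; // stands in for a table in the host DB
    private final ExecutorService workers;        // dedicated pool, isolated from request threads

    EmbeddedEngine(Map<String, String> hostDbTable, int maxConcurrency) {
        this.stateTable = hostDbTable;
        this.workers = Executors.newFixedThreadPool(maxConcurrency);
    }

    // Starting a workflow is submit-to-local-pool; durability comes from the state row.
    Future<?> start(String workflowId, Runnable workflowBody) {
        stateTable.put(workflowId, "RUNNING");
        return workers.submit(() -> {
            workflowBody.run();
            stateTable.put(workflowId, "COMPLETED");
        });
    }

    // Convenience for the demo: run synchronously and report terminal state.
    String runToCompletion(String workflowId, Runnable workflowBody) {
        try {
            start(workflowId, workflowBody).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return stateTable.get(workflowId);
    }

    void shutdown() { workers.shutdown(); } // engine lifecycle == host process lifecycle
}
```

Note there is no network client anywhere: the "deployment" of the engine is the constructor call, and shutting down the host process shuts down the engine.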

Why adopt this shape

Airbnb's explicit rationale (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine):

"External orchestration engines are the industry gold standard for durable workflow execution, providing exactly-once semantics and battle-tested reliability. However, they require dedicated infrastructure — a cluster of servers and a persistence layer, along with operational expertise — to maintain. For our highest-criticality, 'Tier 0' services (services which directly impact user-facing transactions), adding a new critical dependency was problematic. An orchestration cluster outage would mean every dependent service would lose the ability to start or advance workflows."

The embedded shape's distinctive property is that a workflow engine outage cannot be a separate incident: if the host service is up, the workflow engine is up; if the host service is down, its in-flight workflows are hibernating in the database, to be picked up by whichever healthy replica claims their lapsed leases.
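The pickup mechanism can be sketched minimally. This is an illustrative model, not Skipper's actual schema: each running workflow row carries a lease (owner host + expiry); a crashed host simply stops renewing, and any healthy replica's scheduler claims rows whose lease has lapsed and resumes them via replay, with no dedicated worker fleet involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row shape -- field names are illustrative, not Skipper's schema.
final class WorkflowRow {
    final String id;
    final String ownerHost;
    final long leaseExpiresAtMillis;

    WorkflowRow(String id, String ownerHost, long leaseExpiresAtMillis) {
        this.id = id;
        this.ownerHost = ownerHost;
        this.leaseExpiresAtMillis = leaseExpiresAtMillis;
    }
}

final class LeaseScheduler {
    static final long LEASE_PERIOD_MILLIS = 30_000; // assumed lease length for the demo

    // One scan pass: rows with lapsed leases are re-owned by this host with a fresh lease.
    static List<WorkflowRow> claimExpired(List<WorkflowRow> rows, long nowMillis, String thisHost) {
        List<WorkflowRow> claimed = new ArrayList<>();
        for (WorkflowRow row : rows) {
            if (row.leaseExpiresAtMillis < nowMillis) {
                claimed.add(new WorkflowRow(row.id, thisHost, nowMillis + LEASE_PERIOD_MILLIS));
            }
        }
        return claimed;
    }
}
```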

Contrast with external orchestration clusters

Property                    | Embedded (Skipper)         | External (Temporal)
--------------------------- | -------------------------- | ----------------------------------------------
Deployment                  | Library dependency         | Separate cluster + persistence
Persistence                 | Host service's DB          | Dedicated store (concepts/temporal-persistence-layer)
Cross-service orchestration | No                         | Yes
Cross-language              | No (JVM-only for Skipper)  | Yes
Per-activity coordination   | In-process                 | Network round-trip to cluster
Happy-path overhead         | ~0 (batched DB writes)     | Coordinator round-trips per activity
Tier 0 dependency added     | No                         | Yes (new critical cluster)
Full event-history audit    | No (state-field replay)    | Yes
Operational burden          | Library upgrade            | Cluster ops + DB ops

(Sources: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine canonicalises the embedded column; earlier Temporal sources — sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform, sources/2025-02-12-flyio-the-exit-interview-jp-phillips, sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple — canonicalise the external column.)

Design invariants this imposes

  • Replay from checkpointed actions — the durability mechanism is forced to be replay-based because the only handles on state are the host DB rows + checkpointed action outputs.
  • Workflow determinism — replay requires the workflow method to be deterministic; side effects move to checkpointed actions.
  • At-least-once action execution — crash-after-execute-before-checkpoint is possible; actions must be idempotent.
  • Performance-neutrality — the library must run on separate thread pools + respect concurrency limits so it doesn't compete with the host service's request path.
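The first three invariants can be shown working together in one sketch (names are illustrative, not Skipper's API). Action outputs are checkpointed under stable step keys; on replay the deterministic workflow method re-runs from the top, but already-checkpointed steps return their stored output instead of executing again:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative replay mechanism -- not Skipper's actual implementation.
final class ReplayContext {
    private final Map<String, String> checkpoints; // persisted next to workflow state
    private int realExecutions = 0;                // counts non-replayed runs, for the demo

    ReplayContext(Map<String, String> checkpoints) { this.checkpoints = checkpoints; }

    String action(String stepKey, Supplier<String> sideEffect) {
        if (checkpoints.containsKey(stepKey)) {
            return checkpoints.get(stepKey);   // replay: stored output, no re-execution
        }
        String out = sideEffect.get();         // a crash HERE, after the side effect but
        checkpoints.put(stepKey, out);         // before the checkpoint, causes a re-run on
        realExecutions++;                      // replay -- hence at-least-once execution
        return out;                            // and the idempotency requirement
    }

    int realExecutions() { return realExecutions; }
}

final class RefundFlow {
    // Deterministic: same inputs yield the same sequence of steps and step keys.
    static String run(ReplayContext ctx, String orderId) {
        String charge = ctx.action("reverse-charge", () -> "reversed:" + orderId);
        String mail = ctx.action("notify", () -> "notified:" + orderId);
        return charge + "|" + mail;
    }
}
```

Running the workflow a second time against the same checkpoint map returns identical results while executing zero side effects, which is exactly why the workflow method must be deterministic and the actions idempotent.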

Fragments-to-cohesion argument

The "domain logic fragmentation" argument in the Skipper post is distinct from the infrastructure argument: teams writing multi-step flows with queues + callbacks end up with business logic scattered across queue consumers, scheduled jobs, reconciliation scripts, and endpoint handlers. An embedded workflow engine collapses the fragments into one class whose @WorkflowMethod reads like the process it represents. (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

This isn't unique to embedded engines — Temporal makes the same argument — but it's the second load-bearing reason Airbnb cites alongside the Tier-0-dependency avoidance.
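The cohesion claim can be made concrete. In the sketch below, @WorkflowMethod is a locally declared stand-in for Skipper's annotation (its real signature is assumed, not taken from the post); the point is the shape: steps that would otherwise live in a queue consumer, a scheduled job, and a reconciliation script read top to bottom in one method.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;

// Local stand-in declared only so this sketch compiles -- not Skipper's annotation.
@Retention(RetentionPolicy.RUNTIME)
@interface WorkflowMethod {}

final class RefundWorkflow {
    // One method per process: the fragments collapse into a linear sequence.
    @WorkflowMethod
    List<String> refund(String orderId) {
        List<String> steps = new ArrayList<>();
        steps.add(validate(orderId));
        steps.add(reverseCharge(orderId));
        steps.add(notifyGuest(orderId));
        return steps;
    }

    private String validate(String id)      { return "validated:" + id; }
    private String reverseCharge(String id) { return "reversed:" + id; }
    private String notifyGuest(String id)   { return "notified:" + id; }
}
```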

Tradeoffs

  • Cross-language / cross-service scope traded away. If a workflow must orchestrate calls in Python + JVM + Go, or span two teams' services as one transactional unit, the embedded shape doesn't fit; a separate orchestration cluster does. Airbnb names this explicitly.
  • No central event-history audit surface. Skipper's state-field-only replay trades auditability for runtime efficiency: an investigator cannot replay the full sequence of events a workflow experienced, only inspect the current state plus the checkpointed action outputs.
  • Library version skew. Each service deploys its own Skipper version; operator tools / migration tooling must tolerate drift across the fleet. (Not yet discussed in the post; implicit in any embedded-library model.)

Seen in

  • sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine — canonical wiki instance. Skipper is the embedded workflow-engine primitive: library dependency, shares host DB, no cluster, in-process execution on a dedicated thread pool, ~0 happy-path overhead. Used across 15+ Airbnb teams (insurance, payments, media, infrastructure, incentives, wallet); peaks at 10 000 workflows / second on DynamoDB. Airbnb's explicit rationale names the rejected alternatives (external clusters, cloud-managed workflow services, homegrown queue-based systems).