Airbnb — Skipper: Building Airbnb's embedded workflow engine
Summary
Airbnb Engineering (Ricardo Gamba and Andriy Sergiyenko,
2026-04-28) describes Skipper, a Java/Kotlin library that
provides durable execution as an
embedded library inside each service rather than as a separate
orchestration cluster like Temporal or
Cadence. Workflows and their side-effectful
actions are plain annotated classes; Skipper persists state in the
host service's existing database (MySQL or Airbnb's internal
Unified Data Store), replays the workflow method from the start on
crash while returning previously checkpointed action results
instantly, and exposes compensation, signals, and durable
waitUntil as first-class primitives. The post's core framing is
that external orchestration clusters add a Tier 0 dependency that
the authors wanted to avoid; the embedded model trades
cross-language support and cross-service orchestration for
operational simplicity. Skipper has run in production for more
than a year, powers 15+ use cases across insurance,
payments, media, infrastructure, incentives, and wallet teams, and
at peak has scaled to 10,000 workflows per second on Amazon
DynamoDB.
Key takeaways
- Durable execution without a new Tier 0 dependency. Airbnb explicitly rejected external orchestration engines because "adding a new critical dependency was problematic. An orchestration cluster outage would mean every dependent service would lose the ability to start or advance workflows" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Cloud-managed workflow services were rejected for vendor lock-in and the same critical-dependency concern. Homegrown queue-based systems "avoid external dependencies but trade them for bespoke complexity: each team implementing and maintaining its own retry logic, state management, and compensation flows." Skipper is the middle option: an embedded library sharing the host service's lifecycle.
- The domain-logic fragmentation argument. "When teams wire up multi-step processes using queues or ad-hoc async plumbing, the domain logic ends up fragmented. A single business workflow, such as processing an insurance claim, gets scattered across queue consumers, scheduled jobs, callback endpoints, and reconciliation scripts. There's no single place in the code where you can read what the business process actually does." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Skipper's programming model collapses a business workflow into one class whose @WorkflowMethod reads like the process it represents.
- Replay is the durability mechanism. "When a workflow starts, Skipper executes the workflow method and checkpoints each action's result to the database. If the workflow needs to wait (via waitUntil), Skipper persists the current state and the workflow hibernates, consuming no compute resources. When conditions change — a signal arrives, a timer expires, or the service restarts — Skipper replays the workflow method from the beginning. Previously executed actions don't re-execute; they return their checkpointed results instantly." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonical instance of replay from checkpointed actions.
- State fields, not event logs. "Unlike event-sourced orchestration systems that reconstruct state by replaying an entire event history, Skipper persists state fields directly. There's no event log to replay, just current state and checkpointed action results. This makes execution leaner, especially for workflows with many signals or long histories, though it trades some auditability for that efficiency" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Divergence from Temporal's event-history-replay model.
- The happy path has near-zero overhead. Skipper's pitch is that most workflow engines impose overhead on every execution. "External orchestration engines require network round-trips to a central cluster for every activity invocation — the worker executes the activity, then calls back to the cluster to persist the result before the workflow can advance. This is fundamental to their architecture; the cluster is the coordinator." In contrast: "When a workflow starts, two things happen at the database level: the workflow instance is created, and a delayed timeout task is scheduled as a durability guarantee. Then the workflow executes entirely in-process. Actions run as normal method calls on an in-memory execution queue on a dedicated thread pool, checkpoints are batched, and the workflow can run to completion without any further coordination." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonicalises delayed timeout task as crash safety net.
- Determinism is a replay correctness invariant. "Replay imposes one key constraint: workflow methods must be deterministic. Given the same inputs, checkpointed action results, and state fields, the workflow must make the same decisions and call actions in the same order. All side effects, such as API calls, time-dependent logic, and randomness, belong in actions, never in the workflow method directly" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonical framing of the determinism requirement for replay.
- Compensation as a first-class primitive. The @Compensate annotation pairs each action with an undo method; on non-retryable failure Skipper "automatically executes compensation methods in reverse order (releasing held inventory, refunding charges, reverting state changes), walking the system back to a consistent state. Developers express what 'undo' means for each action; Skipper handles the orchestration of when and in what order the undos run. The result is eventual consistency without distributed transactions, and workflow code that stays focused on the business process rather than cleanup choreography" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). First-class language-level realisation of the saga compensating-action idiom.
- Signals decouple external events from workflow progress. "Signals (@SignalMethod) let external events push data into a running workflow, updating @StateField fields that the workflow's waitUntil conditions evaluate against" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). A workflow that "waits for approval, then processes payment, then sends confirmation" reads linearly; the external event stream updates state without the workflow author touching queue plumbing.
- The programming model is deliberately minimal. "By exposing workflows and actions as plain Java/Kotlin classes, with a minimal, annotation-based contract, Skipper enables developers to write business logic that looks like business logic, not framework boilerplate … Skipper isn't the first system to offer 'write normal code, get durable execution' — other workflow engines do as well — but Skipper's focus is on removing the adoption friction: fewer required constructs and less setup, so teams that use Java/Kotlin can get to a first durable workflow with minimal ceremony" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonicalises workflow primitives as annotated classes.
- Five design principles. "Succinct ergonomics. No single point of failure. Leverage existing dependencies. Self-service ready. Performance-neutral" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Performance neutrality comes from "separate thread pools, configurable concurrency limits, and efficient hibernation patterns to coexist peacefully with latency-sensitive request handling": the guarantee is that adopting Skipper doesn't compete with the host service's request path.
- Embedded-model tradeoffs. "The replay model requires deterministic workflow methods, which can be unintuitive for developers new to the pattern … Actions may execute more than once in edge cases (crash after execution but before checkpoint). Actions should be idempotent … Changing a workflow's structure can break in-flight workflows. Teams need versioning strategies for workflow evolution" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). At-least-once action execution is an explicit property; idempotency is the application's responsibility.
- Production impact: 15+ use cases, 10,000 workflows/sec on DynamoDB. "Skipper has been running in production for more than a year, powering 15+ use cases across insurance, payments, media, infrastructure, incentives, and wallet teams … The Media Foundation team uses Skipper to coordinate video processing pipelines — validation, transcoding, thumbnail generation — surviving pod restarts across multi-hour jobs. Infrastructure teams rely on it for durable Flink job lifecycle management and reliable data pipeline CRUD operations … At peak, Skipper has scaled to 10,000 workflows per second on Amazon DynamoDB, enabled by its lean execution model" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine).
- What the authors would reconsider. "Workflow evolution remains the biggest friction point. While we have versioning patterns (create new method versions, migrate traffic, deprecate old versions), better tooling — automated compatibility checking, migration assistants, runtime versioning support — would smooth the experience. Debugging replayed workflows also requires mental model adjustment: engineers must understand that log timestamps and call sequences reflect replays, not original execution. Better observability tooling, particularly replay visualization, would help" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Honest operator-experience caveats on top of the happy-path pitch.
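The replay mechanism quoted in the takeaways can be illustrated with a toy model. The sketch below is not Skipper's API: `ReplaySketch`, `CheckpointStore`, `Run`, and `bookingWorkflow` are hypothetical names, and an in-memory list stands in for the database. It assumes checkpoints are matched to actions by call order, which is one plausible reason the workflow method has to be deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model of replay from checkpointed actions. Every name here is
// hypothetical; this is not Skipper's API.
final class ReplaySketch {

    // Stands in for the database rows holding checkpointed action results.
    static final class CheckpointStore {
        final List<Object> results = new ArrayList<>();
    }

    // One execution attempt. The cursor restarts at 0 on every replay.
    static final class Run {
        private final CheckpointStore store;
        private int cursor = 0;
        int liveExecutions = 0; // actions that really ran during this attempt

        Run(CheckpointStore store) {
            this.store = store;
        }

        @SuppressWarnings("unchecked")
        <T> T action(Supplier<T> body) {
            if (cursor < store.results.size()) {
                // Already checkpointed: return the stored result, skip the side effect.
                return (T) store.results.get(cursor++);
            }
            T result = body.get();     // run the side effect
            store.results.add(result); // checkpoint its result
            cursor++;
            liveExecutions++;
            return result;
        }
    }

    // Deterministic workflow method: same actions, same order, on every run.
    static String bookingWorkflow(Run run, boolean crashAfterCharge) {
        String hold = run.action(() -> "hold-1");
        String charge = run.action(() -> "charge-for-" + hold);
        if (crashAfterCharge) {
            throw new IllegalStateException("simulated crash");
        }
        return run.action(() -> "email-for-" + charge);
    }

    public static void main(String[] args) {
        CheckpointStore store = new CheckpointStore();

        Run first = new Run(store);
        try {
            bookingWorkflow(first, true);
        } catch (IllegalStateException crash) {
            // The process "dies" here; only the checkpoint store survives.
        }

        Run replay = new Run(store);
        String result = bookingWorkflow(replay, false);

        System.out.println(first.liveExecutions);  // 2
        System.out.println(replay.liveExecutions); // 1
        System.out.println(result);                // email-for-charge-for-hold-1
    }
}
```

On replay only the third action runs live; the first two return their stored results. If the workflow method called actions in a different order on the second run, the positional lookup in this toy would hand back the wrong results, which is the failure the determinism constraint guards against.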
Programming model — the Workflow / Action split
Skipper's two abstractions:
- Workflows — orchestration logic. Plain Java/Kotlin class extending Workflow. The method annotated @WorkflowMethod holds the end-to-end business process. Fields annotated @StateField (or @StateParam) persist across replays. Methods annotated @SignalMethod let external events mutate state.
- Actions — side-effectful operations (API calls, DB writes, notifications). Class extends Actions. Methods annotated @Execute(checkpoint = true) are checkpointed: "the result of an action survives crashes and restarts." Methods annotated @Compensate are the undo pair.
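The pairing of an action with its undo method can be modelled as a stack of compensations that is unwound on failure. The following is a minimal sketch under stated assumptions, not Skipper's implementation: `CompensationSketch`, `Saga`, `step`, and `runBooking` are hypothetical names, and lambdas stand in for annotated action and compensate methods.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of reverse-order compensation. All names are hypothetical.
final class CompensationSketch {

    static final class Saga {
        private final Deque<Runnable> undoStack = new ArrayDeque<>();

        // Run an action; register its undo only once the action has succeeded.
        void step(Runnable action, Runnable compensation) {
            action.run();
            undoStack.push(compensation);
        }

        // Unwind completed steps, most recent first.
        void compensate() {
            while (!undoStack.isEmpty()) {
                undoStack.pop().run();
            }
        }
    }

    static List<String> runBooking(boolean confirmFails) {
        List<String> log = new ArrayList<>();
        Saga saga = new Saga();
        try {
            saga.step(() -> log.add("hold inventory"), () -> log.add("release inventory"));
            saga.step(() -> log.add("charge card"), () -> log.add("refund card"));
            saga.step(() -> {
                if (confirmFails) {
                    throw new IllegalStateException("non-retryable failure");
                }
                log.add("confirm booking");
            }, () -> log.add("cancel booking"));
        } catch (IllegalStateException failure) {
            saga.compensate();
        }
        return log;
    }

    public static void main(String[] args) {
        // [hold inventory, charge card, confirm booking]
        System.out.println(runBooking(false));
        // [hold inventory, charge card, refund card, release inventory]
        System.out.println(runBooking(true));
    }
}
```

The failed step never registers its own compensation, so only work that actually completed is undone. How Skipper persists the compensation order across crashes is not described in this summary and is not modelled here.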
A workflow is invoked from outside as a typed, codegen-free call.
Durable wait is a first-class primitive:
waitUntil { paymentCaptured } // hibernates until signal arrives
waitUntil({ photosApproved != null }, Duration.ofHours(24)) // with timeout
(Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
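How a signal, a state field, and a durable wait interact can be sketched with a toy model. Everything below is an assumption made for illustration: `WaitUntilSketch`, `Hibernate`, `ApprovalWorkflow`, and `advance` are hypothetical names, an object field stands in for a persisted state field, and the timeout variant of waitUntil is not modelled.

```java
import java.util.function.BooleanSupplier;

// Toy model of waitUntil plus a signal. All names are hypothetical.
final class WaitUntilSketch {

    // Control-flow marker: the run stops here until state changes.
    static final class Hibernate extends RuntimeException {
    }

    static final class ApprovalWorkflow {
        Boolean photosApproved = null; // plays the role of a persisted state field
        String outcome = null;
        int runs = 0;

        void waitUntil(BooleanSupplier condition) {
            if (!condition.getAsBoolean()) {
                throw new Hibernate();
            }
        }

        // Plays the role of the workflow method: re-run from the top on wake-up.
        void run() {
            runs++;
            waitUntil(() -> photosApproved != null);
            outcome = photosApproved ? "published" : "rejected";
        }

        // Plays the role of a signal method: an external event mutates state.
        void reviewCompleted(boolean approved) {
            photosApproved = approved;
        }
    }

    // Toy engine step: run the method and report whether it finished.
    static boolean advance(ApprovalWorkflow workflow) {
        try {
            workflow.run();
            return true;
        } catch (Hibernate waiting) {
            return false; // a real engine would persist state and release the thread
        }
    }

    public static void main(String[] args) {
        ApprovalWorkflow workflow = new ApprovalWorkflow();
        System.out.println(advance(workflow)); // false
        workflow.reviewCompleted(true);
        System.out.println(advance(workflow)); // true
        System.out.println(workflow.outcome);  // published
        System.out.println(workflow.runs);     // 2
    }
}
```

The workflow method reads top to bottom as "wait for review, then decide"; the signal only changes state, and the second run passes the same wait because the condition now holds.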
Architecture — the happy path
At workflow start, Skipper does two database writes:
- Persist the workflow instance row.
- Schedule a delayed timeout task as a durability safety net.
Then execution runs entirely in-process on a dedicated thread pool with an in-memory execution queue. Checkpoints are batched. If the process crashes mid-workflow, "the persistent scheduler picks up the workflow after a lease period expires and replays it." If the workflow completes normally, the timeout task fires harmlessly and is discarded. (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
This means Skipper adds "very little overhead (just a few
database writes)" on the happy path — the engine only earns its
keep on crash, waitUntil hibernation, or compensation.
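The two start-time writes and the later scheduler check can be modelled in a few lines. This is a sketch under assumptions, not Skipper's schema or scheduler: `SafetyNetSketch`, `Db`, `TimeoutTask`, `start`, `complete`, and `poll` are hypothetical names, in-memory collections stand in for database tables, and a logical clock replaces real timers and leases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy model of a delayed timeout task used as a crash safety net.
// All names and the logical clock are hypothetical.
final class SafetyNetSketch {
    enum Status { RUNNING, COMPLETED }

    static final class TimeoutTask {
        final String workflowId;
        final long dueAt;

        TimeoutTask(String workflowId, long dueAt) {
            this.workflowId = workflowId;
            this.dueAt = dueAt;
        }
    }

    static final class Db {
        final Map<String, Status> instances = new HashMap<>();
        final List<TimeoutTask> tasks = new ArrayList<>();
    }

    // Workflow start: persist the instance and schedule the delayed task.
    static void start(Db db, String id, long now, long leaseMillis) {
        db.instances.put(id, Status.RUNNING);
        db.tasks.add(new TimeoutTask(id, now + leaseMillis));
    }

    // Normal completion, reached without touching the scheduler.
    static void complete(Db db, String id) {
        db.instances.put(id, Status.COMPLETED);
    }

    // Scheduler poll: consume due tasks and return workflows that need replay.
    static List<String> poll(Db db, long now) {
        List<String> toReplay = new ArrayList<>();
        Iterator<TimeoutTask> it = db.tasks.iterator();
        while (it.hasNext()) {
            TimeoutTask task = it.next();
            if (task.dueAt > now) {
                continue; // not due yet
            }
            it.remove();
            if (db.instances.get(task.workflowId) == Status.RUNNING) {
                toReplay.add(task.workflowId); // never finished: replay it
            }
            // A COMPLETED workflow's task is simply discarded.
        }
        return toReplay;
    }

    public static void main(String[] args) {
        Db db = new Db();
        start(db, "wf-ok", 0, 1000);
        start(db, "wf-crashed", 0, 1000);
        complete(db, "wf-ok"); // wf-crashed never reaches this call

        System.out.println(poll(db, 500));   // []
        System.out.println(poll(db, 1000));  // [wf-crashed]
        System.out.println(db.tasks.size()); // 0
    }
}
```

The completed workflow's task fires and is dropped without further work, while the crashed one is handed back for replay. A real engine would presumably schedule a fresh timeout task for the replayed run; the toy omits that.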
Operational numbers
- 15+ use cases in production (insurance, payments, media, infrastructure, incentives, wallet).
- >1 year in production at time of writing (2026-04-28).
- 10,000 workflows per second at peak, on Amazon DynamoDB.
- Multi-hour Media Foundation video-processing jobs survive pod restarts.
- Days to weeks for scheduled financial operations (claim processing, policy lifecycle).
(Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Caveats
- No public open-source release is claimed in the post. Skipper is an internal Airbnb library; this is a design-disclosure post, not a release announcement. External practitioners can read the ideas but can't adopt the code.
- Event-log contrast vs Temporal is authorial framing. Skipper persists "state fields directly … no event log to replay"; Temporal maintains full event history. Airbnb explicitly notes "it trades some auditability for that efficiency" — teams needing step-by-step auditability may prefer Temporal's model.
- No cross-language or cross-service orchestration. The embedded-library shape scopes Skipper to JVM workflows running inside one service. The post names this: "teams needing cross-language support or cross-service orchestration may find a dedicated orchestration system more appropriate."
- Determinism is operationally unintuitive. The post flags replay-debugging as a mental-model shift; "log timestamps and call sequences reflect replays, not original execution." No public disclosure of what determinism-violation tooling looks like (linter? runtime checker?).
- Workflow evolution is manual. The post names workflow evolution as its biggest friction point: versioning patterns exist ("create new method versions, migrate traffic, deprecate old versions"), but the authors say better tooling would smooth the experience.
- At-least-once action execution. Idempotency is the application's responsibility; the post names this tradeoff explicitly ("Actions may execute more than once in edge cases … Actions should be idempotent").
- No per-use-case operational numbers. The 10,000 workflows/sec peak is cited, but per-team latency budgets, replay frequency, and compensation-rate telemetry are not disclosed.
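The at-least-once caveat above is usually handled by making the side effect itself idempotent. A minimal sketch, assuming the downstream service deduplicates on a caller-supplied key: `IdempotentActionSketch`, `PaymentService`, and the key format are hypothetical and do not come from the post.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an idempotent action target. All names are hypothetical.
final class IdempotentActionSketch {

    static final class PaymentService {
        private final Map<String, String> chargeIdByKey = new HashMap<>();
        int chargesMade = 0;

        // Repeated calls with the same key return the original charge id
        // instead of charging again. The amount is unused in this toy.
        String charge(String idempotencyKey, long amountCents) {
            String existing = chargeIdByKey.get(idempotencyKey);
            if (existing != null) {
                return existing;
            }
            chargesMade++;
            String chargeId = "charge-" + chargesMade;
            chargeIdByKey.put(idempotencyKey, chargeId);
            return chargeId;
        }
    }

    public static void main(String[] args) {
        PaymentService payments = new PaymentService();
        // A key that is stable across replays, e.g. workflow id plus step name.
        String key = "workflow-42/charge-guest";

        String firstAttempt = payments.charge(key, 12_500L);
        // Simulated crash after execution but before checkpoint: the replay
        // finds no checkpoint and runs the action a second time.
        String retry = payments.charge(key, 12_500L);

        System.out.println(firstAttempt.equals(retry)); // true
        System.out.println(payments.chargesMade);       // 1
    }
}
```

The key has to be derived from data that is identical on every replay; a random value generated inside the action would defeat the deduplication.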
Source
- Original: https://medium.com/airbnb-engineering/skipper-building-airbnbs-embedded-workflow-engine-f6c34552146f?source=rss----53c7c27702d5---4
- Raw markdown: raw/airbnb/2026-04-28-skipper-building-airbnbs-embedded-workflow-engine-29f5842a.md
Related
- systems/airbnb-skipper — the system itself.
- companies/airbnb — source company.
- concepts/embedded-workflow-engine — the architectural primitive (library-in-service vs external cluster) this post canonicalises.
- concepts/workflow-replay-from-checkpointed-actions — Skipper's durability mechanism.
- concepts/workflow-determinism-requirement — invariant required by the replay model.
- concepts/workflow-compensation-action — @Compensate primitive.
- concepts/workflow-signal — @SignalMethod primitive.
- concepts/durable-execution — parent property Skipper operationalises.
- concepts/fault-tolerant-long-running-workflow — the class of problem.
- patterns/workflow-primitives-as-annotated-classes — the programming-model pattern.
- patterns/delayed-timeout-task-as-crash-safety-net — the happy-path-no-overhead mechanism.
- patterns/saga-over-long-transaction — sibling compensation-action shape at the database-transaction altitude.
- patterns/checkpoint-resumable-fiber — sibling durable-execution pattern at the actor-in-fiber altitude (Cloudflare Project Think).
- systems/temporal — external orchestration cluster that Skipper explicitly contrasts itself against.
- systems/cadence — Temporal's predecessor; same class.
- systems/aws-step-functions — managed-service alternative that Skipper explicitly rejects for vendor lock-in and Tier 0 dependency reasons.