
AIRBNB 2026-04-28 Tier 2


Airbnb — Skipper: Building Airbnb's embedded workflow engine

Summary

Airbnb Engineering (Ricardo Gamba and Andriy Sergiyenko, 2026-04-28) describes Skipper, a Java/Kotlin library that provides durable execution as an embedded library inside each service rather than as a separate orchestration cluster like Temporal or Cadence. Workflows and their side-effectful actions are plain annotated classes; Skipper persists state in the host service's existing database (MySQL or Airbnb's internal Unified Data Store), replays the workflow method from the start on crash while returning previously checkpointed action results instantly, and exposes compensation, signals, and durable waitUntil as first-class primitives. The post's core framing is that external orchestration clusters add a Tier 0 dependency that the authors wanted to avoid; the embedded model trades cross-language support and cross-service orchestration for operational simplicity. Skipper has run in production for more than a year, powers 15+ use cases across insurance, payments, media, infrastructure, incentives, and wallet teams, and at peak has scaled to 10,000 workflows per second on Amazon DynamoDB.

Key takeaways

  • Durable execution without a new Tier 0 dependency. Airbnb explicitly rejected external orchestration engines because "adding a new critical dependency was problematic. An orchestration cluster outage would mean every dependent service would lose the ability to start or advance workflows" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Cloud-managed workflow services were rejected for vendor lock-in and the same critical-dependency concern. Homegrown queue-based systems "avoid external dependencies but trade them for bespoke complexity: each team implementing and maintaining its own retry logic, state management, and compensation flows." Skipper is the middle option: an embedded library sharing the host service's lifecycle.

  • The domain-logic fragmentation argument. "When teams wire up multi-step processes using queues or ad-hoc async plumbing, the domain logic ends up fragmented. A single business workflow, such as processing an insurance claim, gets scattered across queue consumers, scheduled jobs, callback endpoints, and reconciliation scripts. There's no single place in the code where you can read what the business process actually does." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Skipper's programming model collapses a business workflow into one class whose @WorkflowMethod reads like the process it represents.

  • Replay is the durability mechanism. "When a workflow starts, Skipper executes the workflow method and checkpoints each action's result to the database. If the workflow needs to wait (via waitUntil), Skipper persists the current state and the workflow hibernates, consuming no compute resources. When conditions change — a signal arrives, a timer expires, or the service restarts — Skipper replays the workflow method from the beginning. Previously executed actions don't re-execute; they return their checkpointed results instantly." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonical instance of replay from checkpointed actions.

  • State fields, not event logs. "Unlike event-sourced orchestration systems that reconstruct state by replaying an entire event history, Skipper persists state fields directly. There's no event log to replay, just current state and checkpointed action results. This makes execution leaner, especially for workflows with many signals or long histories, though it trades some auditability for that efficiency" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Divergence from Temporal's event-history-replay model.

  • The happy path has near-zero overhead. Skipper's pitch is that most workflow engines impose overhead on every execution. "External orchestration engines require network round-trips to a central cluster for every activity invocation — the worker executes the activity, then calls back to the cluster to persist the result before the workflow can advance. This is fundamental to their architecture; the cluster is the coordinator." In contrast: "When a workflow starts, two things happen at the database level: the workflow instance is created, and a delayed timeout task is scheduled as a durability guarantee. Then the workflow executes entirely in-process. Actions run as normal method calls on an in-memory execution queue on a dedicated thread pool, checkpoints are batched, and the workflow can run to completion without any further coordination." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonicalises delayed timeout task as crash safety net.

  • Determinism is a replay correctness invariant. "Replay imposes one key constraint: workflow methods must be deterministic. Given the same inputs, checkpointed action results, and state fields, the workflow must make the same decisions and call actions in the same order. All side effects, such as API calls, time-dependent logic, and randomness, belong in actions, never in the workflow method directly" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonical framing of the determinism requirement for replay.

  • Compensation as a first-class primitive. The @Compensate annotation pairs each action with an undo method; on non-retryable failure Skipper "automatically executes compensation methods in reverse order (releasing held inventory, refunding charges, reverting state changes), walking the system back to a consistent state. Developers express what 'undo' means for each action; Skipper handles the orchestration of when and in what order the undos run. The result is eventual consistency without distributed transactions, and workflow code that stays focused on the business process rather than cleanup choreography" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). First-class language-level realisation of the saga compensating-action idiom.

  • Signals decouple external events from workflow progress. "Signals (@SignalMethod) let external events push data into a running workflow, updating @StateField fields that the workflow's waitUntil conditions evaluate against" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). A workflow that "waits for approval, then processes payment, then sends confirmation" reads linearly; the external event stream updates state without the workflow author touching queue plumbing.

  • The programming model is deliberately minimal. "By exposing workflows and actions as plain Java/Kotlin classes, with a minimal, annotation-based contract, Skipper enables developers to write business logic that looks like business logic, not framework boilerplate … Skipper isn't the first system to offer 'write normal code, get durable execution' — other workflow engines do as well — but Skipper's focus is on removing the adoption friction: fewer required constructs and less setup, so teams that use Java/Kotlin can get to a first durable workflow with minimal ceremony" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Canonicalises workflow primitives as annotated classes.

  • Five design principles. "Succinct ergonomics. No single point of failure. Leverage existing dependencies. Self-service ready. Performance-neutral" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Performance-neutrality via "separate thread pools, configurable concurrency limits, and efficient hibernation patterns to coexist peacefully with latency-sensitive request handling" — the guarantee that adopting Skipper doesn't compete with the host service's request path.

  • Embedded-model tradeoffs. "The replay model requires deterministic workflow methods, which can be unintuitive for developers new to the pattern … Actions may execute more than once in edge cases (crash after execution but before checkpoint). Actions should be idempotent … Changing a workflow's structure can break in-flight workflows. Teams need versioning strategies for workflow evolution" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). At-least-once action execution is an explicit property; idempotency is the application's responsibility.

  • Production impact: 15+ use cases, 10 000 workflows / sec on DynamoDB. "Skipper has been running in production for more than a year, powering 15+ use cases across insurance, payments, media, infrastructure, incentives, and wallet teams … The Media Foundation team uses Skipper to coordinate video processing pipelines — validation, transcoding, thumbnail generation — surviving pod restarts across multi-hour jobs. Infrastructure teams rely on it for durable Flink job lifecycle management and reliable data pipeline CRUD operations … At peak, Skipper has scaled to 10,000 workflows per second on Amazon DynamoDB, enabled by its lean execution model" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine).

  • What the authors would reconsider. "Workflow evolution remains the biggest friction point. While we have versioning patterns (create new method versions, migrate traffic, deprecate old versions), better tooling — automated compatibility checking, migration assistants, runtime versioning support — would smooth the experience. Debugging replayed workflows also requires mental model adjustment: engineers must understand that log timestamps and call sequences reflect replays, not original execution. Better observability tooling, particularly replay visualization, would help" (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine). Honest operator-experience caveats on top of the happy-path pitch.
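The replay and determinism takeaways above can be illustrated with a toy model. This is a minimal sketch, not Skipper's implementation: `CheckpointStore`, `ReplayContext`, and `action()` are hypothetical names, the "database" is an in-memory map, and checkpoints are keyed by call order, which is one simple way to see why a non-deterministic workflow body would pair actions with the wrong stored results.

```kotlin
// Toy model of replay from checkpointed actions. NOT Skipper's API:
// CheckpointStore, ReplayContext and action() are hypothetical names.
class CheckpointStore {
    private val results = mutableMapOf<Int, Any?>()
    fun has(step: Int): Boolean = results.containsKey(step)
    fun get(step: Int): Any? = results[step]
    fun put(step: Int, value: Any?) { results[step] = value }
}

class ReplayContext(private val store: CheckpointStore) {
    private var step = 0

    // Returns the stored result if this step was checkpointed;
    // otherwise runs the body and stores its result.
    @Suppress("UNCHECKED_CAST")
    fun <T> action(body: () -> T): T {
        val index = step++
        if (store.has(index)) return store.get(index) as T
        val result = body()
        store.put(index, result)
        return result
    }
}

// Runs the workflow once with a simulated crash, then replays it from the beginning.
fun runWithCrashAndReplay(): Pair<String, List<String>> {
    val store = CheckpointStore()           // stands in for the host service's database
    val executed = mutableListOf<String>()  // records side effects that really ran
    var crashPending = true                 // test hook that simulates a process crash

    // Deterministic body: the same checkpoints always yield the same action order.
    fun workflow(ctx: ReplayContext): String {
        val hold = ctx.action { executed.add("reserve"); "hold-1" }
        if (crashPending) { crashPending = false; throw IllegalStateException("simulated crash") }
        val charge = ctx.action { executed.add("charge"); "charge-1" }
        return "$hold/$charge"
    }

    try { workflow(ReplayContext(store)) } catch (e: IllegalStateException) { /* process died */ }
    val result = workflow(ReplayContext(store)) // replay: "reserve" is served from its checkpoint
    return result to executed
}

fun main() {
    val (result, executed) = runWithCrashAndReplay()
    println(result)    // hold-1/charge-1
    println(executed)  // [reserve, charge]
}
```

In this sketch the "reserve" side effect runs once even though the workflow body runs twice, which is the property the post describes as previously executed actions returning their checkpointed results.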

Programming model — the Workflow / Action split

Skipper's two abstractions:

  • Workflows — orchestration logic. Plain Java/Kotlin class extending Workflow. The method annotated @WorkflowMethod holds the end-to-end business process. Fields annotated @StateField (or @StateParam) persist across replays. Methods annotated @SignalMethod let external events mutate state.
  • Actions — side-effectful operations (API calls, DB writes, notifications). Class extends Actions. Methods annotated @Execute(checkpoint = true) are checkpointed: "the result of an action survives crashes and restarts." Methods annotated @Compensate are the undo pair.
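The execute/compensate pairing can be sketched as a small saga runner. The post does not show how Skipper wires @Compensate internally, so everything here is an assumption for illustration: `Saga`, `step()`, `compensateAll()`, and `NonRetryableFailure` are hypothetical names, and the undo stack lives in memory rather than in a database.

```kotlin
// Toy saga runner showing reverse-order compensation. Hypothetical names;
// this is not how Skipper's @Compensate annotation is implemented.
class NonRetryableFailure(message: String) : RuntimeException(message)

class Saga {
    private val undoStack = ArrayDeque<() -> Unit>()

    // Runs `execute`; only after it succeeds is `compensate` recorded as its undo.
    fun <T> step(execute: () -> T, compensate: () -> Unit): T {
        val result = execute()
        undoStack.addLast(compensate)
        return result
    }

    // Runs the recorded undos newest-first.
    fun compensateAll() {
        while (undoStack.isNotEmpty()) undoStack.removeLast().invoke()
    }
}

// Returns true if the booking completed, false if it was rolled back.
fun runBooking(events: MutableList<String>): Boolean {
    val saga = Saga()
    return try {
        saga.step({ events.add("hold inventory") }, { events.add("release inventory") })
        saga.step({ events.add("charge card") }, { events.add("refund charge") })
        // The third action fails, so its own compensation is never registered.
        saga.step<Unit>({ throw NonRetryableFailure("confirmation rejected") }, { events.add("never runs") })
        true
    } catch (e: NonRetryableFailure) {
        saga.compensateAll()
        false
    }
}

fun main() {
    val events = mutableListOf<String>()
    println(runBooking(events)) // false
    println(events)             // [hold inventory, charge card, refund charge, release inventory]
}
```

The design point carried over from the post is the division of labour: each action states what "undo" means, and the runner decides when and in what order the undos execute.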

A workflow is invoked from outside as a typed, codegen-free call:

val out = workflow<ChargeAndAccept>("reservation:${req.id}").execute(req)

Durable wait is a first-class primitive:

waitUntil { paymentCaptured }     // hibernates until signal arrives
waitUntil({ photosApproved != null }, Duration.ofHours(24))  // with timeout

(Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
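How a signal, a state field, and a durable wait interact can be sketched with a toy model. This is an illustration under stated assumptions, not Skipper's code: `ApprovalWorkflow`, `Hibernate`, and `replay()` are hypothetical, state is held in memory instead of a database, and hibernation is modelled by unwinding the workflow method with an exception.

```kotlin
// Toy model of a durable wait driven by a signal. Hypothetical names throughout.
class Hibernate : RuntimeException()

class ApprovalWorkflow {
    var approved = false                       // plays the role of a @StateField
    val sideEffects = mutableListOf<String>()  // actions that really executed
    private val checkpoints = mutableMapOf<String, String>()

    // Checkpointed action: the body runs at most once per name in this toy.
    private fun action(name: String, body: () -> String): String =
        checkpoints.getOrPut(name) { sideEffects.add(name); body() }

    private fun waitUntil(condition: () -> Boolean) {
        if (!condition()) throw Hibernate()    // stop here; a later replay re-checks the condition
    }

    // Plays the role of the @WorkflowMethod: reads top to bottom like the business process.
    fun run(): String {
        val request = action("submit") { "req-42" }
        waitUntil { approved }
        return action("pay") { "paid:$request" }
    }

    // Plays the role of a @SignalMethod: an external event mutates state.
    fun approve() { approved = true }
}

// One engine pass: replay from the beginning; null means the workflow is hibernating.
fun replay(workflow: ApprovalWorkflow): String? =
    try { workflow.run() } catch (e: Hibernate) { null }

fun main() {
    val workflow = ApprovalWorkflow()
    println(replay(workflow))      // null
    workflow.approve()             // the signal arrives
    println(replay(workflow))      // paid:req-42
    println(workflow.sideEffects)  // [submit, pay]
}
```

The second pass re-enters `run()` from the top, but "submit" is served from its checkpoint, so only "pay" executes after the signal flips the state field.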

Architecture — the happy path

At workflow start, Skipper does two database writes:

  1. Persist the workflow instance row.
  2. Schedule a delayed timeout task as a durability safety net.

Then execution runs entirely in-process on a dedicated thread pool with an in-memory execution queue. Checkpoints are batched. If the process crashes mid-workflow, "the persistent scheduler picks up the workflow after a lease period expires and replays it." If the workflow completes normally, the timeout task fires harmlessly and is discarded. (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

This means Skipper adds "very little overhead (just a few database writes)" on the happy path — the engine only earns its keep on crash, waitUntil hibernation, or compensation.
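The role of the delayed timeout task as a crash safety net can be sketched as follows. This is a simplified model under assumptions: `ToyStore`, `ToyEngine`, `poll()`, and the 30-second lease are hypothetical, time is passed in explicitly to keep the example deterministic, and the real scheduler, leasing, and checkpoint batching are not modelled.

```kotlin
enum class Status { RUNNING, COMPLETED }

data class TimeoutTask(val workflowId: String, val dueAtMillis: Long)

// In-memory stand-in for the host service's database.
class ToyStore {
    val instances = mutableMapOf<String, Status>()
    val timeoutTasks = mutableListOf<TimeoutTask>()
}

class ToyEngine(private val store: ToyStore, private val leaseMillis: Long) {
    // The two writes at workflow start: the instance row and the delayed timeout task.
    fun start(workflowId: String, nowMillis: Long) {
        store.instances[workflowId] = Status.RUNNING
        store.timeoutTasks.add(TimeoutTask(workflowId, nowMillis + leaseMillis))
    }

    // Called when the in-process execution finishes normally.
    fun complete(workflowId: String) {
        store.instances[workflowId] = Status.COMPLETED
    }

    // Scheduler pass: consume due tasks and return the workflows that still need a replay.
    // Tasks for completed workflows are simply dropped.
    fun poll(nowMillis: Long): List<String> {
        val due = store.timeoutTasks.filter { it.dueAtMillis <= nowMillis }
        store.timeoutTasks.removeAll(due)
        return due.map { it.workflowId }
            .filter { store.instances[it] == Status.RUNNING }
    }
}

fun main() {
    val store = ToyStore()
    val engine = ToyEngine(store, leaseMillis = 30_000)

    engine.start("wf-ok", 0)
    engine.complete("wf-ok")       // happy path: finished in-process
    engine.start("wf-crashed", 0)  // the process dies before completion

    println(engine.poll(10_000))   // []
    println(engine.poll(30_000))   // [wf-crashed]
}
```

In this model the timeout task for "wf-ok" still comes due, but it is discarded because the instance is already complete, while "wf-crashed" is handed back for replay once the lease period has passed.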

Operational numbers

  • 15+ use cases in production (insurance, payments, media, infrastructure, incentives, wallet).
  • More than one year in production at time of writing (2026-04-28).
  • Peak of 10,000 workflows per second, on Amazon DynamoDB.
  • Multi-hour Media Foundation video-processing jobs survive pod restarts.
  • Workflow lifetimes of days to weeks for scheduled financial operations (claim processing, policy lifecycle).

(Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)

Caveats

  • No public open-source release is claimed in the post. Skipper is an internal Airbnb library; this is a design-disclosure post, not a release announcement. External practitioners can read the ideas but can't adopt the code.
  • Event-log contrast vs Temporal is authorial framing. Skipper persists "state fields directly … no event log to replay"; Temporal maintains full event history. Airbnb explicitly notes "it trades some auditability for that efficiency" — teams needing step-by-step auditability may prefer Temporal's model.
  • No cross-language or cross-service orchestration. The embedded-library shape scopes Skipper to JVM workflows running inside one service. The post names this: "teams needing cross-language support or cross-service orchestration may find a dedicated orchestration system more appropriate."
  • Determinism is operationally unintuitive. The post flags replay-debugging as a mental-model shift; "log timestamps and call sequences reflect replays, not original execution." No public disclosure of what determinism-violation tooling looks like (linter? runtime checker?).
  • Workflow evolution is manual. Versioning patterns ("create new method versions, migrate traffic, deprecate old versions") exist, but the post acknowledges workflow evolution as the biggest friction point and says better tooling would smooth it.
  • At-least-once action execution. Idempotency is the application's responsibility; the post names this tradeoff explicitly ("Actions may execute more than once in edge cases … Actions should be idempotent").
  • No per-use-case operational numbers. The peak of 10,000 workflows per second is cited, but per-team latency budgets, replay frequency, and compensation-rate telemetry are not disclosed.
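The at-least-once caveat is easiest to see in code. The sketch below is one common way an application might make an action idempotent, using a deduplication key derived from the workflow ID; the post does not prescribe this technique, and `PaymentService`, `chargeAction`, and the key format are hypothetical.

```kotlin
// Hypothetical downstream service that deduplicates on an idempotency key.
class PaymentService {
    private val chargesByKey = mutableMapOf<String, String>()
    var chargeCount = 0
        private set

    // A repeated call with the same key returns the original charge instead of charging again.
    fun charge(idempotencyKey: String, amountCents: Long): String =
        chargesByKey.getOrPut(idempotencyKey) {
            chargeCount++
            "charge-$chargeCount-$amountCents"
        }
}

// The action derives a stable key from the workflow ID, so a re-execution after a
// "crash after execution but before checkpoint" maps onto the same downstream charge.
fun chargeAction(payments: PaymentService, workflowId: String, amountCents: Long): String =
    payments.charge("$workflowId:charge", amountCents)

fun main() {
    val payments = PaymentService()
    val first = chargeAction(payments, "reservation:42", 5_000)   // executed, then the process crashes
    val second = chargeAction(payments, "reservation:42", 5_000)  // replay runs the action again
    println(first == second)       // true
    println(payments.chargeCount)  // 1
}
```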

