Skipper (Airbnb)¶
Skipper is Airbnb's internal Java/Kotlin workflow engine for durable execution, disclosed in "Skipper: Building Airbnb's embedded workflow engine" (Ricardo Gamba and Andriy Sergiyenko, Airbnb Engineering, 2026-04-28). Unlike external orchestration clusters (Temporal, Cadence, AWS Step Functions), Skipper is embedded as a library inside each host service, sharing the service's lifecycle and its existing database (MySQL or Airbnb's internal Unified Data Store). (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
What it is¶
A workflow engine packaged as a library dependency, not a deployable service. Add the artifact to a host service's build, configure it against the service's existing database, and start defining workflows as plain Java/Kotlin classes. The engine runs in the host service's JVM on a dedicated thread pool, with no separate cluster to operate, deploy, or page on.
Programming model¶
Two abstractions:
- Workflow — a class extending Workflow. One @WorkflowMethod holds the end-to-end orchestration logic. Fields annotated @StateField (or @StateParam) persist across replays. Methods annotated @SignalMethod let external events mutate state; waitUntil { cond } is a first-class durable hibernation primitive with an optional timeout.
- Actions — a class extending Actions. Methods annotated @Execute(checkpoint = true) are checkpointed on success: the result survives crashes and returns instantly on replay. Methods annotated @Compensate pair undo logic with the action it reverses.
Invocation is a typed, codegen-free call:
Canonical programming-model example (from the post):
```kotlin
import java.time.Duration

class ListingPublicationWorkflow : Workflow() {
    private val actions = actions<ListingActions>()

    // null means "not reviewed yet"; var so the signal method below can set it.
    @StateField var photosApproved: Boolean? = null

    @WorkflowMethod
    suspend fun publishListing(submission: ListingSubmission): PublicationResult {
        val reviewId = actions.submitPhotosForReview(submission.getListingId())

        // Hibernate until the review signal arrives or 24 hours pass.
        val reviewTimedOut = waitUntil({ photosApproved != null }, Duration.ofHours(24))
        if (reviewTimedOut || photosApproved != true) {
            actions.notifyHost(submission.getHostId(), "Photos require updates")
            return PublicationResult.rejected("Photo review failed")
        }

        actions.activateListing(submission.getListingId())
        actions.notifyHost(submission.getHostId(), "Your listing is now live!")
        return PublicationResult.success(submission.getListingId())
    }

    // External event: the photo review service reports its decision.
    @SignalMethod
    fun completePhotoReview(approved: Boolean) {
        photosApproved = approved
    }
}
```
The post's key pitch: "This code reads naturally: submit photos, wait for photo review, publish. There's no retry logic, queue management, or async coordination visible in the workflow itself." Canonical instance of patterns/workflow-primitives-as-annotated-classes.
Durability mechanism — replay from checkpointed actions¶
"When a workflow starts, Skipper executes the workflow method and checkpoints each action's result to the database. If the workflow needs to wait (via
waitUntil), Skipper persists the current state and the workflow hibernates, consuming no compute resources. When conditions change — a signal arrives, a timer expires, or the service restarts — Skipper replays the workflow method from the beginning. Previously executed actions don't re-execute; they return their checkpointed results instantly. The workflow picks up from where it left off." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Distinct from event-sourced orchestrators (Temporal's event-history model): Skipper "persists state fields directly. There's no event log to replay, just current state and checkpointed action results. This makes execution leaner, especially for workflows with many signals or long histories, though it trades some auditability for that efficiency." See concepts/workflow-replay-from-checkpointed-actions.
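A minimal sketch of the replay idea, not Skipper's implementation. ReplayContext, Hibernate, WorkflowState, publish and runOnce are hypothetical names; checkpoints live in an in-memory map keyed by call order rather than in a database, and hibernation is modelled by an exception. It shows the first pass executing one action and then hibernating, and the second pass returning the checkpointed result without re-executing it.

```kotlin
// Hypothetical names throughout; this is an illustration, not Skipper's API.

/** Thrown to unwind the workflow method when a wait condition is not yet met. */
class Hibernate : RuntimeException()

/** Persisted state field, analogous to a @StateField. */
class WorkflowState(var approved: Boolean? = null)

/** One pass over the workflow method. `checkpoints` stands in for the database. */
class ReplayContext(private val checkpoints: MutableMap<Int, String>) {
    private var seq = 0
    var executed = 0 // number of actions actually run during this pass
        private set

    fun action(body: () -> String): String {
        val key = seq++
        checkpoints[key]?.let { return it } // replay: return the stored result
        val result = body()
        executed++
        checkpoints[key] = result           // checkpoint on success
        return result
    }

    fun waitUntil(cond: () -> Boolean) {
        if (!cond()) throw Hibernate()
    }
}

/** The workflow method: deterministic given checkpoints and state fields. */
fun publish(ctx: ReplayContext, state: WorkflowState): String {
    val reviewId = ctx.action { "review-1" }
    ctx.waitUntil { state.approved != null }
    return if (state.approved == true) ctx.action { "live:$reviewId" } else "rejected"
}

/** Runs the workflow from the beginning; returns (result or null if hibernated, actions executed). */
fun runOnce(checkpoints: MutableMap<Int, String>, state: WorkflowState): Pair<String?, Int> {
    val ctx = ReplayContext(checkpoints)
    return try {
        publish(ctx, state) to ctx.executed
    } catch (e: Hibernate) {
        null to ctx.executed
    }
}
```

Keying checkpoints by call order is one simple way to match a replayed call to its stored result, which is also why the workflow method has to call actions in the same order on every pass.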
The happy path — near-zero overhead¶
On workflow start, Skipper performs two database writes: persist the workflow instance row, and schedule a delayed timeout task as a durability safety net. Then the workflow executes entirely in-process on a dedicated thread pool with an in-memory action queue. Checkpoints are batched. If the process crashes mid-workflow, "the persistent scheduler picks up the workflow after a lease period expires and replays it." If the workflow completes normally, the timeout task fires harmlessly and is discarded. (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
"The engine is only called into action when something goes
wrong: a crash triggers a replay, a waitUntil hibernates the
workflow, or an error invokes compensation." Canonicalised at
patterns/delayed-timeout-task-as-crash-safety-net.
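The start-time writes and the safety-net task can be sketched as follows. This is an illustration under stated assumptions: SafetyNetStore, Engine, leaseMs and poll are hypothetical names, the "database" is two in-memory maps, and time is passed in explicitly so the behaviour is easy to follow. Rescheduling a fresh timeout task for a replayed workflow is left out.

```kotlin
// Hypothetical names; a sketch of the pattern, not Skipper's scheduler.
enum class Status { RUNNING, COMPLETED }

class SafetyNetStore {
    val instances = mutableMapOf<String, Status>()  // workflow instance rows
    val timeoutTasks = mutableMapOf<String, Long>() // workflowId -> time (ms) the task becomes due
    var writes = 0
}

class Engine(private val store: SafetyNetStore, private val leaseMs: Long) {
    /** Two writes on start: the instance row and the delayed timeout task. */
    fun start(id: String, nowMs: Long) {
        store.instances[id] = Status.RUNNING
        store.writes++
        store.timeoutTasks[id] = nowMs + leaseMs
        store.writes++
    }

    /** Normal completion leaves the timeout task alone; it is discarded when it fires. */
    fun complete(id: String) {
        store.instances[id] = Status.COMPLETED
    }

    /** Scheduler tick: fire due tasks and return the workflows that need a replay. */
    fun poll(nowMs: Long): List<String> {
        val due = store.timeoutTasks.filterValues { it <= nowMs }.keys.toList()
        val toReplay = mutableListOf<String>()
        for (id in due) {
            store.timeoutTasks.remove(id)
            if (store.instances[id] == Status.RUNNING) toReplay += id // crashed or stuck
        }
        return toReplay
    }
}
```

In this sketch a workflow that finished before its lease expired costs nothing beyond the two initial writes, while one that is still marked RUNNING when the task fires is handed back for replay.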
Determinism as correctness invariant¶
"Replay imposes one key constraint: workflow methods must be deterministic. Given the same inputs, checkpointed action results, and state fields, the workflow must make the same decisions and call actions in the same order. All side effects, such as API calls, time-dependent logic, and randomness, belong in actions, never in the workflow method directly." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
See concepts/workflow-determinism-requirement.
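To make the constraint concrete, the sketch below contrasts a workflow that reads the clock directly with one that reads it through a checkpointed action. ActionLog, unsafeWorkflow and safeWorkflow are hypothetical names, and the log is a plain in-memory list; none of this is Skipper's API.

```kotlin
// Hypothetical names; illustrates why side effects belong in actions.

/** Minimal checkpoint log: the first run records action results, replays return them in order. */
class ActionLog {
    private val results = mutableListOf<Long>()
    private var cursor = 0

    /** Start a new pass over the workflow method. */
    fun rewind() {
        cursor = 0
    }

    fun action(body: () -> Long): Long {
        if (cursor < results.size) return results[cursor++] // replay: stored value
        val value = body()
        results += value
        cursor++
        return value
    }
}

// Non-deterministic: the branch depends on a clock read made in the workflow itself,
// so a replay at a later time can take a different branch than the original run.
fun unsafeWorkflow(clock: () -> Long): String =
    if (clock() < 100) "early-path" else "late-path"

// Deterministic: the clock read is an action, so a replay sees the checkpointed value.
fun safeWorkflow(log: ActionLog, clock: () -> Long): String {
    val startedAt = log.action { clock() }
    return if (startedAt < 100) "early-path" else "late-path"
}
```

Running both at time 50 and replaying at time 500 gives "early-path" then "late-path" for the unsafe version, but "early-path" both times for the safe one.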
Compensation¶
The @Compensate annotation pairs each action with a method
that undoes its effect. If an action fails after earlier
actions have committed, Skipper "automatically executes
compensation methods in reverse order (releasing held
inventory, refunding charges, reverting state changes),
walking the system back to a consistent state." (Source:
sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Language-level realisation of the
saga compensating-action
idiom. Canonicalised at
concepts/workflow-compensation-action.
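The reverse-order walk-back can be sketched in a few lines. Step, runSaga and the journal are hypothetical names for illustration; the sketch assumes only steps that completed successfully are compensated, which the source does not spell out.

```kotlin
// Hypothetical names; a sketch of saga-style compensation, not Skipper's engine.

/** An action paired with the logic that undoes it. */
class Step(val name: String, val execute: () -> Unit, val compensate: () -> Unit)

/**
 * Runs steps in order. On the first failure, compensates the completed steps
 * in reverse order and returns false. `journal` records what happened.
 */
fun runSaga(steps: List<Step>, journal: MutableList<String>): Boolean {
    val done = mutableListOf<Step>()
    for (step in steps) {
        try {
            step.execute()
            journal += "did ${step.name}"
            done += step
        } catch (e: Exception) {
            journal += "failed ${step.name}"
            for (completed in done.asReversed()) { // most recent first
                completed.compensate()
                journal += "undid ${completed.name}"
            }
            return false
        }
    }
    return true
}
```

With steps reserve, charge and ship, a failure in ship undoes charge first and reserve second, mirroring the "releasing held inventory, refunding charges" order in the quote.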
Error handling¶
Skipper distinguishes:
- Retryable errors (network timeouts, transient backend failures) — retried automatically with configurable backoff.
- Non-retryable errors (declined card, business-rule rejection) — halt the workflow's normal flow and invoke compensation walk-back.
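One way to express the two error classes is with distinct exception types and a retry helper. RetryableException, NonRetryableException and runWithRetry are hypothetical names, and exponential backoff is an assumption; the source only says backoff is configurable.

```kotlin
// Hypothetical names; a sketch of the retryable / non-retryable split.
class RetryableException(message: String) : RuntimeException(message)
class NonRetryableException(message: String) : RuntimeException(message)

/**
 * Retries `action` on RetryableException with exponential backoff.
 * Any other exception, including NonRetryableException, propagates at once
 * so the caller can start compensation.
 */
fun <T> runWithRetry(
    maxAttempts: Int,
    baseDelayMs: Long,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    action: () -> T
): T {
    var lastError: RetryableException? = null
    for (attempt in 1..maxAttempts) {
        try {
            return action()
        } catch (e: RetryableException) {
            lastError = e
            // Delays grow as base, 2x base, 4x base, ...
            if (attempt < maxAttempts) sleep(baseDelayMs shl (attempt - 1))
        }
    }
    throw lastError ?: IllegalArgumentException("maxAttempts must be at least 1")
}
```

The sleep parameter is injectable so the backoff schedule can be observed without real waiting; a declined card modelled as NonRetryableException is attempted exactly once.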
Persistence backends¶
- MySQL — for services whose primary store is MySQL.
- Unified Data Store (UDS) — Airbnb's internal data store (see systems/airbnb-uds).
- DynamoDB — the backing store behind the 10,000-workflows-per-second peak figure ("At peak, Skipper has scaled to 10,000 workflows per second on Amazon DynamoDB, enabled by its lean execution model.").
Each service's Skipper deployment uses whichever of these the host service already depends on; Skipper adds no new data stores or failure modes.
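The source does not describe Skipper's storage interface. As a rough sketch of what a pluggable backend could look like, the hypothetical WorkflowStore below covers the two things the engine is said to persist (state fields and checkpointed action results), with an in-memory implementation standing in for MySQL, UDS or DynamoDB.

```kotlin
// Hypothetical interface; the real backend abstraction is not described in the source.
interface WorkflowStore {
    fun saveState(workflowId: String, stateFields: Map<String, String>)
    fun loadState(workflowId: String): Map<String, String>?
    fun saveCheckpoint(workflowId: String, actionSeq: Int, result: String)
    fun loadCheckpoint(workflowId: String, actionSeq: Int): String?
}

/** In-memory stand-in; a real backend would map these calls to rows or items. */
class InMemoryWorkflowStore : WorkflowStore {
    private val states = mutableMapOf<String, Map<String, String>>()
    private val checkpoints = mutableMapOf<Pair<String, Int>, String>()

    override fun saveState(workflowId: String, stateFields: Map<String, String>) {
        states[workflowId] = stateFields.toMap() // defensive copy
    }

    override fun loadState(workflowId: String): Map<String, String>? = states[workflowId]

    override fun saveCheckpoint(workflowId: String, actionSeq: Int, result: String) {
        checkpoints[workflowId to actionSeq] = result
    }

    override fun loadCheckpoint(workflowId: String, actionSeq: Int): String? =
        checkpoints[workflowId to actionSeq]
}
```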
Production footprint¶
At time of the post (2026-04-28):
- In production for more than a year.
- 15+ use cases across insurance, payments, media, infrastructure, incentives, and wallet teams.
- Specific teams named: Media Foundation (video-processing pipelines surviving pod restarts across multi-hour jobs); Infrastructure (durable Flink job-lifecycle management + data pipeline CRUD); insurance (multi-step claim processing); payments (resilient transaction orchestration, scheduled financial operations spanning days or weeks); wallet.
- 10,000 workflows per second at peak, on DynamoDB.
Five design principles¶
- Succinct ergonomics — workflow code reads like the business logic it represents.
- No single point of failure — embedded per service, no central coordinator.
- Leverage existing dependencies — uses the host service's existing database.
- Self-service ready — library dependency, no central team engagement required.
- Performance-neutral — separate thread pools, configurable concurrency limits, efficient hibernation to coexist with latency-sensitive request handling.
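The performance-neutral principle (a separate thread pool plus a concurrency limit) can be approximated with standard java.util.concurrent primitives. WorkflowExecutor and trySubmit are hypothetical names, and returning null at the limit is one possible policy; how Skipper actually handles a full pool is not stated in the source.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.Future
import java.util.concurrent.Semaphore

// Hypothetical wrapper; a sketch of isolating workflow work from request threads.
class WorkflowExecutor(threads: Int, maxConcurrent: Int) {
    // Dedicated pool, so workflow execution never borrows request-handling threads.
    private val pool = Executors.newFixedThreadPool(threads)
    private val permits = Semaphore(maxConcurrent)

    /** Returns null when the concurrency limit is reached, so the caller can defer the workflow. */
    fun <T> trySubmit(body: () -> T): Future<T>? {
        if (!permits.tryAcquire()) return null
        return pool.submit(Callable<T> {
            try {
                body()
            } finally {
                permits.release() // free the slot whether the body succeeded or threw
            }
        })
    }

    fun shutdown() = pool.shutdown()
}
```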
Tradeoffs¶
- Determinism requirement — unintuitive for developers new to the pattern. Non-determinism in the workflow method (clock reads, random numbers, direct API calls) breaks replay correctness.
- At-least-once action execution — "Actions may execute more than once in edge cases (crash after execution but before checkpoint). Actions should be idempotent." The application is responsible for idempotency.
- Workflow evolution complexity — changing a workflow's structure can break in-flight workflows. Versioning patterns exist (new method versions, traffic migration, deprecation) but tooling is acknowledged as the biggest friction point.
- No cross-language support — JVM-only.
- No cross-service orchestration — a workflow lives in one service's process; multi-service sagas require explicit API calls in actions or a separate orchestration layer.
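For the at-least-once tradeoff, a common way to make an action idempotent is to pass a stable idempotency key to the downstream system. The sketch below is a generic illustration: PaymentGateway, charge and the key format are hypothetical and not part of Skipper.

```kotlin
// Hypothetical downstream service that deduplicates on an idempotency key.
class PaymentGateway {
    private val processed = mutableMapOf<String, String>() // idempotency key -> charge id
    var chargesMade = 0
        private set

    /** Charges once per key; a repeated call with the same key returns the original charge id. */
    fun charge(idempotencyKey: String, amountCents: Long): String =
        processed.getOrPut(idempotencyKey) {
            chargesMade++
            "charge-$chargesMade (${amountCents}c)"
        }
}
```

If the process crashes after the charge but before the checkpoint is written, the replay calls charge again with the same key (for example one derived from the workflow id and the action name) and receives the original charge id instead of charging twice.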
"These tradeoffs are inherent to the embedded model; teams needing cross-language support or cross-service orchestration may find a dedicated orchestration system more appropriate." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Contrast with external orchestration clusters¶
- Temporal / Cadence — run as dedicated clusters with their own persistence layer (concepts/temporal-persistence-layer); event-history replay model with full auditability; cross-language workers; network round-trips per activity for checkpoint coordination.
- AWS Step Functions — managed serverless workflows defined in Amazon States Language; no ops burden but vendor lock-in + regulatory constraints.
- Skipper — library; shares host DB; state-field replay (no event log); JVM-only; in-process execution with zero coordination round-trips on the happy path.
Airbnb's explicit rationale for not using Temporal / Cloud-managed workflows: "For our highest-criticality, 'Tier 0' services (services which directly impact user-facing transactions), adding a new critical dependency was problematic. An orchestration cluster outage would mean every dependent service would lose the ability to start or advance workflows." (Source: sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine.)
Lessons learned (2026)¶
- Worked well: embedded model reduces operational burden; multiple storage backends remove adoption blockers; simple API ("actions are checkpointed, workflows must be deterministic") accelerates learning.
- Would reconsider: workflow evolution as the biggest friction point; better tooling for compatibility checking, migration assistants, runtime versioning. Debugging replayed workflows requires a mental-model shift — "engineers must understand that log timestamps and call sequences reflect replays, not original execution"; replay-visualisation tooling would help.
Seen in¶
- sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine — canonical disclosure. Five design principles, Workflow/Action programming model with @WorkflowMethod / @StateField / @SignalMethod / @Execute(checkpoint = true) / @Compensate annotations, replay mechanism, happy-path delayed-timeout-task safety net, determinism invariant, compensation walk-back, MySQL + UDS + DynamoDB backends, 15+ production use cases, 10,000 workflows/s peak, multi-hour media jobs.
Related¶
- companies/airbnb — origin organisation.
- concepts/durable-execution — the parent property Skipper operationalises.
- concepts/embedded-workflow-engine — the architectural primitive Skipper canonicalises.
- concepts/workflow-replay-from-checkpointed-actions — the durability mechanism.
- concepts/workflow-determinism-requirement — invariant required by replay.
- concepts/workflow-compensation-action — @Compensate primitive.
- concepts/workflow-signal — @SignalMethod primitive.
- concepts/fault-tolerant-long-running-workflow — class of problem Skipper solves.
- patterns/workflow-primitives-as-annotated-classes — the programming-model pattern.
- patterns/delayed-timeout-task-as-crash-safety-net — happy-path zero-overhead mechanism.
- patterns/saga-over-long-transaction — sibling compensation idiom at the database-transaction altitude.
- patterns/checkpoint-resumable-fiber — sibling durable-execution pattern in a different ecosystem (Cloudflare Project Think, actor-in-fiber rather than library-in-service).
- systems/temporal — external-cluster alternative that Skipper explicitly contrasts itself against.
- systems/cadence — Temporal's predecessor.
- systems/aws-step-functions — managed-service alternative explicitly rejected for vendor lock-in + Tier 0 dependency.
- systems/mysql / systems/airbnb-uds / systems/dynamodb — pluggable persistence backends.