
Idempotent job design

An idempotent job is one that can be run multiple times for the same input and produces the same end-state as running it once. This discipline is required whenever a job can be enqueued more than once — which in practice means every async-job system with retries, schedulers, or data-loss recovery paths.

Why jobs get enqueued multiple times

In a self-healing job queue architecture, three independent sources can enqueue the same job for the same record:

  1. User-triggered enqueue — the original code path (e.g. user creates a database → DatabaseCreationJob.perform_async(id)).
  2. Scheduler re-enqueue — the paired scheduler sees the record still in pending state and re-enqueues.
  3. Sidekiq retry — a job that raised fails and is retried via the framework's retry policy.

All three can fire for the same record concurrently (or overlapping). Even without a self-healing design, source 3 alone is sufficient to require idempotence: Sidekiq's default is 25 retries with exponential backoff, and a job that succeeds partially then raises will retry from the start.
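A self-contained sketch of the retry-from-start hazard — pure Ruby, no Sidekiq; the names and the rescue-then-retry are illustrative stand-ins for the framework's retry policy:

```ruby
# CHARGES stands in for an external side effect (e.g. a billing call).
CHARGES = []

def perform(id, notify_fails:)
  CHARGES << id                      # side effect 1: always succeeds
  raise "SMTP down" if notify_fails  # side effect 2: fails on the first run
end

begin
  perform(7, notify_fails: true)     # first run: partial success, then raises
rescue RuntimeError
  perform(7, notify_fails: false)    # the retry re-runs from the start
end

CHARGES  # => [7, 7] — the same record was charged twice
```

Without idempotence, every partial failure followed by a retry duplicates whichever side effects had already succeeded.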

Three layers of protection

From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:

"1. Exit quickly — We store state in our database and quickly exit a job if it no longer needs to be run."

"2. Use database locks — We avoid race conditions, such as when multiple jobs are updating the same data at once."

"3. Use sidekiq unique jobs — Sidekiq Enterprise includes the ability to have unique jobs. This will stop a duplicate job from ever being enqueued."

These three compose, each addressing a distinct failure mode:

Layer 1: state re-check at job entry

def perform(id)
  user = User.find(id)
  return unless user.pending?  # exit fast: work already done or no longer needed
  # ... real work
end

Cheapest layer. Protects against: the work having already been done by a previous (successful) run of the same job on the same record. Any subsequent re-enqueue just exits fast. No DB mutation required.

What it doesn't protect against: two workers both reading pending? as true simultaneously and both proceeding past the check. Needs layer 2 for that.

Layer 2: database row lock around mutation

backup.with_lock do
  backup.restore_from_backup!
end

ActiveRecord's with_lock issues SELECT ... FOR UPDATE inside a transaction, serialising the row across workers. Protects against: concurrent workers both entering a critical section and both performing the mutation.

What it doesn't protect against: a re-run of the same job after an earlier run has finished — the lock is released between runs, so the second run repeats the work. Needs layer 1 for that. It also doesn't stop two jobs of the same class for the same record sitting in the queue simultaneously — that needs layer 3.
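Layers 1 and 2 compose as check, lock, re-check. A minimal runnable sketch of that pattern, using a Mutex as a stand-in for the row lock (with_lock / SELECT ... FOR UPDATE) and a plain object as a stand-in for the record — all names here are illustrative:

```ruby
# Record simulates a database row; with_lock simulates SELECT ... FOR UPDATE.
class Record
  attr_reader :state, :run_count

  def initialize
    @state = :pending
    @run_count = 0
    @mutex = Mutex.new
  end

  def with_lock(&blk)
    @mutex.synchronize(&blk)
  end

  def pending?
    @state == :pending
  end

  def complete!
    @run_count += 1
    @state = :done
  end
end

def perform(record)
  return unless record.pending?    # layer 1: cheap exit if work already done
  record.with_lock do
    return unless record.pending?  # re-check under the lock: only the race
    record.complete!               # winner performs the mutation
  end
end

record = Record.new
8.times.map { Thread.new { perform(record) } }.each(&:join)
record.run_count  # => 1, even with 8 concurrent workers
```

The re-check inside the lock is the crucial line: without it, two workers that both passed the layer-1 check would simply serialise on the lock and then both perform the mutation.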

Layer 3: framework-level unique jobs

class CheckDeploymentStatusJob < BaseJob
  sidekiq_options queue: "urgent", retry: 5,
                  unique_for: 1.minute,
                  unique_until: :start
  # ...
end

Sidekiq Enterprise's unique_for rejects a duplicate enqueue at perform_async time if an identical job (same class, same args) is already in the queue. Protects against: the scheduler + user both enqueuing within the same 1-minute window, the scheduler re-enqueuing while a previous instance is still queued, retry storms.

What it doesn't protect against: re-enqueues spaced more than unique_for apart. Needs layer 1 for that. It also doesn't prevent race conditions during perform across different job instances. Needs layer 2.
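A toy in-memory sketch of the unique-jobs idea — not Sidekiq Enterprise's actual implementation (which deduplicates via Redis), just the shape of the check: reject an enqueue when an identical (class, args) pair was enqueued within the window. All names here are invented for illustration:

```ruby
QUEUE  = []
RECENT = {}  # "ClassName:args" => time the uniqueness window expires

def enqueue_unique(job_class, args, unique_for:,
                   now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
  key = "#{job_class}:#{args.inspect}"
  return :rejected if RECENT[key] && RECENT[key] > now  # duplicate in window
  RECENT[key] = now + unique_for
  QUEUE << [job_class, args]
  :enqueued
end

enqueue_unique("CheckDeploymentStatusJob", [42], unique_for: 60)  # => :enqueued
enqueue_unique("CheckDeploymentStatusJob", [42], unique_for: 60)  # => :rejected
QUEUE.length  # => 1
```

Note the two limits called out above are visible even in the toy: a third enqueue more than unique_for later would be accepted, and nothing here constrains what two already-dequeued instances do during perform.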

Why all three, not one

No single layer covers all the cases:

Scenario                                  L1       L2   L3
Job ran successfully; re-enqueued         ✓
Two workers race on same row                       ✓
Queue has two entries for same record                   ✓
Job partially ran; retried from start     depends
Scheduler + user enqueue simultaneously                 ✓

The three layers each close a different set of vulnerabilities. Skipping any of them leaves a class of production bugs open.

Cheap to expensive

Layer 1 (state check) is the cheapest — a single DB read, no locks, no framework overhead. Make every job start with it.

Layer 2 (DB lock) is the most expensive at scale — holding a row lock serialises all workers on that row. Use only around the mutation, not around the whole perform.

Layer 3 (unique jobs) is free at perform time but requires Sidekiq Enterprise (paid). Use selectively for jobs where enqueue storms are expected (e.g. status-check jobs that might be triggered from many different places).

Natural unique keys

All three layers rely on jobs having a natural unique key — usually a single DB row. Jobs dispatched with perform_async(id) where id is a stable record identifier compose well with all three layers.

Jobs with no natural unique key (e.g. send_promo_email(campaign_params)) can't be made idempotent by these three layers; they need explicit deduplication tokens (see concepts/idempotency-token).
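A sketch of that explicit-token approach, assuming the caller either supplies an idempotency key or one is derived from the payload; PROCESSED stands in for a unique-indexed "processed tokens" table, and all names are illustrative:

```ruby
require "digest"
require "set"

PROCESSED = Set.new  # stand-in for a unique-indexed dedup table
SENT      = []

def send_promo_email(campaign_params, idempotency_key: nil)
  # Derive a stable token from the payload when the caller supplies none.
  key = idempotency_key ||
        Digest::SHA256.hexdigest(campaign_params.sort.to_h.to_s)
  return :duplicate unless PROCESSED.add?(key)  # Set#add? is nil on a repeat
  SENT << campaign_params                       # the real work runs once
  :sent
end

params = { campaign: "spring", user_id: 42 }
send_promo_email(params)  # => :sent
send_promo_email(params)  # => :duplicate — a retry is a no-op
```

In a real system the token check and the work must be made atomic (a unique index plus a transaction, or an atomic insert), since an in-memory set obviously doesn't survive a worker crash between the check and the send.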

Relationship to concepts/idempotency-token

idempotency-token is the request/operation-level construct: a per-call identifier attached at the client that the server deduplicates against. Typically used for API calls (HTTP retries, hedged reads/writes).

idempotent-job-design is the worker-level construct: how the job's perform method is written so that the same conceptual work is safe to re-execute. Often uses idempotency-token internally (e.g. a job that hits an external API uses an idempotency token on that call), but the overall job-level idempotence is about the interaction with the application database.
