PATTERN Cited by 1 source

Jittered job scheduling

When enqueuing many jobs that share a downstream (external API, rate-limited service, shared database), attach a random delay per job drawn from [0, max_wait] so their executions spread evenly over the window instead of arriving as a coincident burst. Prevents thundering herds at the downstream.

The pattern

# caller code
CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes)

# framework helper (custom, on ApplicationJob)
class ApplicationJob < ActiveJob::Base
  MAX_WAIT = 1.minute

  # Enqueue with a random delay in [min_wait, max_wait] so bulk
  # enqueues spread over the window instead of arriving as a burst.
  def self.perform_with_jitter(*args, **options)
    max_wait    = options.delete(:max_wait) || MAX_WAIT
    min_wait    = options.delete(:min_wait) || 0.seconds
    random_wait = rand(min_wait..max_wait)
    # remaining keyword args pass through to the job itself
    set(wait: random_wait).perform_later(*args, **options)
  end
end

ActiveJob's set(wait: duration) is the substrate — Sidekiq translates it into a scheduled-set entry in Redis with a run-at timestamp. The jitter helper composes random-delay generation with set.
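A toy model of that substrate in plain Ruby may help: not Sidekiq's actual implementation (which uses a Redis sorted set scored by run-at timestamp and a poller process), just the shape of it.

```ruby
# Toy model of a scheduled set: entries ordered by run-at timestamp;
# a poller moves due entries onto the ready queue. Illustrative only.
scheduled = [] # [[run_at_epoch_seconds, job_payload], ...]

enqueue_in = lambda do |now, delay, payload|
  scheduled << [now + delay, payload]
  scheduled.sort_by!(&:first) # a Redis zset keeps this order for free
end

poll_due = lambda do |now|
  due, not_yet = scheduled.partition { |run_at, _| run_at <= now }
  scheduled.replace(not_yet)
  due.map(&:last)
end

enqueue_in.call(0, 120, "CleanUpJob(42)")
enqueue_in.call(0, 30,  "CleanUpJob(7)")
poll_due.call(60) # => ["CleanUpJob(7)"]
```

The jitter helper only changes the `delay` each entry is enqueued with; everything downstream of that is the stock scheduled-set machinery.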

Why jitter

A scheduler job that bulk-enqueues 10,000 CleanUpJobs produces 10,000 jobs that are all immediately runnable. Without jitter:

  • Sidekiq workers pick them up as fast as they can.
  • Whichever external API / downstream service each CleanUpJob talks to gets 10,000 requests in the same few seconds.
  • The downstream rate-limits the burst, many CleanUpJob instances fail, retries storm, system oscillates.

With jitter over 30 minutes:

  • Each job's scheduled run-at is uniformly distributed over 30 minutes.
  • Downstream sees ~333 requests/minute, well within a typical rate limit.
  • No retries, no storm, system stable.
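The arrival-rate math can be checked with a few lines of plain Ruby (a seeded simulation of the enqueue, not Rails code):

```ruby
# Simulate 10,000 jittered run-ats over a 30-minute window and
# count arrivals per minute. Seeded for reproducibility.
N      = 10_000
WINDOW = 30 * 60 # seconds
rng    = Random.new(42)

run_ats    = Array.new(N) { rng.rand(0...WINDOW) }
per_minute = run_ats.group_by { |t| t / 60 }.transform_values(&:size)

# Uniform jitter: each of the 30 minutes receives roughly
# N / 30 ≈ 333 jobs, with no minute wildly above the mean.
```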

Jitter is the outbound-side smoothing of a batch of work that the infrastructure can produce fast but the downstream can't consume fast.

The anti-pattern: perform_in(fixed_delay)

A naive attempt: CleanUpJob.perform_in(10.minutes, id) for every job. Problem: all 10,000 jobs run at now + 10 minutes, which is exactly the same thundering-herd problem, just shifted.

Jitter requires the delay to be random per job.
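The difference is easy to see if you model both schemes as plain run-at timestamps (seconds from now; illustrative numbers, no Rails required):

```ruby
# perform_in(10.minutes) for every job: one coincident run-at.
fixed = Array.new(10_000) { 600 }

# perform_with_jitter(max_wait: 30.minutes): a per-job random run-at.
rng      = Random.new(1)
jittered = Array.new(10_000) { rng.rand(0...1800) }

fixed.uniq.size    # => 1 (the herd is intact, just shifted 10 minutes)
jittered.uniq.size # nearly 1800 distinct run-ats across the window
```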

Shape of the jitter window

  • max_wait: upper bound of the delay. Set to the window within which all work must complete. For loosely-timed maintenance (cleanup, garbage collection, eventual consistency reconciliation), 30 minutes to hours is typical. For tighter SLOs, minutes.
  • min_wait: lower bound (usually 0). Non-zero min_wait is useful if the downstream can't handle any inbound load right now — e.g. when you're kicking off a batch right after a deploy and want to let warm-up finish first.
  • Distribution: uniform rand(min..max) is the default. Exponential, Gaussian, or bimodal distributions could be useful in specific contexts but the overwhelming common case is uniform.
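Both knobs fit in a minimal draw function (hypothetical name `jitter_delay`; seconds as plain integers rather than ActiveSupport durations):

```ruby
# Draw a uniform delay in [min_wait, max_wait], in seconds.
def jitter_delay(max_wait:, min_wait: 0, rng: Random.new)
  rng.rand(min_wait..max_wait)
end

# Usual case: spread work over the next 30 minutes.
jitter_delay(max_wait: 30 * 60)

# Post-deploy case: hold everything back at least 5 minutes,
# then spread it over the remaining 25.
jitter_delay(min_wait: 5 * 60, max_wait: 30 * 60)
```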

Canonical implementation (PlanetScale, 2022)

From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:

"We don't want certain types of jobs all running at once because they may overwhelm an external API. In those cases, we spread them out over a time period using perform_with_jitter."

Canonical call-site: CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes), glossed in the post as "This will run sometime in the next 30 minutes."

Canonical helper placement: app/jobs/application_job.rb as a class method on ApplicationJob so every subclass can call it.

When to use it

  • Scheduler bulk-enqueues jobs that share a downstream. The canonical case. The scheduler → bulk enqueue → jitter triplet is how PlanetScale structures periodic maintenance.
  • Work that doesn't need to happen now. Any job whose completion window is measured in minutes-to-hours rather than seconds.
  • Upstream of a rate-limited external service. A third-party API with 100 req/s can tolerate 6,000 req/minute smoothed over a window but not 10,000 in 10 seconds.

When not to use it

  • Work with strict SLO. If a user is waiting on the job, jitter adds observable latency.
  • Causally-ordered work. If JobB must run after JobA completes, jitter will usually violate ordering. Use explicit dependency / chaining instead.
  • Work where the downstream can keep up. If the downstream handles the full burst comfortably, jitter is overhead for no benefit.
  • Low volume. 10 jobs don't need jittering; the burst is small.

Composition with other patterns

Relationship to naive CleanUpJob.perform_in(rand(...))

Naive pre-jitter patterns write their own rand(0..30.minutes) at the call-site. The helper perform_with_jitter(id, max_wait: 30.minutes) is the principled version: one place defines the jitter semantics, every caller opts in by name, operators can grep call-sites for audit.
