PATTERN
Jittered job scheduling¶
When enqueuing many jobs that share a downstream (an external
API, a rate-limited service, a shared database), attach a
random per-job delay drawn from [0, max_wait] so
executions spread evenly over the window instead of
arriving as a coincident burst. This prevents thundering
herds at the downstream.
The pattern¶
# caller code
CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes)

# framework helper (custom class method on ApplicationJob)
class ApplicationJob < ActiveJob::Base
  MAX_WAIT = 1.minute

  def self.perform_with_jitter(*args, **options)
    max_wait = options[:max_wait] || MAX_WAIT
    min_wait = options[:min_wait] || 0.seconds
    # rand wants integers; convert durations to seconds, then back
    random_wait = rand(min_wait.to_i...max_wait.to_i).seconds
    set(wait: random_wait).perform_later(*args)
  end
end
ActiveJob's set(wait: duration) is the substrate — Sidekiq
translates it into a scheduled-set entry in Redis with a
run-at timestamp. The jitter helper composes random-delay
generation with set.
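As a minimal plain-Ruby sketch (not Sidekiq's actual Redis encoding), a scheduled entry is just the job payload paired with an absolute run-at timestamp computed from the wait:

```ruby
require "time"

# Minimal model of a scheduled-set entry: payload plus an absolute
# run-at timestamp. Sidekiq stores the real thing in a Redis sorted
# set scored by that timestamp; this sketch shows only the arithmetic.
def schedule_with_wait(payload, wait_seconds, now: Time.now)
  { payload: payload, run_at: now + wait_seconds }
end

entry = schedule_with_wait({ job: "CleanUpJob", args: [42] }, 600, now: Time.at(0))
entry[:run_at].to_i  # 600 seconds after the chosen "now"
```

The helper's only job is choosing wait_seconds randomly; everything after that is ordinary scheduling.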
Why jitter¶
A scheduler job that bulk-enqueues 10,000 CleanUpJobs
produces 10,000 jobs that are all immediately runnable.
Without jitter:
- Sidekiq workers pick them up as fast as they can.
- Whichever external API / downstream service each CleanUpJob talks to gets 10,000 requests in the same few seconds.
- The downstream rate-limits the burst, many CleanUpJob instances fail, retries storm, and the system oscillates.
With jitter over 30 minutes:
- Each job's scheduled run-at is uniformly distributed over 30 minutes.
- Downstream sees ~333 requests/minute, well within a typical rate limit.
- No retries, no storm, system stable.
Jitter is the outbound-side smoothing of a batch of work that the infrastructure can produce fast but the downstream can't consume fast.
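The per-minute arithmetic above can be checked with a short simulation (plain Ruby with a seeded RNG; the 10,000-job and 30-minute figures come from the example):

```ruby
# Draw a uniform delay in [0, 30 minutes) for each of 10,000 jobs,
# then count how many run-at timestamps land in each minute.
WINDOW_SECONDS = 30 * 60
JOBS = 10_000

rng = Random.new(42) # seeded so the simulation is reproducible
delays = Array.new(JOBS) { rng.rand(0...WINDOW_SECONDS) }

per_minute = delays.group_by { |seconds| seconds / 60 }
                   .transform_values(&:size)

# Each of the 30 minutes sees roughly 10_000 / 30 ≈ 333 jobs,
# instead of all 10,000 arriving in the first few seconds.
```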
The anti-pattern: perform_in(fixed_delay)¶
A naive attempt: CleanUpJob.perform_in(10.minutes, id)
for every job. Problem: all 10,000 jobs run at
now + 10 minutes, which is exactly the same
thundering-herd problem, just shifted.
Jitter requires the delay to be random per job.
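A quick way to see the difference, as a sketch with illustrative numbers (1,000 jobs, a 10-minute fixed delay vs. a 30-minute jitter window):

```ruby
rng = Random.new(1)

# perform_in(10.minutes, ...): every job gets the same run-at second.
fixed = Array.new(1_000) { 10 * 60 }

# perform_with_jitter-style: each job draws its own random delay.
jittered = Array.new(1_000) { rng.rand(0...(30 * 60)) }

fixed.uniq.size     # 1 distinct run-at second: the same burst, shifted
jittered.uniq.size  # hundreds of distinct run-at seconds
```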
Shape of the jitter window¶
- max_wait: upper bound of the delay. Set it to the window within which all work must complete. For loosely timed maintenance (cleanup, garbage collection, eventual-consistency reconciliation), 30 minutes to hours is typical; for tighter SLOs, minutes.
- min_wait: lower bound (usually 0). Non-zero min_wait is useful if the downstream can't handle any inbound load right now — e.g. when you're kicking off a batch right after a deploy and want to let warm-up finish first.
- Distribution: uniform rand(min..max) is the default. Exponential, Gaussian, or bimodal distributions could be useful in specific contexts, but the overwhelmingly common case is uniform.
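Those knobs can be sketched as a standalone helper (hypothetical name, mirroring the min_wait/max_wait shape described above; delays in plain integer seconds rather than ActiveSupport durations):

```ruby
# Returns a delay in seconds drawn uniformly from [min_wait, max_wait).
def jittered_delay(max_wait:, min_wait: 0, rng: Random.new)
  rng.rand(min_wait...max_wait)
end

# Warm-up guard: with a non-zero min_wait, nothing runs in the first
# five minutes, and everything runs within the half-hour window.
delay = jittered_delay(min_wait: 5 * 60, max_wait: 30 * 60)
# 300 <= delay && delay < 1800 always holds
```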
Canonical implementation (PlanetScale, 2022)¶
From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:
"We don't want certain types of jobs all running at
once because they may overwhelm an external API. In
those cases, we spread them out over a time period
using perform_with_jitter."
Canonical call-site: CleanUpJob.perform_with_jitter(id,
max_wait: 30.minutes) — "This will run sometime in the
next 30 minutes."
Canonical helper placement: app/jobs/application_job.rb
as a class method on ApplicationJob so every subclass
can call it.
When to use it¶
- Scheduler bulk-enqueues jobs that share a downstream. The canonical case. The scheduler + bulk enqueue + jitter triplet is how PlanetScale structures periodic maintenance.
- Work that doesn't need to happen now. Any job whose completion window is measured in minutes-to-hours rather than seconds.
- Upstream of a rate-limited external service. A third-party API with 100 req/s can tolerate 6,000 req/minute smoothed over a window but not 10,000 in 10 seconds.
When not to use it¶
- Work with strict SLO. If a user is waiting on the job, jitter adds observable latency.
- Causally ordered work. If JobB must run after JobA completes, jitter will usually violate ordering; use explicit dependency / chaining instead.
- Work where the downstream can keep up. If the downstream handles the full burst comfortably, jitter is overhead for no benefit.
- Low volume. 10 jobs don't need jittering; the burst is small.
Composition with other patterns¶
- Paired scheduler–reconciler — the canonical upstream; the scheduler finds work, bulk-enqueues with jitter.
- Bulk enqueue — the write side; jitter controls the read/execute side.
- TTL-based deletion with jitter — same jitter concept at the cache-eviction altitude.
Relationship to CleanUpJob.perform_in(random) naively¶
Naive pre-jitter patterns write their own
rand(0..30.minutes) at the call-site. The helper
perform_with_jitter(id, max_wait: 30.minutes) is the
principled version: one place defines the jitter
semantics, every caller opts in by name, operators can
grep call-sites for audit.
Seen in¶
- sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing —
canonical wiki introduction. PlanetScale's Rails
backend defines perform_with_jitter as an ApplicationJob
class method; canonical call-site
CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes).
Motivation: bulk-enqueued jobs calling external APIs must not burst.