PATTERN Cited by 1 source

Jittered job scheduling

When enqueuing many jobs that share a downstream (external API, rate-limited service, shared database), attach a random delay per job drawn from [0, max_wait] so their executions spread evenly over the window instead of arriving as a coincident burst. Prevents thundering herds at the downstream.

The pattern

# caller code
CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes)

# framework helper (custom, on ApplicationJob)
class ApplicationJob < ActiveJob::Base
  MAX_WAIT = 1.minute

  # Enqueue with a random delay in [min_wait, max_wait] so bulk
  # enqueues spread over the window instead of arriving as a burst.
  def self.perform_with_jitter(*args, **options)
    max_wait    = options.delete(:max_wait) || MAX_WAIT
    min_wait    = options.delete(:min_wait) || 0.seconds
    random_wait = rand(min_wait..max_wait)
    # remaining keyword args pass through to the job itself
    set(wait: random_wait).perform_later(*args, **options)
  end
end

ActiveJob's set(wait: duration) is the substrate — Sidekiq translates it into a scheduled-set entry in Redis with a run-at timestamp. The jitter helper composes random-delay generation with set.
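A toy model of that substrate in plain Ruby may help: not Sidekiq's actual implementation (which uses a Redis sorted set scored by run-at timestamp and a poller process), just the shape of it.

```ruby
# Toy model of a scheduled set: entries ordered by run-at timestamp;
# a poller moves due entries onto the ready queue. Illustrative only.
scheduled = [] # [[run_at_epoch_seconds, job_payload], ...]

enqueue_in = lambda do |now, delay, payload|
  scheduled << [now + delay, payload]
  scheduled.sort_by!(&:first) # a Redis zset keeps this order for free
end

poll_due = lambda do |now|
  due, not_yet = scheduled.partition { |run_at, _| run_at <= now }
  scheduled.replace(not_yet)
  due.map(&:last)
end

enqueue_in.call(0, 120, "CleanUpJob(42)")
enqueue_in.call(0, 30,  "CleanUpJob(7)")
poll_due.call(60) # => ["CleanUpJob(7)"]
```

The jitter helper only changes the `delay` each entry is enqueued with; everything downstream of that is the stock scheduled-set machinery.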

Why jitter

A scheduler job that bulk-enqueues 10,000 CleanUpJobs produces 10,000 jobs that are all immediately runnable. Without jitter:

  • Sidekiq workers pick them up as fast as they can.
  • Whichever external API / downstream service each CleanUpJob talks to gets 10,000 requests in the same few seconds.
  • The downstream rate-limits the burst, many CleanUpJob instances fail, retries storm, system oscillates.

With jitter over 30 minutes:

  • Each job's scheduled run-at is uniformly distributed over 30 minutes.
  • Downstream sees ~333 requests/minute, well within a typical rate limit.
  • No retries, no storm, system stable.
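The arrival-rate math can be checked with a few lines of plain Ruby (a seeded simulation of the enqueue, not Rails code):

```ruby
# Simulate 10,000 jittered run-ats over a 30-minute window and
# count arrivals per minute. Seeded for reproducibility.
N      = 10_000
WINDOW = 30 * 60 # seconds
rng    = Random.new(42)

run_ats    = Array.new(N) { rng.rand(0...WINDOW) }
per_minute = run_ats.group_by { |t| t / 60 }.transform_values(&:size)

# Uniform jitter: each of the 30 minutes receives roughly
# N / 30 ≈ 333 jobs, with no minute wildly above the mean.
```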

Jitter is the outbound-side smoothing of a batch of work that the infrastructure can produce fast but the downstream can't consume fast.

The anti-pattern: perform_in(fixed_delay)

A naive attempt: CleanUpJob.perform_in(10.minutes, id) for every job. Problem: all 10,000 jobs run at now + 10 minutes, which is exactly the same thundering-herd problem, just shifted.

Jitter requires the delay to be random per job.
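The difference is easy to see if you model both schemes as plain run-at timestamps (seconds from now; illustrative numbers, no Rails required):

```ruby
# perform_in(10.minutes) for every job: one coincident run-at.
fixed = Array.new(10_000) { 600 }

# perform_with_jitter(max_wait: 30.minutes): a per-job random run-at.
rng      = Random.new(1)
jittered = Array.new(10_000) { rng.rand(0...1800) }

fixed.uniq.size    # => 1 (the herd is intact, just shifted 10 minutes)
jittered.uniq.size # nearly 1800 distinct run-ats across the window
```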

Shape of the jitter window

  • max_wait: upper bound of the delay. Set to the window within which all work must complete. For loosely-timed maintenance (cleanup, garbage collection, eventual consistency reconciliation), 30 minutes to hours is typical. For tighter SLOs, minutes.
  • min_wait: lower bound (usually 0). Non-zero min_wait is useful if the downstream can't handle any inbound load right now — e.g. when you're kicking off a batch right after a deploy and want to let warm-up finish first.
  • Distribution: uniform rand(min..max) is the default. Exponential, Gaussian, or bimodal distributions could be useful in specific contexts but the overwhelming common case is uniform.
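Both knobs fit in a minimal draw function (hypothetical name `jitter_delay`; seconds as plain integers rather than ActiveSupport durations):

```ruby
# Draw a uniform delay in [min_wait, max_wait], in seconds.
def jitter_delay(max_wait:, min_wait: 0, rng: Random.new)
  rng.rand(min_wait..max_wait)
end

# Usual case: spread work over the next 30 minutes.
jitter_delay(max_wait: 30 * 60)

# Post-deploy case: hold everything back at least 5 minutes,
# then spread it over the remaining 25.
jitter_delay(min_wait: 5 * 60, max_wait: 30 * 60)
```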

Canonical implementation (PlanetScale, 2022)

From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:

"We don't want certain types of jobs all running at once because they may overwhelm an external API. In those cases, we spread them out over a time period using perform_with_jitter."

Canonical call-site: CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes), glossed in the post as "This will run sometime in the next 30 minutes."

Canonical helper placement: app/jobs/application_job.rb as a class method on ApplicationJob so every subclass can call it.

When to use it

  • Scheduler bulk-enqueues jobs that share a downstream. The canonical case. The scheduler → bulk enqueue → jitter triplet is how PlanetScale structures periodic maintenance.
  • Work that doesn't need to happen now. Any job whose completion window is measured in minutes-to-hours rather than seconds.
  • Upstream of a rate-limited external service. A third-party API with 100 req/s can tolerate 6,000 req/minute smoothed over a window but not 10,000 in 10 seconds.

When not to use it

  • Work with strict SLO. If a user is waiting on the job, jitter adds observable latency.
  • Causally-ordered work. If JobB must run after JobA completes, jitter will usually violate ordering. Use explicit dependency / chaining instead.
  • Work where the downstream can keep up. If the downstream handles the full burst comfortably, jitter is overhead for no benefit.
  • Low volume. 10 jobs don't need jittering; the burst is small.

Composition with other patterns

Relationship to naive CleanUpJob.perform_in(rand(...))

Naive pre-jitter patterns write their own rand(0..30.minutes) at the call-site. The helper perform_with_jitter(id, max_wait: 30.minutes) is the principled version: one place defines the jitter semantics, every caller opts in by name, operators can grep call-sites for audit.
