PATTERN Cited by 1 source

Paired scheduler–reconciler

For every async job FooJob, install a companion scheduler job ScheduleFooJobs that runs periodically, queries the authoritative database for records implying FooJob should run, and enqueues one FooJob per such record. The user-triggered enqueue is an optimisation for low latency; the scheduler is the correctness guarantee — it reconstructs the queue from database truth if any user enqueue is dropped, the Redis store is flushed, or the queue data is lost entirely.

The pattern

class ScheduleDatabaseJobs < BaseJob
  sidekiq_options queue: "background"

  def perform
    Database.pending.find_each do |database|
      DatabaseCreationJob.perform_async(database.id)
    end
  end
end

Run this every minute (via sidekiq-cron or equivalent), paired with DatabaseCreationJob.perform_async(database.id) called synchronously from the user-triggered code path at create-a-database time. Either path — direct user enqueue or scheduler re-enqueue — produces the same outcome; duplicates are handled by idempotent job design.
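The two paths can be sketched end to end with in-memory stand-ins (Database, PENDING, QUEUE, and both method names are hypothetical; real code would use ActiveRecord and Sidekiq):

```ruby
# Minimal sketch of the paired enqueue paths. PENDING stands in for
# Database.pending; QUEUE stands in for the Redis-backed Sidekiq queue.
Database = Struct.new(:id, :state)

PENDING = []
QUEUE   = []

# Fast path: user action enqueues immediately after the row is written.
def create_database(id)
  PENDING << Database.new(id, "pending")
  QUEUE << id # DatabaseCreationJob.perform_async(id)
end

# Correctness path: the scheduler rebuilds the queue from database truth.
def schedule_database_jobs
  PENDING.each { |db| QUEUE << db.id }
end

create_database(1)
QUEUE.clear             # simulate losing all queue data (Redis flush)
schedule_database_jobs  # next tick re-enqueues from the database
QUEUE # => [1]
```

The row is the durable record of intent; the queue entry is reconstructible from it at any tick.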

Shape

  • Pair: one scheduler job per work job.
  • Cadence: scheduler runs on a cron (every minute, every 5 minutes). Cadence tunable per pair based on latency-sensitivity.
  • Derivable work query: the scheduler's query runs over authoritative database state. The database's answer to "what work needs doing?" is authoritative; the queue's answer is advisory.
  • Idempotent work job: every FooJob must tolerate being run multiple times for the same record. Composed via early-exit + DB lock + framework unique-jobs.
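The early-exit layer of that idempotence can be sketched as follows (the class, method, and in-memory store are hypothetical stand-ins; a real job would add a DB row lock and a unique-jobs middleware as further layers):

```ruby
# Early-exit idempotence: check whether the work is already done before
# doing it, so duplicate enqueues (scheduler + user + retry) are harmless.
class DatabaseCreationJob
  @provisioned = {} # stands in for durable state checked at job start

  def self.perform(database_id)
    return :skipped if @provisioned[database_id] == :created # early exit
    @provisioned[database_id] = :created # the side effect, recorded durably
    :created
  end
end

DatabaseCreationJob.perform(42) # => :created
DatabaseCreationJob.perform(42) # => :skipped (duplicate is a no-op)
```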

Why paired, not single

One naive alternative: skip the user-triggered enqueue and rely only on the scheduler. Two problems:

  1. Scheduler-tick latency floor: a user creating a database waits until the next scheduler tick (up to the tick interval) before anything happens. 60 seconds of "nothing" is unacceptable UX.
  2. Scheduler load amplification: the scheduler's per-tick query cost scales with total pending work, not with the delta since the last tick. Without user-enqueue on the common path, the scheduler becomes the hot path.

The pair gives low latency on the common path (user-triggered) and correctness under data loss (scheduler).

Why not retry-only

Another naive alternative: just rely on Sidekiq's automatic retry for failed jobs. Problems:

  • Retry doesn't cover dropped-before-enqueue. If perform_async is never reached (network failure, bug, deploy race), nothing enqueues and nothing retries.
  • Retry doesn't cover queue-data-loss. If Redis is flushed (operator mistake, corruption, failover), retry records are lost with everything else.

The scheduler is enqueue-insurance, not execution-insurance.

Canonical implementation (PlanetScale, 2022)

From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:

  • Create-a-database flow:
      • User action creates a databases row with state: pending.
      • Synchronous DatabaseCreationJob.perform_async(id) for the fast path.
      • ScheduleDatabaseJobs runs every minute, finds all Database.pending, re-enqueues each.
  • Backup flow:
      • BackupPolicy rows exist for each database (row-of-record for "should this be backed up?").
      • ScheduleBackupJobs runs every 5 minutes: BackupPolicy.needs_to_run.in_batches { |batch| BackupJob.perform_bulk(batch.pluck(:id).zip) }.
      • No user-triggered enqueue for backups — a pure scheduler-only case where the latency floor is acceptable.
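The batching shape can be sketched with each_slice standing in for ActiveRecord's in_batches and an array standing in for BackupJob.perform_bulk, which takes one argument array per job (all names except each_slice are hypothetical):

```ruby
# Bulk enqueue: one push per batch instead of one per job.
policy_ids = (1..2500).to_a # stands in for BackupPolicy.needs_to_run ids
enqueued_batches = []

policy_ids.each_slice(1000) do |batch|
  # BackupJob.perform_bulk(batch.map { |id| [id] }) — one round-trip per batch
  enqueued_batches << batch.map { |id| [id] }
end

enqueued_batches.size # => 3 pushes instead of 2500
```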

"We've put this decision to the test a couple of times already this year. We were able to dump our queues entirely without impacting our user experience."

When to use it

  • Work is derivable from persistent state. If "what jobs should exist right now" is computable from a database query, you can schedule them.
  • Data loss in the queue is a real failure mode. Redis flush, Redis corruption, operator mistake, misconfigured pool — these are all solvable by the scheduler reconstructing the queue.
  • Work is idempotent or can be made so. Early-exit + DB lock + framework dedup cover the common cases (see concepts/idempotent-job-design).
  • Tick cadence meets latency SLO. If the pair is scheduler-only and 60s latency is acceptable, do scheduler-only. If not, pair with user-triggered enqueue.

When not to use it

  • Work isn't derivable from state. If the work is "send this one-off email right now," there's no row that says the email should exist; the scheduler has nothing to check.
  • Jobs can't be made idempotent. If FooJob's effect is non-idempotent and can't be protected (no natural unique key, no DB transaction boundary), the scheduler's re-enqueue will produce duplicates that actually matter.
  • Database can't absorb the scheduler query load. Database.pending.find_each every minute with no index on state will become a table scan. The scheduler requires an index that supports its query.
  • State is distributed across multiple data stores. The scheduler query is "select work from one authoritative place." If state lives in Redis + MySQL + an external API, the scheduler can't issue a single query to reconstruct work.

Composition with other patterns

  • Bulk enqueue — the scheduler bulk-enqueues its findings via perform_bulk to amortise Redis round-trips. At 10,000 rows, one Redis command per batch instead of one per job.
  • Jittered scheduling — when the scheduler bulk-enqueues N jobs that hit the same external API, jitter spreads execution over a window to avoid thundering herd.
  • Feature-flagged enqueue rejection — the scheduler itself is a Sidekiq job; Flipper-gated middleware can disable it during an incident.
  • Idempotent jobs — every paired work job must be idempotent; scheduler re-enqueue + user re-trigger + Sidekiq retry can all fire simultaneously.
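The jitter composition can be sketched as (WINDOW and the enqueue target are hypothetical; Sidekiq's perform_in takes a delay in seconds):

```ruby
# Jittered scheduling: spread N enqueued jobs over a window instead of
# firing them all at the top of the scheduler tick.
WINDOW = 300 # seconds over which to spread execution
ids = (1..100).to_a

# BackupJob.perform_in(delay, id) for each — here we just record the delays
delays = ids.map { |_id| rand(WINDOW) }

delays.all? { |d| d >= 0 && d < WINDOW } # => true
```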

Relationship to Kubernetes / control-loop reconcilers

Kubernetes controllers are the same shape at a different altitude:

  • Kubernetes controller: reads desired state (spec), compares to observed state (status), acts to close the gap. Runs continuously on the controller manager's reconcile loop.
  • Scheduler-reconciler: reads desired state (DB rows implying work), compares to observed state (queue contents, implicitly), acts by enqueueing. Runs on a fixed cadence.

The controller pattern is what's happening; the scheduler-reconciler is the async-job-framework idiom of it.
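Stripped to its core, both are the same reconcile step (names hypothetical):

```ruby
# Generic reconcile: desired state minus observed state yields the
# actions to take — for the scheduler, the job ids still to enqueue.
def reconcile(desired, observed)
  desired - observed
end

reconcile([1, 2, 3], [2]) # => [1, 3]
```

The scheduler-reconciler skips the "observed" read — it enqueues everything pending and lets idempotence absorb the overlap, which is why idempotent jobs are a hard requirement.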

Failure modes

  • Scheduler stops running. The system's self-healing property depends on the scheduler ticking. If sidekiq-cron is misconfigured, a deploy breaks the scheduler job, or an operator disables the scheduler via the kill-switch and forgets, nothing re-enqueues. There's no meta-scheduler for the scheduler. Mitigation: monitor "time since last scheduler tick" as an SLI.
  • Scheduler query misses pending work. If the work-derivation query has a bug (e.g. it omits state: 'partially_pending' from its condition), no scheduler tick will ever enqueue that work. Mitigation: test that the derivation query covers every state that implies work.
  • Scheduler falls behind. At very high volume, find_each-in-one-minute may not finish before the next tick. Overlap is usually harmless (thanks to idempotence) but each tick's cost grows with backlog. Mitigation: bulk enqueue reduces per-tick cost; horizontal scheduler sharding.
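The first mitigation can be sketched as a heartbeat check (HEARTBEAT, the method, and the threshold are hypothetical; real code would persist the timestamp in Redis or the database and alert via the monitoring system):

```ruby
# "Time since last scheduler tick" SLI: the scheduler writes a heartbeat
# each run; an alert fires when the gap exceeds a threshold.
HEARTBEAT = { last_tick: Time.now - 600 } # pretend the last tick was 10 min ago

def scheduler_stale?(threshold_seconds = 180)
  Time.now - HEARTBEAT[:last_tick] > threshold_seconds
end

scheduler_stale? # => true — the self-healing loop itself has stopped
```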

Seen in

  • sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing (canonical wiki introduction). PlanetScale's Rails control plane runs paired scheduler jobs for database creation, branch creation, schema deployment, and backups. The authoritative state is in MySQL rows; the scheduler reconciles the Redis-backed Sidekiq queue against that state. "As long as the scheduler job is running, we could dump our entire queue and still recover."