PATTERN Cited by 1 source

Paired scheduler–reconciler

For every async job FooJob, install a companion scheduler job ScheduleFooJobs that runs periodically, queries the authoritative database for records implying FooJob should run, and enqueues one FooJob per such record. The user-triggered enqueue is an optimisation for low latency; the scheduler is the correctness guarantee — it reconstructs the queue from database truth if any user enqueue is dropped, the Redis store is flushed, or the queue data is lost entirely.

The pattern

class ScheduleDatabaseJobs < BaseJob
  sidekiq_options queue: "background"

  def perform
    Database.pending.find_each do |database|
      DatabaseCreationJob.perform_async(database.id)
    end
  end
end

Run this every minute (via sidekiq-cron or equivalent), paired with DatabaseCreationJob.perform_async(database.id) called synchronously from the user-triggered code path at create-a-database time. Either path — direct user enqueue or scheduler re-enqueue — produces the same outcome; duplicates are handled by idempotent job design.
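The two paths can be sketched end to end with in-memory stand-ins (Database, PENDING, QUEUE, and both method names are hypothetical; real code would use ActiveRecord and Sidekiq):

```ruby
# Minimal sketch of the paired enqueue paths. PENDING stands in for
# Database.pending; QUEUE stands in for the Redis-backed Sidekiq queue.
Database = Struct.new(:id, :state)

PENDING = []
QUEUE   = []

# Fast path: user action enqueues immediately after the row is written.
def create_database(id)
  PENDING << Database.new(id, "pending")
  QUEUE << id # DatabaseCreationJob.perform_async(id)
end

# Correctness path: the scheduler rebuilds the queue from database truth.
def schedule_database_jobs
  PENDING.each { |db| QUEUE << db.id }
end

create_database(1)
QUEUE.clear             # simulate losing all queue data (Redis flush)
schedule_database_jobs  # next tick re-enqueues from the database
QUEUE # => [1]
```

The row is the durable record of intent; the queue entry is reconstructible from it at any tick.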

Shape

  • Pair: one scheduler job per work job.
  • Cadence: scheduler runs on a cron (every minute, every 5 minutes). Cadence tunable per pair based on latency-sensitivity.
  • Derivable work query: the scheduler's query runs over authoritative database state. The database's answer to "what work needs doing?" is authoritative; the queue's answer is advisory.
  • Idempotent work job: every FooJob must tolerate being run multiple times for the same record. Composed via early-exit + DB lock + framework unique-jobs.
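The early-exit layer of that idempotence can be sketched as follows (the class, method, and in-memory store are hypothetical stand-ins; a real job would add a DB row lock and a unique-jobs middleware as further layers):

```ruby
# Early-exit idempotence: check whether the work is already done before
# doing it, so duplicate enqueues (scheduler + user + retry) are harmless.
class DatabaseCreationJob
  @provisioned = {} # stands in for durable state checked at job start

  def self.perform(database_id)
    return :skipped if @provisioned[database_id] == :created # early exit
    @provisioned[database_id] = :created # the side effect, recorded durably
    :created
  end
end

DatabaseCreationJob.perform(42) # => :created
DatabaseCreationJob.perform(42) # => :skipped (duplicate is a no-op)
```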

Why paired, not single

One naive alternative: skip the user-triggered enqueue and rely only on the scheduler. Two problems:

  1. Scheduler-tick latency floor: a user creating a database waits until the next scheduler tick (up to the tick interval) before anything happens. 60 seconds of "nothing" is unacceptable UX.
  2. Scheduler load amplification: the scheduler's per-tick query cost scales with total pending work, not with the delta since the last tick. Without user-enqueue on the common path, the scheduler becomes the hot path.

The pair gives low latency on the common path (user-triggered) and correctness under data loss (scheduler).

Why not retry-only

Another naive alternative: just rely on Sidekiq's automatic retry for failed jobs. Problems:

  • Retry doesn't cover dropped-before-enqueue. If perform_async is never reached (network failure, bug, deploy race), nothing enqueues and nothing retries.
  • Retry doesn't cover queue-data-loss. If Redis is flushed (operator mistake, corruption, failover), retry records are lost with everything else.

The scheduler is enqueue-insurance, not execution-insurance.

Canonical implementation (PlanetScale, 2022)

From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:

  • Create-a-database flow:
      • User action creates a databases row with state: pending.
      • Synchronous DatabaseCreationJob.perform_async(id) for the fast path.
      • ScheduleDatabaseJobs runs every minute, finds all Database.pending, re-enqueues each.
  • Backup flow:
      • BackupPolicy rows exist for each database (row-of-record for "should this be backed up?").
      • ScheduleBackupJobs runs every 5 minutes: BackupPolicy.needs_to_run.in_batches { |batch| BackupJob.perform_bulk(batch.pluck(:id).zip) }.
      • No user-triggered enqueue for backups — a pure scheduler-only case where the latency floor is acceptable.
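The batching shape can be sketched with each_slice standing in for ActiveRecord's in_batches and an array standing in for BackupJob.perform_bulk, which takes one argument array per job (all names except each_slice are hypothetical):

```ruby
# Bulk enqueue: one push per batch instead of one per job.
policy_ids = (1..2500).to_a # stands in for BackupPolicy.needs_to_run ids
enqueued_batches = []

policy_ids.each_slice(1000) do |batch|
  # BackupJob.perform_bulk(batch.map { |id| [id] }) — one round-trip per batch
  enqueued_batches << batch.map { |id| [id] }
end

enqueued_batches.size # => 3 pushes instead of 2500
```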

"We've put this decision to the test a couple of times already this year. We were able to dump our queues entirely without impacting our user experience."

When to use it

  • Work is derivable from persistent state. If "what jobs should exist right now" is computable from a database query, you can schedule them.
  • Data loss in the queue is a real failure mode. Redis flush, Redis corruption, operator mistake, misconfigured pool — these are all solvable by the scheduler reconstructing the queue.
  • Work is idempotent or can be made so. Early-exit + DB lock + framework dedup cover the common cases (see concepts/idempotent-job-design).
  • Tick cadence meets latency SLO. If the pair is scheduler-only and 60s latency is acceptable, do scheduler-only. If not, pair with user-triggered enqueue.

When not to use it

  • Work isn't derivable from state. If the work is "send this one-off email right now," there's no row that says the email should exist; the scheduler has nothing to check.
  • Jobs can't be made idempotent. If FooJob's effect is non-idempotent and can't be protected (no natural unique key, no DB transaction boundary), the scheduler's re-enqueue will produce duplicates that actually matter.
  • Database can't absorb the scheduler query load. Database.pending.find_each every minute with no index on state will become a table scan. The scheduler requires an index that supports its query.
  • State is distributed across multiple data stores. The scheduler query is "select work from one authoritative place." If state lives in Redis + MySQL + an external API, the scheduler can't issue a single query to reconstruct work.

Composition with other patterns

  • Bulk enqueue — the scheduler bulk-enqueues its findings via perform_bulk to amortise Redis round-trips. At 10,000 rows, one Redis command per batch instead of one per job.
  • Jittered scheduling — when the scheduler bulk-enqueues N jobs that hit the same external API, jitter spreads execution over a window to avoid thundering herd.
  • Feature-flagged enqueue rejection — the scheduler itself is a Sidekiq job; Flipper-gated middleware can disable it during an incident.
  • Idempotent jobs — every paired work job must be idempotent; scheduler re-enqueue + user re-trigger + Sidekiq retry can all fire simultaneously.
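The jitter composition can be sketched as (WINDOW and the enqueue target are hypothetical; Sidekiq's perform_in takes a delay in seconds):

```ruby
# Jittered scheduling: spread N enqueued jobs over a window instead of
# firing them all at the top of the scheduler tick.
WINDOW = 300 # seconds over which to spread execution
ids = (1..100).to_a

# BackupJob.perform_in(delay, id) for each — here we just record the delays
delays = ids.map { |_id| rand(WINDOW) }

delays.all? { |d| d >= 0 && d < WINDOW } # => true
```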

Relationship to Kubernetes / control-loop reconcilers

Kubernetes controllers are the same shape at a different altitude:

  • Kubernetes controller: reads desired state (spec), compares to observed state (status), acts to close the gap. Runs continuously on the controller manager's reconcile loop.
  • Scheduler-reconciler: reads desired state (DB rows implying work), compares to observed state (queue contents, implicitly), acts by enqueueing. Runs on a fixed cadence.

The controller pattern is what's happening; the scheduler-reconciler is the async-job-framework idiom of it.
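Stripped to its core, both are the same reconcile step (names hypothetical):

```ruby
# Generic reconcile: desired state minus observed state yields the
# actions to take — for the scheduler, the job ids still to enqueue.
def reconcile(desired, observed)
  desired - observed
end

reconcile([1, 2, 3], [2]) # => [1, 3]
```

The scheduler-reconciler skips the "observed" read — it enqueues everything pending and lets idempotence absorb the overlap, which is why idempotent jobs are a hard requirement.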

Failure modes

  • Scheduler stops running. The system's self-healing property depends on the scheduler ticking. If sidekiq-cron is misconfigured, a deploy breaks the scheduler job, or an operator disables the scheduler via the kill-switch and forgets, nothing re-enqueues. There's no meta-scheduler for the scheduler. Mitigation: monitor "time since last scheduler tick" as an SLI.
  • Scheduler query misses pending work. If the work-derivation query has a bug (e.g. it omits state: 'partially_pending' from its condition), no scheduler tick will ever enqueue that work. Mitigation: test that the derivation query covers every state that implies work.
  • Scheduler falls behind. At very high volume, find_each-in-one-minute may not finish before the next tick. Overlap is usually harmless (thanks to idempotence) but each tick's cost grows with backlog. Mitigation: bulk enqueue reduces per-tick cost; horizontal scheduler sharding.
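The first mitigation can be sketched as a heartbeat check (HEARTBEAT, the method, and the threshold are hypothetical; real code would persist the timestamp in Redis or the database and alert via the monitoring system):

```ruby
# "Time since last scheduler tick" SLI: the scheduler writes a heartbeat
# each run; an alert fires when the gap exceeds a threshold.
HEARTBEAT = { last_tick: Time.now - 600 } # pretend the last tick was 10 min ago

def scheduler_stale?(threshold_seconds = 180)
  Time.now - HEARTBEAT[:last_tick] > threshold_seconds
end

scheduler_stale? # => true — the self-healing loop itself has stopped
```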

Seen in

  • sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing (canonical wiki introduction). PlanetScale's Rails control plane runs paired scheduler jobs for database creation, branch creation, schema deployment, and backups. The authoritative state is in MySQL rows; the scheduler reconciles the Redis-backed Sidekiq queue against that state. "As long as the scheduler job is running, we could dump our entire queue and still recover."