Paired scheduler–reconciler¶
For every async job FooJob, install a companion scheduler
job ScheduleFooJobs that runs periodically, queries the
authoritative database for records implying FooJob
should run, and enqueues one FooJob per such record. The
user-triggered enqueue is an optimisation for low latency;
the scheduler is the correctness guarantee — it
reconstructs the queue from database truth if any user
enqueue is dropped, the Redis store is flushed, or the
queue data is lost entirely.
The pattern¶
```ruby
class ScheduleDatabaseJobs < BaseJob
  sidekiq_options queue: "background"

  def perform
    Database.pending.find_each do |database|
      DatabaseCreationJob.perform_async(database.id)
    end
  end
end
```
Run this every minute (via sidekiq-cron or equivalent), paired with DatabaseCreationJob.perform_async(database.id) called synchronously from the user-triggered code path at create-a-database time. Either path — direct user enqueue or scheduler re-enqueue — produces the same outcome; duplicates are handled by idempotent job design.
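A minimal registration of the scheduler tick with sidekiq-cron might look like the following; the initializer path and job name are illustrative, not from the source:

```ruby
# config/initializers/sidekiq_cron.rb (illustrative path)
# Register the scheduler to tick every minute.
Sidekiq::Cron::Job.create(
  name:  "schedule-database-jobs", # assumed name
  cron:  "* * * * *",              # every minute
  class: "ScheduleDatabaseJobs"
)
```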
Shape¶
- Pair: one scheduler job per work job.
- Cadence: scheduler runs on a cron (every minute, every 5 minutes). Cadence tunable per pair based on latency-sensitivity.
- Derivable work query: scheduler's query is over the authoritative database state. The database answer to "what work needs doing?" is authoritative; the queue's answer is advisory.
- Idempotent work job: every FooJob must tolerate being run multiple times for the same record. Composed via early-exit + DB lock + framework unique-jobs.
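The early-exit half of that composition can be sketched in plain Ruby. The Struct stands in for an ActiveRecord model; in production the state check and transition would run inside a DB lock or transaction:

```ruby
# Sketch of an idempotent work job: exit early unless the record is
# still in a state that implies work, so duplicate enqueues are no-ops.
Record = Struct.new(:id, :state) do
  def pending?
    state == :pending
  end
end

class FooJob
  class << self
    attr_reader :runs
  end
  @runs = 0

  def self.perform(record)
    return unless record.pending?  # early exit: work already done
    @runs += 1                     # the side effect, counted for the demo
    record.state = :done
  end
end

record = Record.new(1, :pending)
FooJob.perform(record)  # does the work
FooJob.perform(record)  # duplicate enqueue: no-op
```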
Why paired, not single¶
One naive alternative: skip the user-triggered enqueue and rely only on the scheduler. Two problems:
- Scheduler-tick latency floor: a user creating a database waits until the next scheduler tick (up to the tick interval) before anything happens. 60 seconds of "nothing" is unacceptable UX.
- Scheduler load amplification: scheduler queries every interval scale with total pending work, not delta. Without user-enqueue on the common path, the scheduler becomes the hot path.
The pair gives low latency on the common path (user-triggered) and correctness under data loss (scheduler).
Why not retry-only¶
Another naive alternative: just rely on Sidekiq's automatic retry for failed jobs. Problems:
- Retry doesn't cover dropped-before-enqueue. If perform_async is never reached (network failure, bug, deploy race), nothing enqueues and nothing retries.
- Retry doesn't cover queue-data-loss. If Redis is flushed (operator mistake, corruption, failover), retry records are lost with everything else.
The scheduler is enqueue-insurance, not execution-insurance.
Canonical implementation (PlanetScale, 2022)¶
From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:
- Create-a-database flow:
  - User action creates a databases row with state: pending.
  - Synchronous DatabaseCreationJob.perform_async(id) for the fast path.
  - ScheduleDatabaseJobs runs every minute, finds all Database.pending, re-enqueues each.
- Backup flow:
  - BackupPolicy rows exist for each database (row-of-record for "should this be backed up?").
  - ScheduleBackupJobs runs every 5 minutes: BackupPolicy.needs_to_run.in_batches { |batch| BackupJob.perform_bulk(batch.pluck(:id)) }.
  - No user-triggered enqueue for backups; this is a pure scheduler-only case where the latency floor is acceptable.
"We've put this decision to the test a couple of times already this year. We were able to dump our queues entirely without impacting our user experience."
When to use it¶
- Work is derivable from persistent state. If "what jobs should exist right now" is computable from a database query, you can schedule them.
- Data loss in the queue is a real failure mode. Redis flush, Redis corruption, operator mistake, misconfigured pool — these are all solvable by the scheduler reconstructing the queue.
- Work is idempotent or can be made so. Early-exit + DB lock + framework dedup cover the common cases (see concepts/idempotent-job-design).
- Tick cadence meets latency SLO. If the pair is scheduler-only and 60s latency is acceptable, do scheduler-only. If not, pair with user-triggered enqueue.
When not to use it¶
- Work isn't derivable from state. If the work is "send this one-off email right now," there's no row that says the email should exist; the scheduler has nothing to check.
- Jobs can't be made idempotent. If FooJob's effect is non-idempotent and can't be protected (no natural unique key, no DB transaction boundary), the scheduler's re-enqueue will produce duplicates that actually matter.
- Database can't absorb the scheduler query load. Database.pending.find_each every minute with no index on state will become a table scan. The scheduler requires an index that supports its query.
- State is distributed across multiple data stores. The scheduler query is "select work from one authoritative place." If state lives in Redis + MySQL + an external API, the scheduler can't issue a single query to reconstruct work.
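For the query-load point, the supporting index is a one-line migration. A sketch, assuming a Rails app with a databases table; the migration class name is illustrative:

```ruby
# Without this index, Database.pending degrades to a full table scan
# on every scheduler tick as the table grows.
class AddStateIndexToDatabases < ActiveRecord::Migration[7.0]
  def change
    add_index :databases, :state
  end
end
```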
Composition with other patterns¶
- Bulk enqueue — the scheduler bulk-enqueues its findings via perform_bulk to amortise Redis round-trips. At 10,000 rows, one Redis command per batch instead of one per job.
- Jittered scheduling — when the scheduler bulk-enqueues N jobs that hit the same external API, jitter spreads execution over a window to avoid a thundering herd.
- Feature-flagged enqueue rejection — the scheduler itself is a Sidekiq job; Flipper-gated middleware can disable it during an incident.
- Idempotent jobs — every paired work job must be idempotent; scheduler re-enqueue + user re-trigger + Sidekiq retry can all fire simultaneously.
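The bulk-enqueue and jitter compositions can be sketched together in plain Ruby. Here `jittered_batches`, the batch size, and the 300-second window are illustrative stand-ins for `perform_bulk` plus a real jitter policy:

```ruby
# Slice ids into bulk-enqueue batches and give each batch a random
# start offset inside a jitter window, so N jobs don't fire at once.
BATCH_SIZE    = 1_000
JITTER_WINDOW = 300 # seconds; illustrative

def jittered_batches(ids, rng: Random.new)
  ids.each_slice(BATCH_SIZE).map do |batch|
    { ids: batch, delay: rng.rand(JITTER_WINDOW) }
  end
end

batches = jittered_batches((1..10_000).to_a)
# 10 batches: one bulk command per batch instead of one per job
```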
Relationship to Kubernetes / control-loop reconcilers¶
Kubernetes controllers are the same shape at a different altitude:
- Kubernetes controller: reads desired state (spec), compares to observed state (status), acts to close the gap. Runs continuously on the controller manager's reconcile loop.
- Scheduler-reconciler: reads desired state (DB rows implying work), compares to observed state (queue contents, implicitly), acts by enqueueing. Runs on a fixed cadence.
The controller pattern is what's happening; the scheduler-reconciler is the async-job-framework idiom of it.
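One tick of that reconcile shape, reduced to sets (a sketch; in practice the scheduler usually skips reading the queue and re-enqueues everything pending, relying on idempotence to absorb the overlap):

```ruby
# One reconcile tick: desired = ids the DB says need work,
# observed = ids already queued; act by enqueueing the gap.
def reconcile(desired_ids, queued_ids, &enqueue)
  (desired_ids - queued_ids).each(&enqueue)
end

enqueued = []
reconcile([1, 2, 3], [2]) { |id| enqueued << id }
# enqueued now holds the ids that were missing from the queue
```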
Failure modes¶
- Scheduler stops running. The system's self-healing property depends on the scheduler ticking. If sidekiq-cron is misconfigured, a deploy breaks the scheduler job, or an operator disables the scheduler via the kill-switch and forgets, nothing re-enqueues. There's no meta-scheduler for the scheduler. Mitigation: monitor "time since last scheduler tick" as an SLI.
- Scheduler query misses pending work. If the work-derivation query has a bug (e.g. doesn't include state: 'partially_pending' in its condition), no scheduler tick will ever enqueue that work. Mitigation: lock the query's coverage into a QA contract (tests asserting every work-implying state is matched).
- Scheduler falls behind. At very high volume, find_each-in-one-minute may not finish before the next tick. Overlap is usually harmless (thanks to idempotence) but each tick's cost grows with backlog. Mitigation: bulk enqueue reduces per-tick cost; horizontal scheduler sharding.
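The "time since last scheduler tick" SLI can be as small as a heartbeat the scheduler writes on each run plus a staleness check. A sketch; in production the timestamp would live in Redis or the DB and the check in the monitoring system:

```ruby
# Heartbeat sketch: the scheduler records when it last ticked;
# monitoring alerts when that timestamp goes stale.
class SchedulerHeartbeat
  STALE_AFTER = 120 # seconds: two missed one-minute ticks

  def beat!(now = Time.now)
    @last_tick = now
  end

  def stale?(now = Time.now)
    @last_tick.nil? || (now - @last_tick) > STALE_AFTER
  end
end

hb = SchedulerHeartbeat.new
hb.beat!(Time.now - 300)
hb.stale?  # true: the scheduler has stopped ticking
```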
Seen in¶
- sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing — canonical wiki introduction. PlanetScale's Rails control plane runs paired scheduler jobs for database creation, branch creation, schema deployment, and backups. The authoritative state is in MySQL rows; the scheduler reconciles the Redis-backed Sidekiq queue against that state. "As long as the scheduler job is running, we could dump our entire queue and still recover."