PlanetScale — How we made PlanetScale's background jobs self-healing¶
Mike Coutermarsh's third wiki ingest (companion to his 2022-01-18 Rails-CI post and 2024-04-04 schema-change-workflow post) canonicalises PlanetScale's core design discipline for background jobs in the application tier: the job queue is derivable from authoritative database state, not the source of truth. A paired scheduler job runs periodically and re-enqueues any work the database says still needs doing, so the system recovers automatically if the entire Redis queue is lost.
Summary¶
PlanetScale's Rails API runs Sidekiq jobs for critical user-facing actions — database creation, branch creation, schema deployment, backups. Coutermarsh names two hard requirements set at system design time:
- "If we lose all data in the queues at any time, we can recover without any loss in functionality."
- "If a single job fails, it will be automatically re-run."
The motivating framing is "not if, but when" — Redis data loss, dropped jobs, and silent failures will happen; the architecture must treat them as expected events rather than incidents.
The central architectural primitive: for each job, there is another job whose responsibility is to schedule it to run. The "scheduler" job runs on a cron-like cadence (every minute, every 5 minutes) and queries the application's authoritative database (MySQL, via PlanetScale's own product) for records in a state that implies work is pending. For each such record, it enqueues the work job. Because the check is from database state — not from queue state — flushing the queue is non-destructive: the next scheduler run reconstitutes it.
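The primitive can be shown as a minimal, dependency-free sketch: the in-memory `DB` array and `QUEUE` array stand in for the control-plane MySQL and Redis, and the method names follow the post's `ScheduleDatabaseJobs`/`DatabaseCreationJob` naming (the bodies are assumptions of this sketch, not the post's code).

```ruby
# Toy model of the paired-scheduler pattern: the DB is authoritative,
# the queue is a derivative that each scheduler tick can rebuild.
DB    = [{ id: 1, state: "pending" }, { id: 2, state: "ready" }]
QUEUE = []  # stands in for Redis/Sidekiq

# The work job: re-checks DB state first (idempotence), then mutates it.
def database_creation_job(id)
  record = DB.find { |r| r[:id] == id }
  return unless record && record[:state] == "pending"  # early exit
  record[:state] = "ready"
end

# The scheduler job: derives pending work from DB state, never queue state.
def schedule_database_jobs
  DB.select { |r| r[:state] == "pending" }.each { |r| QUEUE << r[:id] }
end

schedule_database_jobs  # tick 1 builds the queue from DB state
QUEUE.clear             # simulate total Redis data loss
schedule_database_jobs  # tick 2 reconstitutes it from the same DB state
QUEUE.each { |id| database_creation_job(id) }
```

Because the scheduler reads only the database, flushing `QUEUE` between ticks changes nothing: the next tick re-derives exactly the pending work.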
"We've put this decision to the test a couple of times already this year. We were able to dump our queues entirely without impacting our user experience."
The post walks through four complementary mechanisms:
- Paired scheduler jobs that derive work from DB state (the core pattern).
- Bulk enqueue via `perform_bulk` to avoid Redis round-trip amplification at scale.
- Jittered scheduling via `perform_with_jitter` to spread out bursts of simultaneous jobs hitting external APIs.
- Three ways to handle the duplicate-job problem that paired-scheduler composition introduces: (a) early-exit on DB state re-check, (b) DB row locks around non-idempotent mutations, (c) Sidekiq Enterprise `unique_for:` at the framework layer.
The post also reuses the already-canonical Flipper-gated middleware from Elom Gomez's sibling 2022 post, here to disable scheduler jobs themselves during an incident without a deploy.
Key takeaways¶
- State is authoritative in the database, not the job queue. When a user triggers "create database," the application synchronously writes a `databases` row with `state: pending` and enqueues `DatabaseCreationJob`. Whichever arrives first doesn't matter — the row is the source of truth. The queue is an optimisation for low-latency execution, not a state-of-record. This inverts the common "queue is the integration point" mental model: the queue here is a derivative. (canonicalised as concepts/state-in-database-not-queue)
- Paired scheduler jobs reconstruct the queue from DB state. For every `FooJob`, there's a `ScheduleFooJobs` job that runs periodically and scans the DB for records implying `FooJob` should run. This is a reconciler pattern applied to async jobs — the scheduler brings observed state (queued jobs) closer to desired state (DB-implied pending work) on each tick. "As long as the scheduler job is running, we could dump our entire queue and still recover."
- Idempotence is a load-bearing assumption of the pattern. Because any job can re-fire (user-triggered enqueue + scheduler re-enqueue + retry), every job's first operation is a state re-check: "store state in our database and quickly exit a job if it no longer needs to be run." Non-idempotent sections are protected with `backup.with_lock { ... }` database row locks. Sidekiq Enterprise's `unique_for: 1.minute, unique_until: :start` provides a framework-layer guard against duplicate enqueues. Three layers of protection, each for a different failure class.
- Bulk enqueue matters at scale. At low volume ("thousands"), enqueuing jobs one at a time works. Past that, "we started spending a lot of time sending individual Redis requests" — so they moved to `BackupJob.perform_bulk(backup_policies.pluck(:id))` inside an `in_batches do |backup_policies|` loop, in batches of 1,000. One Redis command per batch, not one per job: an O(N)-to-Redis pattern becomes O(N/1000).
- Jitter prevents downstream thundering herds. `CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes)` is a custom helper: `set(wait: rand(min_wait..max_wait)).perform_later(*args)`. When the scheduler enqueues 10,000 `CleanUpJob`s in bulk, without jitter they all hit Redis, then the workers, then the downstream external API in the same ~1-second window. Spreading them over 30 minutes keeps each recipient's rate within its limit.
- Duplicate-job handling has three compositional layers, each solving a distinct failure mode: `return unless user.pending?` — idempotence at the application level (cheapest); `backup.with_lock do ... end` — a database-level mutex around the mutation (correct under concurrent workers); `sidekiq_options unique_for: 1.minute` — framework-level deduplication (prevents an enqueue storm, not execution overlap).
- Scheduler jobs are themselves Sidekiq jobs, gated by the Flipper kill-switch. The `SCHEDULED_JOBS.key?(klass)` middleware check from Gomez's sibling post disables scheduler jobs on demand: "added the ability to stop our scheduled jobs from running at any time. This has come in useful during an incident where we've wanted control over a specific job type." The scheduler itself is subject to the same deploy-less operational lever as any other job.
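The cheapest of the three dedup layers, the early exit, can be simulated without Sidekiq by running the same job twice and observing that the second invocation is a no-op. The `User` struct, `welcome_email_job` name, and `RUNS` counter are illustrative stand-ins modelled on the post's `return unless user.pending?` check.

```ruby
# Simulates layer (a): a job whose first act is a DB-state re-check, so
# duplicate enqueues (user-triggered + scheduler + retry) are harmless.
User = Struct.new(:id, :state) do
  def pending?
    state == "pending"
  end
end

RUNS = []  # records how many times the real work actually executed

def welcome_email_job(user)
  return unless user.pending?  # idempotent early exit
  RUNS << user.id              # the non-idempotent work happens once
  user.state = "welcomed"
end

user = User.new(1, "pending")
welcome_email_job(user)  # first run does the work and flips the state
welcome_email_job(user)  # duplicate run sees state != pending and exits
```

The DB row lock and `unique_for` layers guard the cases this can't: concurrent workers racing the same record, and enqueue storms upstream of execution.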
Architectural numbers and primitives¶
- Tick cadence: the worked example uses "every minute" (`ScheduleDatabaseJobs`) and "every 5 minutes" (`ScheduleBackupJobs`). Cadence is a tunable per scheduler — higher for user-facing create paths, lower for periodic maintenance.
- Batch size: 1,000 jobs per `perform_bulk` call, illustrated with `in_batches` cursor iteration.
- Jitter window example: 30 minutes `max_wait`; the default in the `MAX_WAIT` constant is 1 minute.
- Unique-for example: `unique_for: 1.minute, unique_until: :start` on `CheckDeploymentStatusJob` — deduplication holds until the job starts, not until it finishes, so long-running execution doesn't block the next legitimate enqueue.
- Queue used: scheduler jobs run on queue `"background"` — isolated from latency-sensitive queues so scheduler latency doesn't compete with user-triggered work.
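The batch-size arithmetic can be checked with a plain-Ruby stand-in: `each_slice` replaces ActiveRecord's `in_batches` cursor, and counting slices replaces the real `BackupJob.perform_bulk` Redis calls (both substitutions are assumptions of this sketch).

```ruby
# One "Redis command" per batch of 1,000 ids, instead of one per job.
BATCH_SIZE = 1_000

def perform_bulk_commands(ids, batch_size: BATCH_SIZE)
  # Each slice would be one perform_bulk(batch) call, i.e. one Redis
  # round-trip, so counting slices counts the round-trips.
  ids.each_slice(batch_size).count
end
```

For 10,000 ids this yields 10 round-trips rather than 10,000: the scheduler-tick cost drops from O(N) to O(N/batch).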
Systems¶
- Sidekiq — the background-job substrate; the post uses its `perform_async`, `perform_bulk`, `perform_later`/`set(wait:)`, client middleware, and Sidekiq Enterprise `unique_for` APIs.
- Redis — Sidekiq's queue substrate; canonicalised here as an ephemeral optimisation, not the state of record. The whole post is a discipline for tolerating Redis data loss.
- Ruby on Rails — the framework; ActiveJob's `perform_later(*args)` API, ActiveRecord `with_lock`, and `in_batches` cursor iteration are used.
- Flipper — referenced via Gomez's sibling post for the kill-switch middleware that can disable scheduler jobs.
Concepts¶
- State in database, not queue — the foundational invariant: the database row (`databases.state = 'pending'`) is the source of truth; the queued job is a derivative that can be reconstructed.
- Self-healing job queue — the overall system property: the queue recovers from total data loss via re-derivation from the authoritative store. Named explicitly in the post title.
- Idempotent job design — the caller-side discipline that makes paired-scheduler composition safe: every job's first operation checks whether the work still needs doing.
- Sidekiq unique jobs — Sidekiq Enterprise feature; `unique_for` + `unique_until` framework-level dedup on the enqueue path.
- Thundering herd — the failure mode `perform_with_jitter` prevents when the scheduler bulk-enqueues many jobs that all call the same external API.
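The jitter half of this reduces to one line of scheduling arithmetic. A gem-free sketch, where returning the computed delay stands in for the real `set(wait: ...).perform_later(*args)` call (that substitution, and the helper name, are assumptions of this sketch):

```ruby
# Jittered scheduling: each job gets a random delay in min_wait..max_wait,
# spreading a bulk enqueue across a window instead of one burst.
MAX_WAIT = 60  # seconds; the post's default MAX_WAIT is 1 minute

def jittered_delay(max_wait: MAX_WAIT, min_wait: 0)
  # A real helper would do set(wait: delay).perform_later(*args);
  # returning the delay makes the spread observable here.
  rand(min_wait..max_wait)
end

# 10,000 jobs spread over the post's 30-minute example window.
delays = Array.new(10_000) { jittered_delay(max_wait: 1800) }
```

With 10,000 delays drawn uniformly over 30 minutes, the downstream API sees roughly 5-6 requests per second instead of one 10,000-request burst.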
Patterns¶
- Paired scheduler–reconciler — canonical new wiki pattern introduced by this ingest. For each user-triggered job, install a scheduled job that queries authoritative DB state for pending work and re-enqueues. Reconciler shape applied to async-job frameworks.
- Bulk job enqueue — amortise Redis round-trips with a single `perform_bulk([ids...])` call over batches of N (e.g. 1,000). Reduces scheduler-tick cost from O(N) to O(N/batch).
- Jittered job scheduling — when bulk-enqueuing, spread execution over a window `rand(0..max_wait)` to avoid a thundering herd at downstream recipients.
- Feature-flagged job-enqueue rejection — sibling pattern used here to disable scheduler jobs on demand.
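The enqueue-rejection pattern has the shape of a Sidekiq client middleware; a dependency-free sketch, where a plain hash stands in for Flipper state and a method stands in for the middleware chain (all names here are assumptions modelled on the `SCHEDULED_JOBS.key?(klass)` check, not the post's code):

```ruby
# Feature-flagged job-enqueue rejection: a client-side gate that drops
# scheduler-job enqueues when their flag is disabled, with no deploy.
SCHEDULED_JOBS = { "ScheduleBackupJobs" => true }  # klass => known scheduler job
FLAGS          = { "ScheduleBackupJobs" => false } # stand-in for Flipper state

ENQUEUED = []  # stands in for the Redis queue

def enqueue(klass)
  # Gate only known scheduler jobs; everything else passes untouched.
  # (In real Sidekiq client middleware, not yielding drops the job.)
  return false if SCHEDULED_JOBS.key?(klass) && FLAGS[klass] == false
  ENQUEUED << klass
  true
end

enqueue("ScheduleBackupJobs")   # dropped by the kill switch
enqueue("DatabaseCreationJob")  # unknown to the gate, passes through
```

Because the gate sits on the enqueue path, flipping the flag back on lets the very next scheduler tick resume re-deriving work from the database.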
Operational numbers¶
None disclosed. No incident retrospective counts, no volume numbers for PlanetScale's actual Sidekiq queues, no latency / throughput measurements on scheduler ticks, no quantification of the "couple of times" Redis-dump recoveries named in the text.
Caveats¶
- Schedule cadence vs user-perceived latency. A 1-minute scheduler tick means a user-triggered enqueue that races a Redis flush has up to 60 seconds of observable "nothing happening" before the scheduler re-enqueues. For create-a-database paths this is in the acceptable range; for tighter SLOs it isn't. The post doesn't discuss the SLO axis.
- Scheduler-job cost is always-on, even when there's nothing to do. `Database.pending.find_each` is issued every minute regardless of whether any databases are pending. At scale, the always-on cost of the scheduler tier may exceed the cost of the work it dispatches; the post doesn't address this.
- DB queries must be indexed for the state column. `Database.pending` implies a scan over `databases WHERE state = 'pending'`. At scale this must be index-backed, otherwise the scheduler becomes a recurring table scan. Not discussed.
- Idempotence is a burden on every job author. The three-layer dedup composition (early-exit, DB lock, unique-for) shifts complexity from the infra tier to every job implementation. The post presents this as uncontroversial, but at scale the discipline is load-bearing and easily violated.
- Works only when state of record is centralised. PlanetScale's model has one authoritative database (their own control-plane MySQL). Applications with work spread across multiple data stores can't write the equivalent scheduler query without first consolidating state.
- No discussion of scheduler-of-scheduler failure modes. If the scheduler itself fails (Sidekiq cluster outage, Sidekiq-cron misconfig, deploy bug), nothing re-enqueues. The system's self-healing property depends on the scheduler running; there's no meta-scheduler for the scheduler. The kill switch can disable a scheduler, but there's no paired mechanism to detect a stopped scheduler.
- 2022-era Sidekiq APIs. `perform_bulk` is Sidekiq 6.x+; `unique_for` is Sidekiq Enterprise (paid). The shape is durable; the APIs have evolved.
- Application-tier detail, not PlanetScale product internals. Same framing caveat as Gomez's sibling post — this is how PlanetScale's Rails control plane operates, orthogonal to their MySQL/Vitess product.
Source¶
- Original: https://planetscale.com/blog/how-we-made-planetscale-background-jobs-self-healing-with-sidekiq
- Raw markdown: `raw/planetscale/2026-04-21-how-we-made-planetscales-background-jobs-self-healing-d6ead500.md`
Related¶
- systems/sidekiq
- systems/redis
- systems/ruby-on-rails
- systems/flipper-gem
- concepts/self-healing-job-queue
- concepts/state-in-database-not-queue
- concepts/idempotent-job-design
- concepts/sidekiq-unique-jobs
- concepts/thundering-herd
- patterns/paired-scheduler-reconciler
- patterns/bulk-job-enqueue
- patterns/jittered-job-scheduling
- patterns/feature-flagged-job-enqueue-rejection
- companies/planetscale