PLANETSCALE 2022-02-17 Tier 3

PlanetScale — How we made PlanetScale's background jobs self-healing

Mike Coutermarsh's third wiki ingest (companion to his 2022-01-18 Rails-CI post and 2024-04-04 schema-change-workflow post) canonicalises PlanetScale's core design discipline for background jobs in the application tier: the job queue is derivable from authoritative database state, not the source of truth. A paired scheduler job runs periodically and re-enqueues any work the database says still needs doing, so the system recovers automatically if the entire Redis queue is lost.

Summary

PlanetScale's Rails API runs Sidekiq jobs for critical user-facing actions — database creation, branch creation, schema deployment, backups. Coutermarsh names two hard requirements set at system design time:

  1. "If we lose all data in the queues at any time, we can recover without any loss in functionality."
  2. "If a single job fails, it will be automatically re-run."

The motivating framing is "not if, but when": Redis data loss, dropped jobs, and silent failures will happen; the architecture must treat them as expected events rather than incidents.

The central architectural primitive: for each job, there is another job whose responsibility is to schedule it to run. The "scheduler" job runs on a cron-like cadence (every minute, every 5 minutes) and queries the application's authoritative database (MySQL, via PlanetScale's own product) for records in a state that implies work is pending. For each such record, it enqueues the work job. Because the check is from database state — not from queue state — flushing the queue is non-destructive: the next scheduler run reconstitutes it.
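The core primitive can be sketched in a few lines of pure Ruby — a simulation only, with illustrative names, no real Sidekiq or Redis: the scheduler tick derives pending work from authoritative rows, so the queue can be rebuilt after total loss.

```ruby
# The "authoritative database": rows with a state column (illustrative data).
DATABASES = [
  { id: 1, state: "pending" },
  { id: 2, state: "ready" },
  { id: 3, state: "pending" },
]

# The "queue": ephemeral, can be lost at any time.
QUEUE = []

# One scheduler tick: query DB state for pending work and (re-)enqueue it.
def schedule_database_jobs
  DATABASES.select { |row| row[:state] == "pending" }
           .each   { |row| QUEUE << [:database_creation_job, row[:id]] }
end

QUEUE.clear             # simulate losing all queue data
schedule_database_jobs  # next tick reconstitutes the queue from DB state
puts QUEUE.inspect      # prints [[:database_creation_job, 1], [:database_creation_job, 3]]
```

The non-destructive-flush property falls out directly: nothing in the scheduler reads queue state, so clearing the queue changes nothing about the next tick's output.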

"We've put this decision to the test a couple of times already this year. We were able to dump our queues entirely without impacting our user experience."

The post walks through four complementary mechanisms:

  • Paired scheduler jobs that derive work from DB state (the core pattern).
  • Bulk enqueue via perform_bulk to avoid Redis-round-trip amplification at scale.
  • Jittered scheduling via perform_with_jitter to spread out bursts of simultaneous jobs hitting external APIs.
  • Three ways to handle the duplicate-job problem that paired-scheduler composition introduces: (a) early-exit on DB state re-check, (b) DB row locks around non-idempotent mutations, (c) Sidekiq Enterprise unique_for: at the framework layer.

The post also reuses the already-canonical Flipper-gated middleware from Elom Gomez's sibling 2022 post, here to disable scheduler jobs themselves during an incident without a deploy.

Key takeaways

  1. State is authoritative in the database, not the job queue. When a user triggers "create database," the application synchronously writes a databases row with state: pending and enqueues DatabaseCreationJob. Whichever arrives first doesn't matter — the row is the source of truth. The queue is an optimisation for low-latency execution, not a state-of-record. This inverts the common "queue is the integration point" mental model: the queue here is a derivative. (canonicalised as concepts/state-in-database-not-queue).
  2. Paired scheduler jobs reconstruct the queue from DB state. For every FooJob, there's a ScheduleFooJobs job that runs periodically and scans the DB for records implying FooJob should run. This is a reconciler pattern applied to async jobs — the scheduler brings observed state (queued jobs) closer to desired state (DB-implied pending work) on each tick. "As long as the scheduler job is running, we could dump our entire queue and still recover."
  3. Idempotence is a load-bearing assumption of the pattern. Because any job can re-fire (user-triggered enqueue + scheduler re-enqueue + retry), every job's first operation is a state-re-check: "store state in our database and quickly exit a job if it no longer needs to be run." Non-idempotent subsections are protected with backup.with_lock { ... } database row locks. Sidekiq Enterprise's unique_for: 1.minute, unique_until: :start provides a framework-layer guard against duplicate enqueues. Three layers of protection, each for a different failure class.
  4. Bulk enqueue matters at scale. At low volume ("thousands"), enqueuing jobs one at a time works. Past that, "we started spending a lot of time sending individual Redis requests" — and they moved to BackupJob.perform_bulk(backup_policies.pluck(:id)) with an in_batches do |backup_policies| loop, batches of 1,000. One Redis command per batch, not one per job. An O(N)-to-Redis pattern becomes O(N/1000).
  5. Jitter prevents downstream thundering herds. CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes) is a custom helper: set(wait: rand(min_wait..max_wait)).perform_later(*args). When the scheduler enqueues 10,000 CleanUpJobs in bulk, without jitter they all hit Redis and then the workers and then the downstream external API in the same ~1-second window. Spreading them over 30 minutes keeps each recipient's rate within its limit.
  6. Duplicate-job handling has three compositional layers, each solving a distinct failure mode:
     • return unless user.pending? — idempotence at the application level (cheapest).
     • backup.with_lock do ... end — database-level mutex around the mutation (correct under concurrent workers).
     • sidekiq_options unique_for: 1.minute — framework-level deduplication (prevents enqueue storms, not execution overlap).
  7. Scheduler jobs are themselves Sidekiq jobs, gated by the Flipper kill-switch. The SCHEDULED_JOBS.key?(klass) middleware check from Gomez's sibling post disables scheduler jobs on demand: "added the ability to stop our scheduled jobs from running at any time. This has come in useful during an incident where we've wanted control over a specific job type." The scheduler itself is subject to the same deploy-less operational lever as any other job.
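The three-layer composition can be sketched as a pure-Ruby simulation (illustrative names and data; the real code uses ActiveRecord's with_lock and Sidekiq Enterprise's unique_for rather than a Mutex and a Set):

```ruby
require "set"

BACKUP   = { id: 42, state: "pending", lock: Mutex.new }
RUNS     = []        # record of work actually performed
ENQUEUED = Set.new   # layer 3 stand-in: framework-level dedup on the enqueue path

def enqueue_backup_job(id)
  return :deduped unless ENQUEUED.add?(id)  # like unique_for: reject duplicate enqueues
  run_backup_job(id)
end

def run_backup_job(id)
  return :noop unless BACKUP[:state] == "pending"  # layer 1: early-exit on DB state
  BACKUP[:lock].synchronize do                     # layer 2: row-lock stand-in
    return :noop unless BACKUP[:state] == "pending"  # re-check state under the lock
    RUNS << id
    BACKUP[:state] = "complete"
  end
end

3.times { enqueue_backup_job(42) }  # duplicate triggers collapse to one run
puts RUNS.inspect                   # prints [42]
```

Each layer alone is insufficient: the early-exit races under concurrent workers, the lock doesn't stop enqueue storms, and enqueue-dedup doesn't cover a re-fire after the dedup window — hence the composition.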

Architectural numbers and primitives

  • Tick cadence: worked example uses "every minute" (ScheduleDatabaseJobs) and "every 5 minutes" (ScheduleBackupJobs). Cadence is a tunable per scheduler — higher for user-facing create paths, lower for periodic maintenance.
  • Batch size: 1,000 jobs per perform_bulk call, illustrated with in_batches cursor iteration.
  • Jitter window example: 30 minutes max_wait; default in the MAX_WAIT constant is 1 minute.
  • Unique-for example: unique_for: 1.minute, unique_until: :start on CheckDeploymentStatusJob — deduplication holds until the job starts, not until it finishes, so long-running execution doesn't block the next legitimate enqueue.
  • Queue used: scheduler jobs run on queue "background" — isolated from latency-sensitive queues so scheduler latency doesn't compete with user-triggered work.
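The batching and jitter numbers above compose into a small sketch — a pure-Ruby simulation with illustrative names, standing in for in_batches + perform_bulk + the perform_with_jitter helper:

```ruby
MAX_WAIT = 30 * 60  # seconds; the post's example window is max_wait: 30.minutes

# Slice ids into batches of 1,000 so each batch costs one simulated
# Redis round trip, and jitter each job's start within the window.
def bulk_enqueue_with_jitter(ids, batch_size: 1_000)
  ids.each_slice(batch_size).map do |batch|
    batch.map { |id| { id: id, wait: rand(0..MAX_WAIT) } }
  end
end

batches = bulk_enqueue_with_jitter((1..2_500).to_a)
puts batches.size  # prints 3 -- 2,500 ids in batches of 1,000
```

This is the O(N) → O(N/1000) reduction in round trips, with the jitter keeping the 2,500 simulated jobs from landing on a downstream API in the same instant.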

Systems

  • Sidekiq — the background-job substrate; the post uses its perform_async, perform_bulk, perform_later / set(wait:), client middleware, and Sidekiq Enterprise unique_for APIs.
  • Redis — Sidekiq's queue substrate; canonicalised here as an ephemeral optimisation, not the state of record. The whole post is a discipline for tolerating Redis data loss.
  • Ruby on Rails — the framework; ActiveJob's perform_later(*args) API, ActiveRecord with_lock, and in_batches cursor iteration are used.
  • Flipper — referenced via Gomez's sibling post for the kill-switch middleware that can disable scheduler jobs.

Concepts

  • State in database, not queue — the foundational invariant: the database row (databases.state = 'pending') is the source of truth; the queued job is a derivative that can be reconstructed.
  • Self-healing job queue — the overall system property: the queue recovers from total data loss via re-derivation from the authoritative store. Named explicitly in the post title.
  • Idempotent job design — the caller-side discipline that makes paired-scheduler composition safe: every job's first operation checks whether the work still needs doing.
  • Sidekiq unique jobs — Sidekiq Enterprise feature; unique_for + unique_until framework-level dedup on the enqueue path.
  • Thundering herd — the failure mode perform_with_jitter prevents when the scheduler bulk-enqueues many jobs that all call the same external API.

Patterns

  • Paired scheduler–reconciler — canonical new wiki pattern introduced by this ingest. For each user-triggered job, install a scheduled job that queries authoritative DB state for pending work and re-enqueues. Reconciler shape applied to async-job frameworks.
  • Bulk job enqueue — amortise Redis round-trips with a single perform_bulk([ids...]) call over batches of N (e.g. 1,000). Reduces scheduler-tick cost from O(N) to O(N/batch).
  • Jittered job scheduling — when bulk-enqueuing, spread execution over a window rand(0..max_wait) to avoid thundering herd at downstream recipients.
  • Feature-flagged job-enqueue rejection — sibling pattern used here to disable scheduler jobs on demand.

Operational numbers

None disclosed. No incident retrospective counts, no volume numbers for PlanetScale's actual Sidekiq queues, no latency / throughput measurements on scheduler ticks, no quantification of the "couple of times" Redis-dump recoveries named in the text.

Caveats

  • Schedule cadence vs user-perceived latency. A 1-minute scheduler tick means a user-triggered enqueue that races a Redis flush has up to 60 seconds of observable "nothing happening" before the scheduler re-enqueues. For create-a-database paths this is in the acceptable range; for tighter SLOs it isn't. The post doesn't discuss the SLO axis.
  • Scheduler-job cost is always-on, even when there's nothing to do. Database.pending.find_each is issued every minute regardless of whether any databases are pending. At scale, the always-on cost of the scheduler tier may exceed the cost of the work it dispatches; the post doesn't address this.
  • DB queries must be indexed for the state column. Database.pending implies a scan over databases WHERE state = 'pending'. At scale this must be index-backed, otherwise the scheduler becomes a recurring table scan. Not discussed.
  • Idempotence is a burden on every job author. The three-layer dedup composition (early-exit, DB lock, unique-for) shifts complexity from the infra tier to every job implementation. The post presents this as uncontroversial but at scale the discipline is load-bearing and easily violated.
  • Works only when state of record is centralised. PlanetScale's model has one authoritative database (their own control-plane MySQL). Applications with work spread across multiple data stores can't write the equivalent scheduler query without first consolidating state.
  • No discussion of scheduler-of-scheduler failure modes. If the scheduler itself fails (Sidekiq cluster outage, Sidekiq-cron misconfig, deploy bug), nothing re-enqueues. The system's self-healing property depends on the scheduler running; there's no meta-scheduler for the scheduler. The kill switch can disable a scheduler, but there's no paired mechanism to detect a stopped scheduler.
  • 2022-era Sidekiq APIs. perform_bulk is 6.x+, unique_for is Sidekiq Enterprise (paid). Shape is durable; APIs have evolved.
  • Application-tier detail, not PlanetScale product internals. Same framing caveat as Gomez's sibling post — this is how PlanetScale's Rails control plane operates, orthogonal to their MySQL/Vitess product.
