CONCEPT

Self-healing job queue

A self-healing job queue is an async-job system architected so that losing the queue's contents — whether from Redis data loss, operator mistake, corruption, or a drop in transit — does not cause work to be lost. The system recovers automatically without operator intervention, because the work was never only in the queue to begin with.

The system property

Verbatim from the canonical source:

"If we lose all data in the queues at any time, we can recover without any loss in functionality."
"If a single job fails, it will be automatically re-run."

Property #1 is the self-healing property; property #2 is per-job retry, which is conventional Sidekiq behavior. Self-healing is the systemic property: the system keeps working even under total queue-layer loss.

How it's achieved

Self-healing requires three pieces working together:

State in database, not queue — the authoritative state lives in a durable store (typically the application DB). The queue holds derivatives.
Paired scheduler-reconciler — a scheduler job runs periodically, queries the authoritative store for pending work, and re-enqueues it. This is the recovery mechanism.
Idempotent job design — because jobs can be enqueued multiple times (user + scheduler + retry), every job must tolerate being run more than once.
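The three pieces can be sketched with in-memory stand-ins. This is a minimal illustration, not PlanetScale's code: `Store`, `Row`, `perform`, `scheduler_tick`, and `drain` are hypothetical names, a plain Array plays the role of the queue, and no real Sidekiq or ActiveRecord API is used.

```ruby
# Hypothetical in-memory model of the three pieces. In a real system,
# Store would be the application DB and the queue would live in Redis.
Row = Struct.new(:id, :status)

# Piece 1: authoritative state lives in the store, not in the queue.
class Store
  attr_reader :rows, :effects

  def initialize
    @rows = {}
    @effects = [] # records each side effect that actually ran
  end

  def create_pending
    id = @rows.size + 1
    @rows[id] = Row.new(id, :pending)
    id
  end

  def pending_ids
    @rows.values.select { |row| row.status == :pending }.map(&:id)
  end
end

# Piece 3: idempotent job. It re-checks the authoritative row and
# does nothing if the work has already been done.
def perform(store, id)
  row = store.rows[id]
  return if row.nil? || row.status == :done

  store.effects << id
  row.status = :done
end

# Piece 2: scheduler. It re-derives queue contents from the store.
def scheduler_tick(store, queue)
  store.pending_ids.each { |id| queue << id }
end

# Stand-in for a worker pool: run every queued job in order.
def drain(store, queue)
  perform(store, queue.shift) until queue.empty?
end

store = Store.new
queue = []
id = store.create_pending
queue << id                  # enqueue from the user action
scheduler_tick(store, queue) # the scheduler enqueues the same row again
drain(store, queue)
p store.effects              # the side effect ran once despite two enqueues
```

The guard in `perform` is what makes the duplicate enqueue harmless. In a real database the check and the status update would also need to be atomic (for example under a row lock), which this single-threaded sketch does not model.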

No one piece is sufficient:

Idempotence without a scheduler: if the queue loses the job, nothing re-enqueues it; the row stays pending forever.
Scheduler without idempotence: duplicates pile up; user-visible actions happen multiple times.
Authoritative-DB without scheduler: pendingness is recorded but nothing checks it.

Comparison: conventional Sidekiq vs self-healing Sidekiq

Conventional: a user action calls perform_async. If Redis is flushed, the job is lost and the user action has no effect; operator intervention (or a very angry support ticket) is required to re-dispatch the work.

Self-healing: a user action writes a DB row and calls perform_async. If Redis is flushed, the scheduler notices on its next tick that the DB row is still pending and re-enqueues it; the user action eventually takes effect, delayed by roughly one scheduler interval, with no intervention required.
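The two flows can be compared side by side in a small simulation. This is a sketch under stated assumptions: a Hash named `db` stands in for the application database, an Array named `queue` stands in for Redis, `queue.clear` models the flush, and every method name is hypothetical.

```ruby
# Hypothetical model: `db` is the authoritative store, `queue` is Redis.

def conventional_action(queue, id)
  queue << id # the queue entry is the only record of the work
end

def self_healing_action(db, queue, id)
  db[id] = :pending # authoritative record is written first
  queue << id       # the queue entry is only a derivative
end

# Scheduler: re-enqueue every row the store still marks as pending.
def scheduler_tick(db, queue)
  db.each { |id, status| queue << id if status == :pending }
end

# Worker: run queued jobs, skipping rows that are already done.
def work_off(db, queue, done)
  until queue.empty?
    id = queue.shift
    next if db[id] == :done # idempotence guard

    done << id
    db[id] = :done if db.key?(id)
  end
end

# Returns the list of ids whose work actually ran.
def simulate(self_healing:)
  db = {}
  queue = []
  done = []
  if self_healing
    self_healing_action(db, queue, 1)
  else
    conventional_action(queue, 1)
  end
  queue.clear # Redis is flushed before any worker picks up the job
  scheduler_tick(db, queue) if self_healing
  work_off(db, queue, done)
  done
end

p simulate(self_healing: false) # prints []
p simulate(self_healing: true)  # prints [1]
```

In the conventional run nothing remains to say the work was ever requested, so nothing can re-dispatch it. In the self-healing run the pending row survives the flush and the next tick rebuilds the queue from it.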

Scope

Self-healing here is queue-layer fault tolerance. It protects against:

Redis data loss (operator flush, upgrade mishap, corruption).
Dropped perform_async calls (a network failure after the DB write commits but before the Redis write completes).
Silent queue drops (misconfigured worker pool, middleware that unintentionally discards jobs).
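The dropped-enqueue case is worth a closer look, because the failure happens between two writes. The sketch below assumes a hypothetical `FlakyQueue` client that raises when the queue is unreachable; `EnqueueFailed`, `user_action`, and `scheduler_tick` are likewise illustrative names rather than Sidekiq APIs.

```ruby
class EnqueueFailed < StandardError; end

# Hypothetical stand-in for a queue client whose connection may be down.
class FlakyQueue
  attr_reader :jobs
  attr_accessor :down

  def initialize
    @jobs = []
    @down = false
  end

  def push(id)
    raise EnqueueFailed, "queue unreachable" if @down

    @jobs << id
  end
end

def user_action(db, queue, id)
  db[id] = :pending # step 1: the DB write commits
  begin
    queue.push(id)  # step 2: may fail after step 1 has succeeded
  rescue EnqueueFailed
    # Tolerated here: the pending row means a later tick will re-enqueue.
  end
end

# Scheduler: re-enqueue every row the store still marks as pending.
def scheduler_tick(db, queue)
  db.each { |id, status| queue.push(id) if status == :pending }
end

db = {}
queue = FlakyQueue.new
queue.down = true
user_action(db, queue, 7)  # the enqueue is dropped, the row is still written
queue.down = false
scheduler_tick(db, queue)  # the next tick recovers the job
p queue.jobs               # prints [7]
```

Whether the enqueue error is swallowed, logged, or surfaced is a design choice this sketch does not settle. The point is only that the user action does not need to fail, because the pending row already guarantees a later attempt.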

It does not protect against:

Database data loss. The DB is the source of truth; if it's lost, the work is lost. Protection at this layer requires DB-layer disaster recovery (backups, PITR, multi-region replication).
Bugs in job code. A FooJob that crashes consistently will crash consistently on every re-enqueue. Self-healing recovers from infrastructure loss, not application-logic errors.
Scheduler failure. If the scheduler job itself stops running, the healing stops. The PlanetScale architecture has no meta-healer for the healer.

Operational verification

The property is testable: drop the queue, watch the system recover. From the canonical source:

"We've put this decision to the test a couple of times already this year. We were able to dump our queues entirely without impacting our user experience."

Teams that have this property can treat queue-layer maintenance (Redis upgrades, failovers, reprovisioning) as non-incident operations. Teams that don't have it treat queue-layer maintenance as customer-affecting changes.

Relationship to other fault-tolerance concepts

Fault-tolerant long-running workflow (Temporal / Cadence / workflow engines) — same goal at a different altitude. Workflow engines store workflow state durably (Postgres, etc.) and replay from state; the "queue" concept is absent but the same source-of-truth inversion applies.
Idempotency token — per-operation retry safety. Self-healing composes idempotence with re-derivation.
Outbox pattern — related: the application writes a DB row and an outbox row atomically; a separate publisher pushes outbox rows to the queue. Self-healing subsumes this when the scheduler reads directly from the DB state (no separate outbox required).

Seen in

— canonical wiki introduction. PlanetScale's Rails control plane is architected for the queue-data-loss recovery property, verified by operational experience ("tested this year, dumped queues, no user-facing impact"). The title of the post names the property explicitly; the body canonicalises the composition of authoritative DB state + paired scheduler + idempotence as the enabling architecture.