

State in database, not queue

The authoritative state-of-record for pending work lives in the application database, not in the job queue. The queued job is a derivative — an optimisation for low-latency execution — not a durable commitment. Consequence: losing the queue is a performance problem, not a correctness problem.

The invariant

When work needs doing, two things happen atomically enough for the invariant to hold:

  1. A row is written to the application database recording the intent (e.g. a row in the databases table with state: 'pending').
  2. A job is enqueued to execute the intent (e.g. DatabaseCreationJob.perform_async(id)).

Only #1 is the source of truth. #2 is an optimisation that lets the work run within worker latency (ms-scale) instead of within scheduler-tick latency (minute-scale).
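A minimal sketch of the two steps, under stated assumptions: DB and QUEUE are in-memory stand-ins for the databases table and the Redis-backed queue, and create_database is a hypothetical helper (in production, step 1 would be an INSERT and step 2 a DatabaseCreationJob.perform_async call):

```ruby
# Hedged sketch: DB stands in for the databases table, QUEUE for the
# Redis-backed job queue. Neither name comes from the source.
DB    = []
QUEUE = []

def create_database(name)
  # Step 1: record the intent durably -- this row is the source of truth.
  row = { id: DB.size + 1, name: name, state: "pending" }
  DB << row
  # Step 2: enqueue for low-latency execution -- advisory, safe to lose.
  QUEUE << { job: "DatabaseCreationJob", id: row[:id] }
  row[:id]
end
```

Losing QUEUE after this call leaves the pending row intact, so a scheduler tick can reconstruct the job from the row alone.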

The critical property: a scheduled job periodically re-derives step #2 from step #1. If #2 is lost — dropped perform_async, Redis flush, Redis corruption, operator mistake — the scheduler reconstructs it from the database.
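A hedged sketch of that re-derivation, again with in-memory stand-ins (the real version would be a scheduled job issuing a SQL query and calling perform_async):

```ruby
# Hedged sketch: re-deriving step #2 from step #1 after total queue loss.
DB    = [{ id: 1, state: "pending" }, { id: 2, state: "ready" }]
QUEUE = []  # pretend Redis was just flushed

def scheduler_tick(db, queue)
  # Every row still in a work-implying state gets its job re-enqueued.
  db.select { |row| row[:state] == "pending" }.each do |row|
    queue << { job: "DatabaseCreationJob", id: row[:id] }
  end
end

scheduler_tick(DB, QUEUE)  # the pending row's job is reconstructed
```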

Why this is not the default mental model

The default async-job mental model treats the queue as the integration point: "put something on the queue" is the action, and the queue's state is where pending work lives. That model assumes:

  • The queue is durable (survives crashes).
  • The enqueue call is reliable (can't be dropped).
  • Queue mutations are ordered and visible to workers.

In practice, Redis-backed queues have modest durability (AOF sync windows, replication lag, operator errors), perform_async is a fire-and-forget network call that can fail after the caller believes it succeeded, and multi-writer races can silently drop mutations. Treating the queue as the state of record implicitly trusts all of these properties.

The state-in-database model inverts the trust: the database is the only store whose durability is considered load-bearing. The queue is advisory.

How it works in practice

From sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing:

"The solution is storing the state in our PlanetScale database. When creating a database for a user, we also create a record in our databases table immediately. This record starts with a state set to pending."

"This allows us to have a scheduled job that runs once a minute and checks if any databases are in a pending state. If they are, that triggers the creation job to get enqueued again."

The scheduler query is the inverse of the database model's state machine: for each job class that performs terminal work, there is a predicate over rows — "this row is in a state implying the work should run." The scheduler iterates the rows matching that predicate and enqueues the corresponding job.
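One way to sketch that job-class-to-predicate mapping (all names here are illustrative, not from the source):

```ruby
# Hedged sketch: each job class paired with the predicate the scheduler
# evaluates on every tick to decide whether work is pending.
PENDING_WORK = {
  "DatabaseCreationJob" => ->(row) { row[:state] == "pending" },
  "DatabaseDeletionJob" => ->(row) { row[:state] == "deleting" },
}

def tick(rows, queue)
  PENDING_WORK.each do |job, pred|
    rows.select(&pred).each { |row| queue << [job, row[:id]] }
  end
end
```

In a real system each predicate would be an indexed SQL WHERE clause rather than an in-memory filter.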

Consequences

  • Queue data loss is non-destructive. "If we lose all data in the queues at any time, we can recover without any loss in functionality." The recovery mechanism is the next scheduler tick.
  • Jobs must be idempotent. Re-enqueue from scheduler + existing queue entry + Sidekiq retry can all fire for the same database row; every job must tolerate being run more than once. See concepts/idempotent-job-design.
  • Database schema has shape obligations. There must be a column (or expression) indexable for "is work pending?" The scheduler's query must be cheap to evaluate on every tick.
  • Throughput budget shifts to the database. The scheduler tick issues a query per minute per worker-class. At scale this is non-trivial; it requires indexes on state columns.
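The idempotency obligation can be sketched as a state guard at the top of the job — a simplified version with hypothetical names; real implementations also need a row lock or compare-and-swap to close the check-then-act race:

```ruby
DB = [{ id: 1, state: "pending" }]

# Hedged sketch: the job re-checks the row's state before doing work, so a
# scheduler re-enqueue, a stale queue entry, and a Sidekiq retry all collapse
# into one effective execution.
def database_creation_job(id)
  row = DB.find { |r| r[:id] == id }
  return :skipped unless row && row[:state] == "pending"
  # ... provision the database (elided) ...
  row[:state] = "done"  # terminal state stops future scheduler ticks re-firing
  :done
end
```

Calling it a second time for the same row is a no-op rather than a double-provision.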

An alternative framing: the queue is a materialised view over a database query. The database is the source; the queue is a cache with weaker durability. A stale or missing cache row doesn't lose data, just adds latency.

This framing generalises to other substrates:

  • Kubernetes Pod state: the spec is in etcd; the kubelet's observed state is a derivable view.
  • Outbox pattern: a DB row is written, a publisher later reads the row and publishes to Kafka; Kafka is downstream of the DB, not co-authoritative.
  • Event sourcing: the event log is the source; projections are derivables that can be rebuilt.

When the invariant fails

  • Work without a natural row. A one-off "send this specific email right now" job has no row in the database representing its pendingness. You can't scheduler-ify it without inventing a scheduled_emails table first.
  • Cross-store work. If pending work is implied by state in Redis + MySQL + an external vendor, the scheduler can't issue one authoritative query. Either consolidate state first, or accept loss of derivability.
  • State is destroyed by job execution. If FooJob deletes the row it was dispatched for, and state: 'pending' is the trigger, the scheduler relies on the job updating state to 'done' rather than deleting the row. Deletion semantics must be reconsidered.
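For the third failure mode, one common fix is a tombstone state: the job marks the row deleted instead of removing it, so the trigger state disappears but derivability survives. A hedged sketch with illustrative names:

```ruby
DB = [{ id: 1, state: "deleting" }]

def database_deletion_job(id)
  row = DB.find { |r| r[:id] == id }
  return :skipped unless row && row[:state] == "deleting"
  # ... tear down the real resource (elided) ...
  row[:state] = "deleted"  # tombstone, not a DELETE: the row remains queryable
  :done
end
```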
