
PATTERN Cited by 1 source

Region fallback on queue backlog

Problem

A regional allocation control plane is fed by a regional batch queue (patterns/queue-batching-amortizes-db-write-throughput). Under burst load — or when the consumer Worker degrades — the queue's lag grows beyond the staleness budget the control plane can tolerate. The control plane's view of "which resources are available in this region" becomes stale, and routing decisions based on the stale view risk the same overallocation / race-condition failure modes that motivated migrating off eventually-consistent stores in the first place.

The primary's data plane is healthy (running resources still serve traffic). Only the control-plane freshness is degraded.

Pattern

Each region designates a backup region. When the primary region's queue lag exceeds a threshold, the control plane switches its read view to the backup region's allocation state until the primary catches up. The data plane (resources already running in the primary) keeps serving; only new allocation reads are temporarily routed to the backup.

[primary region]                     [backup region]
   queue: lag >> threshold              queue: healthy
        |                                     ^
        |                                     |
        +---- control-plane reads switch ----+

   data plane: still serving requests
                    |
                    v
              [running resources in primary]
              (continue normally)

The fallback is bounded — when the primary's queue catches up (lag drops below threshold), the control-plane reads switch back.
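The read-view switch described above can be sketched as a single decision function. This is a minimal illustration, not the Browser Run implementation: the region names, the function name `choose_read_region`, and the 2-second threshold value are assumptions chosen for the example (the source only states that steady-state lag is well below 2 seconds).

```python
# Hypothetical threshold: the source does not publish the actual trigger value.
LAG_THRESHOLD_SECONDS = 2.0


def choose_read_region(
    primary: str,
    backup: str,
    primary_lag_seconds: float,
    threshold: float = LAG_THRESHOLD_SECONDS,
) -> str:
    """Return the region whose allocation state the control plane should read.

    Only the control-plane read view changes; resources already running in
    the primary are not touched by this decision.
    """
    if primary_lag_seconds > threshold:
        return backup   # primary view is too stale: read the backup's view
    return primary      # primary has caught up (or never fell behind)


# Hypothetical region names for illustration.
print(choose_read_region("region-a", "region-b", primary_lag_seconds=0.4))
print(choose_read_region("region-a", "region-b", primary_lag_seconds=7.5))
```

The first call stays on the primary and the second switches to the backup. A bare threshold like this can flap when lag hovers around the trigger value, which is one of the failure modes listed further down.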

Verbatim canonical articulation

From the 2026-05-13 Browser Run migration post (Source: sources/2026-05-13-cloudflare-browser-run-now-running-on-cloudflare-containers-its-faster):

"With this configuration, we achieve acceptable lag times well below 2 seconds. That said, queue backlogs can still cause stale state. When this happens, each region falls back to a designated backup region until the primary queue catches up."

Two implications named explicitly:

  1. Steady-state lag <2 seconds is acceptable. The pattern is a backstop for excursions, not a continuous workaround.
  2. Backup region is designated per primary. Not a discovery-based or load-balanced choice — explicit pairing.
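One way to express "designated per primary" is a static pairing table that is validated at startup rather than discovered at runtime. The sketch below is an assumption about how such a table might look; the region names and the helpers `validate_pairing` and `backup_for` are hypothetical and do not come from the source.

```python
# Hypothetical static pairing: each primary names exactly one backup.
BACKUP_OF = {
    "region-a": "region-b",
    "region-b": "region-c",
    "region-c": "region-a",
}


def validate_pairing(backup_of: dict) -> None:
    """Reject pairings that could not serve as a fallback target."""
    for primary, backup in backup_of.items():
        if backup == primary:
            raise ValueError(f"{primary!r} cannot be its own backup")
        if backup not in backup_of:
            raise ValueError(f"backup {backup!r} of {primary!r} is not a known region")


def backup_for(primary: str, backup_of: dict = BACKUP_OF) -> str:
    """Look up the designated backup; no discovery or load balancing involved."""
    return backup_of[primary]


validate_pairing(BACKUP_OF)
print(backup_for("region-a"))
```

Because the table is plain data, the pairing can be reviewed and changed deliberately, which matches the explicit-pairing reading of the quote.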

Distinction from sibling patterns

  • Cross-region failover — full data-plane traffic move. This pattern is control-plane-only with the data plane unaffected.
  • DR / disaster recovery — typically triggered by primary unavailability. This pattern is triggered by primary control-plane staleness — the primary is up, just lagged.
  • Automatic reads-anywhere replication — the fallback is explicit + thresholded, not a continuous all-region read fan-out.

Preconditions

  1. Allocation state is regionally partitioned, with a primary region per workload and at least one backup region per primary.
  2. Backup region maintains a view of the primary's allocation state. Three possible mechanisms: the primary's queue mirrors into the backup's DB, the backup periodically scrapes the primary, or the primary's writers fan out to the backup when the primary signals degradation.
  3. Lag detection is cheap and timely — a control plane that can't quickly detect its own staleness can't trigger the fallback.
  4. Cross-region read has acceptable latency — the fallback path pays cross-region RTT on every allocation lookup; this must still be better than serving from stale primary state.
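Preconditions 3 and 4 can be made concrete with a small sketch. It assumes lag is measured as the age of the oldest message the consumer has not yet applied, and that the fallback is only worth taking when the backup's view is fresher than the primary's and the cross-region round trip fits a latency budget. Both definitions, and all names and parameters below, are assumptions for illustration; the source does not describe how lag is measured.

```python
from typing import Optional


def queue_lag_seconds(oldest_pending_enqueued_at: Optional[float], now: float) -> float:
    """Lag as the age of the oldest unapplied message; 0.0 for an empty queue.

    Timestamps are seconds since the epoch. Negative ages (clock skew) are
    clamped to zero so they never trigger a fallback.
    """
    if oldest_pending_enqueued_at is None:
        return 0.0
    return max(0.0, now - oldest_pending_enqueued_at)


def fallback_is_worthwhile(
    primary_lag_s: float,
    backup_staleness_s: float,
    cross_region_rtt_s: float,
    rtt_budget_s: float,
) -> bool:
    """True when the backup view is fresher and the extra RTT is affordable."""
    fresher = backup_staleness_s < primary_lag_s
    affordable = cross_region_rtt_s <= rtt_budget_s
    return fresher and affordable


lag = queue_lag_seconds(oldest_pending_enqueued_at=95.0, now=100.0)
print(lag)
print(fallback_is_worthwhile(lag, backup_staleness_s=1.0,
                             cross_region_rtt_s=0.08, rtt_budget_s=0.25))
```

Reading a single timestamp keeps detection cheap, which is the point of precondition 3. A real system would also need to decide where that timestamp comes from.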

When the pattern fits

  • Allocation control planes with regionally-bounded authority — Browser Run's per-region resource pools fit exactly this shape.
  • Workloads with tolerance for occasional stale-state excursions — the backup region's view is also slightly delayed (replication lag, scraping cadence); the fallback improves worst-case freshness, not absolute freshness.
  • Operations where the data plane is regionally pinned — resources don't move regions on demand, so there's no cross-region coordination needed for the data plane during the fallback.

When the pattern doesn't fit

  • Single-region deployments — no backup region to fall back to.
  • Latency-critical allocation paths — cross-region RTT on every read may be unacceptable for sub-millisecond allocation budgets.
  • Workloads with cross-region resource mobility — if resources move regions during fallback, the control plane needs full multi-region consensus, not designated-backup semantics.
  • Symmetric-load workloads — if every region is equally busy, designating one as backup-of-the-other doubles control-plane read load at exactly the wrong time. Better: full quorum or a separate observability tier.

Failure modes

  • Backup also overloaded. The fallback assumes the backup has spare control-plane capacity. Under correlated bursts (region-pair simultaneous spike), this assumption fails.
  • Fallback flapping. Lag dips below threshold → switch back → lag spikes above → switch out → repeat. Hysteresis bands on the threshold are the standard mitigation.
  • Two-region race. During the switchover, both primary and backup may briefly serve the same allocation read; if the underlying allocation primitive isn't atomic (concepts/sqlite-transaction-for-atomic-resource-claim), this can re-introduce overallocation.
  • Backup-region view of primary is stale by definition. Replication / scrape / fan-out pipelines have their own lag; the fallback is "less stale than primary right now", not "fresh."
  • Data plane region-mismatch. Resources running in the primary may still be the right answer when the control-plane lag is only an observability artefact. In that case the backup's view looks healthier without describing the underlying state any more accurately.
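The hysteresis mitigation for fallback flapping can be sketched as a two-threshold state machine: enter fallback above a high mark, and return to the primary only below a lower mark. The class name `FallbackSwitch` and the threshold values are hypothetical; the source does not say whether or how Browser Run applies hysteresis.

```python
class FallbackSwitch:
    """Two-threshold (hysteresis) switch for the control-plane read view."""

    def __init__(self, enter_above_s: float, exit_below_s: float) -> None:
        if exit_below_s >= enter_above_s:
            raise ValueError("exit threshold must be below enter threshold")
        self.enter_above_s = enter_above_s
        self.exit_below_s = exit_below_s
        self.on_backup = False  # start by reading the primary

    def update(self, primary_lag_s: float) -> bool:
        """Feed one lag sample; return True while reads should use the backup."""
        if not self.on_backup and primary_lag_s > self.enter_above_s:
            self.on_backup = True
        elif self.on_backup and primary_lag_s < self.exit_below_s:
            self.on_backup = False
        # Samples between the two thresholds leave the state unchanged.
        return self.on_backup


# Hypothetical thresholds: enter above 2.0 s, leave below 1.0 s.
switch = FallbackSwitch(enter_above_s=2.0, exit_below_s=1.0)
states = [switch.update(lag) for lag in [0.5, 2.5, 1.8, 1.5, 0.9, 1.9]]
print(states)  # [False, True, True, True, False, False]
```

Lag samples of 1.8 and 1.5 seconds keep the switch on the backup, and 1.9 seconds after recovery does not re-enter fallback. A single 2.0-second threshold would have switched back as soon as lag dipped to 1.8 seconds.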

Composes with

  • patterns/queue-batching-amortizes-db-write-throughput: the upstream pattern that produces the queue whose backlog triggers this fallback. The two patterns meet at one design hinge: batching is the throughput optimisation, and this fallback is the failsafe when the optimisation's steady-state assumption (lag <2s) no longer holds.

Seen in
