CONCEPT Cited by 5 sources
Queueing theory (as applied to storage/IO stacks)¶
Definition¶
Queueing theory is the math of how waiting lines form and drain when arrivals are asynchronous. Applied to systems: between the CPU and durable storage there is always a chain of queues — OS kernel ↔ host storage adapter ↔ fabric ↔ target adapter ↔ media — and each queue is a candidate bottleneck, a variance amplifier, or a place for one tenant's workload to interfere with another's.
The bank metaphor (Olson, 2024)¶
The EBS retrospective explains it as a bank:
You walk into the bank with a deposit, but first you have to traverse a queue before you can speak with a bank teller… In a perfect world, the number of patrons entering the bank arrive at the exact rate at which their request can be handled, and you never have to stand in a queue. But the real world isn't perfect. The real world is asynchronous.
Key implications:
- Averages hide the experience of the last person in line. Even if average wait is "acceptable," the first customer had zero wait, the last had a long one.
- Unless you have infinite resources, queues are necessary to absorb peak load. Queueless = over-provisioned.
- Options to improve any queue boil down to: add workers (parallelism), speed up each worker (lower service time), split queues by class (priority/SLO lanes). Each costs money.
In network storage stacks¶
In a network-attached block-storage system (e.g. systems/aws-ebs) there are several stacked queues:
- OS kernel ↔ storage adapter
- Host storage adapter ↔ storage fabric
- Target storage adapter
- Storage media itself
Legacy stacks use different vendors per layer (fiber channel, iSCSI, NFS over TCP), and tuning "the" storage network requires specialized knowledge distinct from tuning the app or the media.
The hidden failure mode is cross-layer interference: a queue-depth choice at one layer couples tenants at an adjacent layer. Classic example from the EBS post — Xen block-device default ring-queue parameters capped the host at 64 outstanding IOs across all devices, a per-host noisy-neighbor source that only surfaced via patterns/loopback-isolation.
Why this matters for EBS-class systems¶
- Hard drives amplified variance. 120-150 IOPS/drive hard ceiling; command reordering and queue depth push tail latency into the hundreds of ms. Spreading one tenant across many drives reduced their worst case but widened the blast radius — classic noisy-neighbor compounding.
- SSDs didn't fix the stack. They collapsed media latency, which shifted the bottleneck up to queues in OS, hypervisor, and network.
- Queue reduction is a north star. systems/nitro offload removes hypervisor queue layers. systems/srd replaces TCP with a protocol whose congestion/retransmit behavior doesn't impose an ordering queue storage IOs don't need.
Seen in¶
- sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — the framing chapter of the post ("Queueing theory, briefly") and the through-line for every architectural move in EBS's history.
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — the object-storage flavor: a single HDD delivers ~120 random-access IOPS and that number has been flat since before S3 launched in 2006 (concepts/hard-drive-physics). Once a drive queues, every layer that was waiting on it queues — metadata lookup, concepts/erasure-coding reconstruct, frontend request. Hotspots are therefore the dominant latency story, answered by concepts/heat-management rather than by tuning per-drive service time (which can't move much).
Seen in — GPU inference serving (Voyage AI, 2025)¶
- sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference
— queueing theory applied to the request queue in front of GPU
inference workers, not to storage. Voyage
AI by MongoDB frames the queue-design problem for
token-count-based batching
explicitly: the queue must attach a per-item cost estimate
(
token_count), support peek across pending items, and atomically claim a subset whose total cost fits a budget (see patterns/atomic-conditional-batch-claim). General-purpose brokers — RabbitMQ's request-count prefetch, Kafka's byte-based partition batching — don't satisfy these primitives. Voyage AI picks Redis with a single-threaded Lua script as the native substrate, with per-item TTLs set in the same atomic call. The classical queueing moves (add workers, speed workers, split queues by class) here take the form of: align the queue's batching axis with the downstream worker's cost axis (tokens → GPU FLOPs via padding removal), so burst absorption stays in the flat region of the inference latency curve instead of piling up behind a memory-bound GPU.
Queueing in front of SQL query clusters¶
sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale adds another queueing axis: the queue of submitted queries in front of a Presto cluster. At Meta-scale the queueing story shows up in three concrete ways:
-
Queueing time is a customer-facing SLA metric. "Defining SLAs around important metrics like queueing time and query failure rate" is Meta's first piece of scaling advice. The customer doesn't care about service-time vs wait-time decomposition — they care about total latency from submission to first row, and queueing is the variable part.
-
Routing is a queueing-minimization problem. The Meta Presto Gateway routes each query considering "current state of queuing on Presto clusters, distribution of hardware across different datacenters, the data locality of the tables that the query uses." In queueing-theory terms: Meta is doing join-the-shortest-queue routing conditioned on data-locality affinity — a form of the classical "split queues by class" option where the "class" is (workload, data region).
-
Queueing pain root-causes are hard to diagnose by dashboard. Meta notes that "given this complexity, it can be very hard for an oncall to figure out the root cause of any queueing problems encountered in production" and builds dedicated analyzers that pull multi-source signals (cluster queue state, routing decisions, hardware topology) and output probable root cause (see concepts/automated-root-cause-analysis, patterns/oncall-analyzer). The general-purpose queueing equivalent: at scale, the decision surface for why a queue is long has enough inputs that a generic dashboard does not reduce oncall cognitive load; purpose-built analyzers do.
Queueing as the basis of database-throttler metric design¶
sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1 applies queueing theory to throttler signal selection in a MySQL database context. Shlomi Noach's core claim is that every load-predicting metric is a symptom that summarises some underlying queue:
| Metric | Underlying queue(s) | Queue type |
|---|---|---|
| concepts/replication-lag | Changelog event queue (net + disk + apply + commit) | Wait time |
| concepts/threads-running-mysql | Commit queue OR lock-wait queue OR page-cache-miss queue | Length |
| concepts/transaction-commit-delay | Commit / fsync queue | Wait time |
| concepts/load-average | CPU run-queue + uninterruptible-I/O queue | Length (with wait-time correction) |
| concepts/connection-pool-exhaustion | New-connection queue | Length (saturation-gated) |
Noach's argument synthesises several queueing-theory results:
- Wait time is the better signal where measurable (see concepts/queue-length-vs-wait-time). Replication lag and commit delay are wait times; the others are lengths.
- Length is the cheaper fallback — load average and
threads_runningare length-gauges, useful when wait-time instrumentation is absent. - Thresholds must be derived from physics, not guessed. Pool
size was chosen for hardware reasons; commit delay floor is
fsync-bound; the throttler inherits existing thresholds rather
than introducing new artificial ones. Thresholds without physical
backing (
threads_running > 100? load-average > 1? etc.) don't hold across environments. - Combinations of queue-summary metrics catch orthogonal failure modes — the argument for multi-metric throttling.
A database throttler is therefore best understood as a control loop reading queue-health summaries and admit-gating work against them. Vitess's throttler is the canonical instance of this shape.