CONCEPT Cited by 3 sources
Queueing theory (as applied to storage/IO stacks)¶
Definition¶
Queueing theory is the math of how waiting lines form and drain when arrivals are asynchronous. Applied to systems: between the CPU and durable storage there is always a chain of queues — OS kernel ↔ host storage adapter ↔ fabric ↔ target adapter ↔ media — and each queue is a candidate bottleneck, a variance amplifier, or a place for one tenant's workload to interfere with another's.
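A quick way to sanity-check any link in that chain is Little's law, L = λW: the average number of requests in flight equals arrival rate times mean time in system. A minimal sketch (the IOPS and latency figures are illustrative, not from the source):

```python
def littles_law_in_flight(iops, latency_s):
    """Little's law: L = lambda * W.
    Average in-flight requests = arrival rate * mean time in system."""
    return iops * latency_s

# A device sustaining 50,000 IOPS at 200 us mean latency must hold
# about 10 IOs in flight on average -- so a queue-depth cap below 10
# anywhere in the chain becomes the bottleneck.
qd = littles_law_in_flight(50_000, 200e-6)
```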
The bank metaphor (Olson, 2024)¶
The EBS retrospective explains it as a bank:
You walk into the bank with a deposit, but first you have to traverse a queue before you can speak with a bank teller… In a perfect world, the number of patrons entering the bank arrive at the exact rate at which their request can be handled, and you never have to stand in a queue. But the real world isn't perfect. The real world is asynchronous.
Key implications:
- Averages hide the experience of the last person in line. Even if average wait is "acceptable," the first customer had zero wait, the last had a long one.
- Unless you have infinite resources, queues are necessary to absorb peak load. Queueless = over-provisioned.
- Options to improve any queue boil down to: add workers (parallelism), speed up each worker (lower service time), split queues by class (priority/SLO lanes). Each costs money.
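The first two levers can be made concrete with the textbook M/M/c mean-wait formula (Erlang C) — a standard model, not something the EBS post derives, and the rates here are illustrative:

```python
from math import factorial

def mmc_wait(lam, mu, c):
    """Mean queueing delay Wq for an M/M/c queue via the Erlang C formula.
    lam = arrival rate, mu = per-server service rate, c = servers."""
    a = lam / mu          # offered load in Erlangs
    rho = a / c           # per-server utilization
    assert rho < 1, "queue is unstable"
    series = sum(a**k / factorial(k) for k in range(c))
    pc = a**c / factorial(c)
    p_wait = pc / ((1 - rho) * series + pc)  # P(arrival must queue)
    return p_wait / (c * mu - lam)

base   = mmc_wait(lam=90, mu=100, c=1)  # one teller at 90% utilization: ~90 ms wait
faster = mmc_wait(lam=90, mu=200, c=1)  # lever: speed up the worker
more   = mmc_wait(lam=90, mu=100, c=2)  # lever: add a worker
```

Both levers collapse the wait, and the third (splitting queues by class) changes who pays the remaining wait rather than its total.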
In network storage stacks¶
In a network-attached block-storage system (e.g. systems/aws-ebs) there are several stacked queues:
- OS kernel ↔ storage adapter
- Host storage adapter ↔ storage fabric
- Target storage adapter
- Storage media itself
Legacy stacks often mix vendors and protocols per layer (Fibre Channel, iSCSI, NFS over TCP), so tuning "the" storage network requires specialized knowledge distinct from tuning the application or the media.
The hidden failure mode is cross-layer interference: a queue-depth choice at one layer couples tenants at an adjacent layer. Classic example from the EBS post — Xen block-device default ring-queue parameters capped the host at 64 outstanding IOs across all devices, a per-host noisy-neighbor source that only surfaced via patterns/loopback-isolation.
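A toy model of that coupling — not EBS's actual admission logic, just an illustration of why a shared host-wide slot pool couples tenants while per-tenant caps decouple them:

```python
HOST_CAP = 64  # the Xen ring-queue default from the EBS post

def grant_slots(demands, host_cap=HOST_CAP):
    """First-come-first-served grant of outstanding-IO slots from a
    single host-wide pool (the shared-ring failure mode)."""
    granted, free = {}, host_cap
    for tenant, want in demands:
        got = min(want, free)
        granted[tenant] = got
        free -= got
    return granted

# A noisy tenant that arrives first drains the whole pool:
shared = grant_slots([("noisy", 64), ("quiet", 8)])

# Per-tenant caps (the kind of isolation loopback testing exposed the
# need for) keep the quiet tenant's slots intact:
isolated = {t: min(w, 32) for t, w in [("noisy", 64), ("quiet", 8)]}
```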
Why this matters for EBS-class systems¶
- Hard drives amplified variance. 120-150 IOPS/drive hard ceiling; command reordering and queue depth push tail latency into the hundreds of ms. Spreading one tenant across many drives reduced their worst case but widened the blast radius — classic noisy-neighbor compounding.
- SSDs didn't fix the stack. They collapsed media latency, which shifted the bottleneck up to queues in OS, hypervisor, and network.
- Queue reduction is a north star. systems/nitro offload removes hypervisor queue layers. systems/srd replaces TCP with a protocol whose congestion/retransmit behavior doesn't impose an ordering queue that storage IOs don't need.
Seen in¶
- sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — the framing chapter of the post ("Queueing theory, briefly") and the through-line for every architectural move in EBS's history.
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — the object-storage flavor: a single HDD delivers ~120 random-access IOPS and that number has been flat since before S3 launched in 2006 (concepts/hard-drive-physics). Once a drive queues, every layer that was waiting on it queues — metadata lookup, concepts/erasure-coding reconstruct, frontend request. Hotspots are therefore the dominant latency story, answered by concepts/heat-management rather than by tuning per-drive service time (which can't move much).
Seen in — GPU inference serving (Voyage AI, 2025)¶
- sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference — queueing theory applied to the request queue in front of GPU inference workers, not to storage. Voyage AI by MongoDB frames the queue-design problem for token-count-based batching explicitly: the queue must attach a per-item cost estimate (token_count), support peek across pending items, and atomically claim a subset whose total cost fits a budget (see patterns/atomic-conditional-batch-claim). General-purpose brokers — RabbitMQ's request-count prefetch, Kafka's byte-based partition batching — don't satisfy these primitives. Voyage AI picks Redis with a single-threaded Lua script as the native substrate, with per-item TTLs set in the same atomic call. The classical queueing moves (add workers, speed workers, split queues by class) here take the form of: align the queue's batching axis with the downstream worker's cost axis (tokens → GPU FLOPs via padding removal), so burst absorption stays in the flat region of the inference latency curve instead of piling up behind a memory-bound GPU.
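The peek-and-claim step can be sketched in plain Python. In the Voyage AI design this runs inside a single Redis Lua call, which is what makes it atomic; the names, the greedy policy, and the skip-items-that-don't-fit behavior below are assumptions for illustration, not the published script:

```python
def claim_batch(pending, token_budget):
    """Walk pending requests in arrival order and claim every item whose
    token_count still fits the remaining budget. `pending` is a list of
    (request_id, token_count) pairs -- an illustrative schema, not the
    actual Redis layout. Returns (claimed ids, items left queued)."""
    claimed, keep, remaining = [], [], token_budget
    for req_id, tokens in pending:
        if tokens <= remaining:
            claimed.append(req_id)
            remaining -= tokens
        else:
            keep.append((req_id, tokens))
    return claimed, keep

# A 600-token item that would blow the budget stays queued while a
# later, smaller item is still claimed:
claimed, left = claim_batch([("a", 300), ("b", 600), ("c", 200)], 800)
```

Done as separate peek-then-claim calls from Python, two workers could claim the same item; pushing the whole loop into one single-threaded Lua script is what removes that race.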