Tail-latency spike during queueing
Definition
A tail-latency spike during queueing is the pathology where a small backend slowdown produces a disproportionately large increase in P99/P99.9 latency because a FIFO queue forms and subsequent requests serialise behind the slow-head request. Even a brief origin latency increase — often just a few hundred milliseconds — can turn into multi-second P99 as the queue drains.
The mechanism is straightforward: if the service rate falls below the arrival rate for a window, the queue grows by (arrival − service) per unit time. FIFO means every waiting request must be served before newer ones are even considered, so the newest request's latency is the total queue wait plus its own service time — not its own service time alone.
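A minimal sketch of that arithmetic, with made-up illustration numbers (1,000 req/s arrivals, a 400 ms window at 500 req/s service — none of these figures come from the source):

```java
// Illustration-only numbers: arrival/service rates and the spike window are assumptions.
final class QueueGrowth {
    // Requests accumulated while the backend is slower than the arrival rate.
    static double backlog(double arrivalPerSec, double servicePerSec, double windowSec) {
        return (arrivalPerSec - servicePerSec) * windowSec;
    }

    // FIFO latency of the newest arrival: total wait behind the backlog plus its own service time.
    static double newestLatencySec(double backlog, double drainPerSec, double ownServiceSec) {
        return backlog / drainPerSec + ownServiceSec;
    }
}
```

With those assumed numbers the backlog is (1000 − 500) × 0.4 = 200 requests, and even after the backend recovers to 1,000 req/s the newest arrival waits ~200 ms for the backlog before its own ~1 ms of service.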
Why FIFO is the default and why it hurts tails
- FIFO is fair — oldest-request-first is intuitive and matches social expectations about queuing.
- FIFO is simple — a bare `ArrayDeque` or `LinkedList` gets you there.
- FIFO is hostile to tail latency under transient slowdowns — a brief upstream spike at time t punishes every request that arrives in the window [t, t + drain], not just the one that caused the stall. For a latency-sensitive API where clients are time-bounded (e.g. a 10 ms timeout), FIFO turns a 100 ms origin spike into up to 100 ms × queue-depth of latency for everything behind it.
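A small single-server simulation of that drain window — all timings are illustration-only assumptions (arrivals every 2 ms, 1 ms service, and only the head request stalling for 100 ms):

```java
import java.util.ArrayList;
import java.util.List;

// Single-server FIFO drain sketch. Timings are assumptions for illustration:
// one request arrives every 2 ms, service normally takes 1 ms, and the
// request at t=0 stalls the server for 100 ms.
final class FifoDrain {
    static List<Long> latenciesMs(int requests) {
        List<Long> latencies = new ArrayList<>();
        long serverFreeAt = 0;
        for (int i = 0; i < requests; i++) {
            long arrival = 2L * i;                 // arrival every 2 ms
            long service = (i == 0) ? 100 : 1;     // head request stalls 100 ms
            long start = Math.max(arrival, serverFreeAt);
            serverFreeAt = start + service;
            latencies.add(serverFreeAt - arrival); // queue wait + own service
        }
        return latencies;
    }
}
```

Requests arriving just behind the stalled head see ~99 ms latency for 1 ms of work, and latencies only return to the nominal 1 ms once the backlog drains, here around t ≈ 200 ms.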
Zalando's PRAPI documents this explicitly:
"In latency-sensitive applications, FIFO queuing can create long-tail latency spikes." (Source: sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api.)
The LIFO alternative
LIFO (last-in-first-out) inverts the discipline: newest arrivals are served first. Under a backend slowdown, the queue still grows, but new requests jump ahead of the piled-up stale ones. Old queued requests may time out or be abandoned — that's the point: time-bounded clients are going to fail anyway, and it's better to deliver fresh requests quickly than to deliver stale requests slowly.
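The discipline change itself is tiny — with a deque, FIFO and LIFO differ only in which end the worker polls (a sketch; the request names are illustrative):

```java
import java.util.Deque;

// Discipline sketch: the same deque serves as FIFO or LIFO depending on
// which end the worker polls. Request names are illustrative only.
final class Discipline {
    static String nextFifo(Deque<String> pending) { return pending.pollFirst(); } // oldest first
    static String nextLifo(Deque<String> pending) { return pending.pollLast(); }  // newest first
}
```

With pending entries [stale-1, stale-2, fresh], FIFO hands the worker stale-1 while LIFO hands it fresh — a time-bounded client gets a useful answer while the stale entries age toward their timeouts.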
PRAPI applied LIFO in two places:
- Load balancer queue. "While we aim to avoid request queuing, switching to LIFO reduced long-tail latency spikes when queuing occurred."
- DynamoDB client queue. Paired with a two-client fallback architecture (10 ms primary, 100 ms fallback for retries) — the fast client's queue is LIFO-disciplined so a DynamoDB latency spike doesn't cascade through every inflight request.
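A hedged sketch of the fallback control flow implied by that two-client setup — this is not PRAPI's code, and the 10 ms / 100 ms timeouts are assumed to be configured inside each client rather than shown here:

```java
import java.util.function.Supplier;

// Hedged sketch of the two-client idea: a fast primary call, and on failure
// (e.g. a tight 10 ms timeout firing inside the client) one retry on a more
// patient fallback client. Only the fallback control flow is shown.
final class TwoClientRead {
    static <T> T read(Supplier<T> primary10ms, Supplier<T> fallback100ms) {
        try {
            return primary10ms.get();
        } catch (RuntimeException fastPathFailed) {
            return fallback100ms.get(); // single retry on the patient client
        }
    }
}
```

Keeping the timeout inside each client keeps the fallback path a plain catch-and-retry; only one retry is attempted, so a sustained outage fails fast instead of stacking retries.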
When LIFO is wrong
- Strict-order or transactional workloads — if request order affects correctness, LIFO can violate invariants.
- Unbounded work-queue scenarios — LIFO starves the oldest requests; with unbounded queues this means some requests may never be served. A drain/evict mechanism for stale tail entries is required.
- Work that must complete — e.g. mutating API calls with no idempotency key. LIFO without a timeout discipline can indefinitely delay queued mutations.
LIFO is a tail-latency optimisation for read-heavy, time-bounded, retry-tolerant paths. It's a poor default for durable write pipelines.
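One way to sketch the drain/evict discipline called for above — a LIFO poll that drops entries already past their deadline instead of serving them stale (the names and the injected clock are illustrative, not from the source):

```java
import java.util.Deque;

// LIFO service plus lazy eviction: entries past their deadline are dropped
// when the worker reaches them, so starved tail entries are never served
// arbitrarily late. The clock is passed in for determinism.
final class LifoWithEviction {
    record Entry(String id, long deadlineMs) {}

    // Serve the newest still-live entry; silently evict anything expired.
    static Entry poll(Deque<Entry> pending, long nowMs) {
        Entry e;
        while ((e = pending.pollLast()) != null) {
            if (e.deadlineMs() > nowMs) return e; // newest live entry wins
            // else: past deadline — evict and keep scanning
        }
        return null;
    }
}
```

Because eviction happens lazily at poll time, starved tail entries cost nothing until the worker reaches them, and they are dropped rather than served after their clients have given up.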
Related mechanisms
- concepts/head-of-line-blocking — the general name for the phenomenon across protocols (HTTP/1, TCP streams, broker partitions). FIFO queues are a common source of HoL blocking.
- concepts/backpressure — the upstream flow-control alternative. Instead of queuing, push back on producers.
- concepts/timeout — the per-request ceiling that lets LIFO prune stale entries safely.
Seen in
- sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api — canonical description of the LIFO-vs-FIFO trade-off at both the LB and DB-client altitudes in a single production system.
Related
- concepts/head-of-line-blocking · concepts/backpressure · concepts/timeout
- patterns/lifo-queuing-for-tail-latency — the pattern form
- systems/zalando-prapi — production consumer