Tail-latency spike during queueing

Definition

A tail-latency spike during queueing is the pathology where a small backend slowdown produces a disproportionately large increase in P99/P99.9 latency because a FIFO queue forms and subsequent requests serialise behind the slow-head request. Even a brief origin latency increase — often just a few hundred milliseconds — can turn into multi-second P99 as the queue drains.

The mechanism is straightforward: if the service rate falls below the arrival rate for a window, the queue grows at (arrival rate − service rate) per unit time. Under FIFO, every waiting request must be served before newer ones are even considered, so the newest request's latency is the total queue wait plus its own service time, not its service time alone.
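The arithmetic is worth making concrete. With hypothetical numbers (100 req/s arrivals, 5 ms normal service, a 0.5 s window where service degrades to 50 ms), a brief spike leaves a backlog that dominates the newest request's latency:

```python
# Hypothetical numbers: 100 req/s arrivals, 5 ms normal service,
# and a 0.5 s window where service degrades to 50 ms per request.
arrival_rate = 100.0      # requests per second
normal_service = 0.005    # seconds per request
slow_service = 0.050      # seconds per request during the spike
spike = 0.5               # seconds

# While degraded, the server completes 1/0.050 = 20 req/s, so the
# queue grows at (arrival - service) = 80 req/s.
queue_depth = (arrival_rate - 1.0 / slow_service) * spike   # 40 requests

# FIFO: a request arriving as the spike ends waits behind the whole
# backlog, so its latency is queue wait + its own service time.
latency = queue_depth * normal_service + normal_service     # 0.205 s

print(f"backlog: {queue_depth:.0f} requests; "
      f"newest-request latency: {latency * 1000:.0f} ms vs 5 ms normal")
```

A 500 ms origin wobble thus costs a fresh arrival roughly 40× its normal latency even after the backend has fully recovered, because the backlog still has to drain in front of it.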

Why FIFO is the default and why it hurts tails

  • FIFO is fair — oldest-request-first is intuitive and matches social expectations about queuing.
  • FIFO is simple — a bare ArrayDeque or LinkedList gets you there.
  • FIFO is hostile to tail latency under transient slowdowns — a brief upstream spike at time t punishes every request that arrives in the window [t, t+drain], not just the one that caused the stall. For a latency-sensitive API where clients are time-bounded (e.g. 10 ms timeout), FIFO turns a 100 ms origin spike into 100 ms × queue-depth latency for everything behind it.
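The "simple" point cuts both ways: the two disciplines differ only in which end of the same double-ended queue you pop from, as this minimal Python sketch shows (Java's ArrayDeque offers the same pair of operations):

```python
from collections import deque

requests = ["r1", "r2", "r3", "r4"]   # arrival order, r1 oldest

q = deque(requests)
fifo_order = [q.popleft() for _ in requests]   # FIFO: oldest first
q = deque(requests)
lifo_order = [q.pop() for _ in requests]       # LIFO: newest first

print(fifo_order)   # ['r1', 'r2', 'r3', 'r4']
print(lifo_order)   # ['r4', 'r3', 'r2', 'r1']
```

Because the switch is a one-line change, the queue discipline is a policy decision, not an engineering investment.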

Zalando's PRAPI documents this explicitly:

"In latency-sensitive applications, FIFO queuing can create long-tail latency spikes." (Source: sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api.)

The LIFO alternative

LIFO (last-in-first-out) inverts the discipline: newest arrivals are served first. Under a backend slowdown, the queue still grows, but new requests jump ahead of the piled-up stale ones. Old queued requests may time out or be abandoned — that's the point: time-bounded clients are going to fail anyway, and it's better to deliver fresh requests quickly than to deliver stale requests slowly.
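A toy discrete-event simulation (all numbers hypothetical: 10 ms interarrivals, 5 ms service, one request that stalls the backend for 200 ms, a 25 ms client deadline) illustrates the trade: LIFO sacrifices the stale backlog, which was going to miss its deadline anyway, and in exchange more requests overall complete in time:

```python
from collections import deque

def within_deadline(discipline, n=100, interarrival=10, service=5,
                    stall=200, deadline=25):
    """Single-server queue in integer milliseconds. Request 0 stalls the
    backend for `stall` ms; every other request takes `service` ms.
    Returns how many requests complete within `deadline` of arrival."""
    arrivals = [i * interarrival for i in range(n)]
    q, t, i, served, hits = deque(), 0, 0, 0, 0
    while served < n:
        while i < n and arrivals[i] <= t:        # admit new arrivals
            q.append(i)
            i += 1
        if not q:                                # idle until next arrival
            t = arrivals[i]
            continue
        req = q.popleft() if discipline == "fifo" else q.pop()
        t += stall if req == 0 else service
        hits += (t - arrivals[req]) <= deadline
        served += 1
    return hits

fifo, lifo = within_deadline("fifo"), within_deadline("lifo")
print(f"served within 25 ms: FIFO {fifo}/100, LIFO {lifo}/100")
```

Under FIFO every request admitted during the drain window serialises behind the stall and blows its deadline; under LIFO only the stalled request and the pile of stale entries miss, while fresh arrivals keep completing in one service time.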

PRAPI applied LIFO in two places:

  1. Load balancer queue. "While we aim to avoid request queuing, switching to LIFO reduced long-tail latency spikes when queuing occurred."
  2. DynamoDB client queue. Paired with a two-client fallback architecture (10 ms primary, 100 ms fallback for retries) — the fast client's queue is LIFO-disciplined so a DynamoDB latency spike doesn't cascade through every inflight request.
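The shape of that two-client pairing can be sketched as follows. Only the 10 ms / 100 ms deadlines come from the source; the `fetch` function is a hypothetical stand-in for the DynamoDB call, and the thread-pool plumbing is illustrative:

```python
import time
import concurrent.futures as cf

def fetch(key, simulated_latency_s):
    """Hypothetical stand-in for a DynamoDB GetItem call."""
    time.sleep(simulated_latency_s)
    return {"key": key}

def get_with_fallback(pool, key, primary_latency_s):
    """Tight-deadline primary attempt; on timeout, retry on the
    looser-deadline fallback path (10 ms / 100 ms per the source)."""
    try:
        return pool.submit(fetch, key, primary_latency_s).result(timeout=0.010)
    except cf.TimeoutError:
        # The timed-out primary call is abandoned, not cancelled: its
        # thread keeps running, which is the cost of this pattern.
        return pool.submit(fetch, key, 0.001).result(timeout=0.100)

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    fast = get_with_fallback(pool, "a", primary_latency_s=0.001)  # primary wins
    slow = get_with_fallback(pool, "b", primary_latency_s=0.050)  # falls back
    print(fast, slow)
```

The design point is that a latency spike on the fast path converts into a bounded retry on the slow path rather than into queueing behind every other inflight request.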

When LIFO is wrong

  • Strict-order or transactional workloads — if request order affects correctness, LIFO can violate invariants.
  • Unbounded work-queue scenarios — LIFO starves the oldest requests; with unbounded queues this means some requests may never be served. A drain/evict mechanism for stale tail entries is required.
  • Work that must complete — e.g. mutating API calls with no idempotency key. LIFO without a timeout discipline can indefinitely delay queued mutations.
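One way to satisfy the drain/evict requirement above is to serve from one end of a deque and prune expired entries from the other, so stale requests are abandoned explicitly rather than starved forever. A hypothetical sketch (the 50 ms deadline and explicit `now` clock are illustrative):

```python
import time
from collections import deque

class DeadlineLifoQueue:
    """LIFO service plus eviction of entries that outlive a deadline."""

    def __init__(self, deadline_s):
        self.deadline_s = deadline_s
        self.entries = deque()        # (enqueue_time, request), oldest at left

    def push(self, request, now=None):
        self.entries.append((time.monotonic() if now is None else now, request))

    def evict_expired(self, now=None):
        """Drop and return requests that have waited past the deadline.
        Oldest entries sit at the left end, so pruning stops early."""
        now = time.monotonic() if now is None else now
        expired = []
        while self.entries and now - self.entries[0][0] > self.deadline_s:
            expired.append(self.entries.popleft()[1])
        return expired

    def pop(self, now=None):
        """Serve the newest request, pruning stale entries first."""
        self.evict_expired(now)
        return self.entries.pop()[1] if self.entries else None

q = DeadlineLifoQueue(deadline_s=0.050)
q.push("old", now=0.000)
q.push("mid", now=0.030)
q.push("new", now=0.060)
print(q.evict_expired(now=0.062))   # ['old'] has blown its 50 ms budget
print(q.pop(now=0.062))             # 'new' jumps ahead of 'mid'
```

Eviction can surface the expired requests to the caller (to send an error or record a metric) instead of silently letting them rot at the bottom of the stack.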

LIFO is a tail-latency optimisation for read-heavy, time-bounded, retry-tolerant paths. It's a poor default for durable write pipelines.

Related concepts

  • concepts/head-of-line-blocking — the general name for the phenomenon across protocols (HTTP/1, TCP streams, broker partitions). FIFO queues are a common source of HoL blocking.
  • concepts/backpressure — the upstream flow-control alternative. Instead of queuing, push back on producers.
  • concepts/timeout — the per-request ceiling that lets LIFO prune stale entries safely.
