Tail-latency spike during queueing

Definition

A tail-latency spike during queueing is the pathology where a small backend slowdown produces a disproportionately large increase in P99/P99.9 latency because a FIFO queue forms and subsequent requests serialise behind the slow-head request. Even a brief origin latency increase — often just a few hundred milliseconds — can turn into multi-second P99 as the queue drains.

The mechanism is straightforward: if the service rate falls below the arrival rate for a window, the queue grows at (arrival rate − service rate) per unit time. Under FIFO, every waiting request must be served before newer ones are even considered, so the newest request's latency is the total queue wait plus its own service time, not its service time alone.
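The arithmetic is worth making concrete. With hypothetical numbers (100 req/s arrivals, 5 ms normal service, a 0.5 s window where service degrades to 50 ms), a brief spike leaves a backlog that dominates the newest request's latency:

```python
# Hypothetical numbers: 100 req/s arrivals, 5 ms normal service,
# and a 0.5 s window where service degrades to 50 ms per request.
arrival_rate = 100.0      # requests per second
normal_service = 0.005    # seconds per request
slow_service = 0.050      # seconds per request during the spike
spike = 0.5               # seconds

# While degraded, the server completes 1/0.050 = 20 req/s, so the
# queue grows at (arrival - service) = 80 req/s.
queue_depth = (arrival_rate - 1.0 / slow_service) * spike   # 40 requests

# FIFO: a request arriving as the spike ends waits behind the whole
# backlog, so its latency is queue wait + its own service time.
latency = queue_depth * normal_service + normal_service     # 0.205 s

print(f"backlog: {queue_depth:.0f} requests; "
      f"newest-request latency: {latency * 1000:.0f} ms vs 5 ms normal")
```

A 500 ms origin wobble thus costs a fresh arrival roughly 40× its normal latency even after the backend has fully recovered, because the backlog still has to drain in front of it.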

Why FIFO is the default and why it hurts tails

  • FIFO is fair — oldest-request-first is intuitive and matches social expectations about queuing.
  • FIFO is simple — a bare ArrayDeque or LinkedList gets you there.
  • FIFO is hostile to tail latency under transient slowdowns — a brief upstream spike at time t punishes every request that arrives in the window [t, t+drain], not just the one that caused the stall. For a latency-sensitive API where clients are time-bounded (e.g. 10 ms timeout), FIFO turns a 100 ms origin spike into 100 ms × queue-depth latency for everything behind it.
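The "simple" point cuts both ways: the two disciplines differ only in which end of the same double-ended queue you pop from, as this minimal Python sketch shows (Java's ArrayDeque offers the same pair of operations):

```python
from collections import deque

requests = ["r1", "r2", "r3", "r4"]   # arrival order, r1 oldest

q = deque(requests)
fifo_order = [q.popleft() for _ in requests]   # FIFO: oldest first
q = deque(requests)
lifo_order = [q.pop() for _ in requests]       # LIFO: newest first

print(fifo_order)   # ['r1', 'r2', 'r3', 'r4']
print(lifo_order)   # ['r4', 'r3', 'r2', 'r1']
```

Because the switch is a one-line change, the queue discipline is a policy decision, not an engineering investment.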

Zalando's PRAPI documents this explicitly:

"In latency-sensitive applications, FIFO queuing can create long-tail latency spikes." (Source: sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api.)

The LIFO alternative

LIFO (last-in-first-out) inverts the discipline: newest arrivals are served first. Under a backend slowdown, the queue still grows, but new requests jump ahead of the piled-up stale ones. Old queued requests may time out or be abandoned — that's the point: time-bounded clients are going to fail anyway, and it's better to deliver fresh requests quickly than to deliver stale requests slowly.
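A toy discrete-event simulation (all numbers hypothetical: 10 ms interarrivals, 5 ms service, one request that stalls the backend for 200 ms, a 25 ms client deadline) illustrates the trade: LIFO sacrifices the stale backlog, which was going to miss its deadline anyway, and in exchange more requests overall complete in time:

```python
from collections import deque

def within_deadline(discipline, n=100, interarrival=10, service=5,
                    stall=200, deadline=25):
    """Single-server queue in integer milliseconds. Request 0 stalls the
    backend for `stall` ms; every other request takes `service` ms.
    Returns how many requests complete within `deadline` of arrival."""
    arrivals = [i * interarrival for i in range(n)]
    q, t, i, served, hits = deque(), 0, 0, 0, 0
    while served < n:
        while i < n and arrivals[i] <= t:        # admit new arrivals
            q.append(i)
            i += 1
        if not q:                                # idle until next arrival
            t = arrivals[i]
            continue
        req = q.popleft() if discipline == "fifo" else q.pop()
        t += stall if req == 0 else service
        hits += (t - arrivals[req]) <= deadline
        served += 1
    return hits

fifo, lifo = within_deadline("fifo"), within_deadline("lifo")
print(f"served within 25 ms: FIFO {fifo}/100, LIFO {lifo}/100")
```

Under FIFO every request admitted during the drain window serialises behind the stall and blows its deadline; under LIFO only the stalled request and the pile of stale entries miss, while fresh arrivals keep completing in one service time.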

PRAPI applied LIFO in two places:

  1. Load balancer queue. "While we aim to avoid request queuing, switching to LIFO reduced long-tail latency spikes when queuing occurred."
  2. DynamoDB client queue. Paired with a two-client fallback architecture (10 ms primary, 100 ms fallback for retries) — the fast client's queue is LIFO-disciplined so a DynamoDB latency spike doesn't cascade through every inflight request.
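The shape of that two-client pairing can be sketched as follows. Only the 10 ms / 100 ms deadlines come from the source; the `fetch` function is a hypothetical stand-in for the DynamoDB call, and the thread-pool plumbing is illustrative:

```python
import time
import concurrent.futures as cf

def fetch(key, simulated_latency_s):
    """Hypothetical stand-in for a DynamoDB GetItem call."""
    time.sleep(simulated_latency_s)
    return {"key": key}

def get_with_fallback(pool, key, primary_latency_s):
    """Tight-deadline primary attempt; on timeout, retry on the
    looser-deadline fallback path (10 ms / 100 ms per the source)."""
    try:
        return pool.submit(fetch, key, primary_latency_s).result(timeout=0.010)
    except cf.TimeoutError:
        # The timed-out primary call is abandoned, not cancelled: its
        # thread keeps running, which is the cost of this pattern.
        return pool.submit(fetch, key, 0.001).result(timeout=0.100)

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    fast = get_with_fallback(pool, "a", primary_latency_s=0.001)  # primary wins
    slow = get_with_fallback(pool, "b", primary_latency_s=0.050)  # falls back
    print(fast, slow)
```

The design point is that a latency spike on the fast path converts into a bounded retry on the slow path rather than into queueing behind every other inflight request.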

When LIFO is wrong

  • Strict-order or transactional workloads — if request order affects correctness, LIFO can violate invariants.
  • Unbounded work-queue scenarios — LIFO starves the oldest requests; with unbounded queues this means some requests may never be served. A drain/evict mechanism for stale tail entries is required.
  • Work that must complete — e.g. mutating API calls with no idempotency key. LIFO without a timeout discipline can indefinitely delay queued mutations.
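One way to satisfy the drain/evict requirement above is to serve from one end of a deque and prune expired entries from the other, so stale requests are abandoned explicitly rather than starved forever. A hypothetical sketch (the 50 ms deadline and explicit `now` clock are illustrative):

```python
import time
from collections import deque

class DeadlineLifoQueue:
    """LIFO service plus eviction of entries that outlive a deadline."""

    def __init__(self, deadline_s):
        self.deadline_s = deadline_s
        self.entries = deque()        # (enqueue_time, request), oldest at left

    def push(self, request, now=None):
        self.entries.append((time.monotonic() if now is None else now, request))

    def evict_expired(self, now=None):
        """Drop and return requests that have waited past the deadline.
        Oldest entries sit at the left end, so pruning stops early."""
        now = time.monotonic() if now is None else now
        expired = []
        while self.entries and now - self.entries[0][0] > self.deadline_s:
            expired.append(self.entries.popleft()[1])
        return expired

    def pop(self, now=None):
        """Serve the newest request, pruning stale entries first."""
        self.evict_expired(now)
        return self.entries.pop()[1] if self.entries else None

q = DeadlineLifoQueue(deadline_s=0.050)
q.push("old", now=0.000)
q.push("mid", now=0.030)
q.push("new", now=0.060)
print(q.evict_expired(now=0.062))   # ['old'] has blown its 50 ms budget
print(q.pop(now=0.062))             # 'new' jumps ahead of 'mid'
```

Eviction can surface the expired requests to the caller (to send an error or record a metric) instead of silently letting them rot at the bottom of the stack.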

LIFO is a tail-latency optimisation for read-heavy, time-bounded, retry-tolerant paths. It's a poor default for durable write pipelines.

Related concepts

  • concepts/head-of-line-blocking — the general name for the phenomenon across protocols (HTTP/1, TCP streams, broker partitions). FIFO queues are a common source of HoL blocking.
  • concepts/backpressure — the upstream flow-control alternative. Instead of queuing, push back on producers.
  • concepts/timeout — the per-request ceiling that lets LIFO prune stale entries safely.
