
Spiky traffic

Definition

Spiky traffic is the pattern where request arrivals have high variance on short timescales — bursts arriving within seconds followed by troughs, rather than a smooth Poisson-like arrival process. The defining operational property: the burst's wall-clock duration is shorter than the infrastructure's response time to scale up, so scale-out autoscaling can't smooth the spike in time (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
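The variance contrast in the definition can be made concrete with the index of dispersion (variance of per-second arrival counts divided by their mean), which is ~1 for a Poisson process and much larger for bursty traffic. A minimal sketch with illustrative numbers, not taken from the source:

```python
from statistics import mean, pvariance

def dispersion(counts):
    """Index of dispersion (variance / mean) of per-second arrival
    counts: ~1 for a Poisson-like process, >>1 for spiky traffic."""
    return pvariance(counts) / mean(counts)

# Two streams with the same average rate (10 req/s over 60 s):
smooth = [10] * 60              # steady arrivals
spiky = [0] * 54 + [100] * 6    # a 6-second burst, then quiet

print(dispersion(smooth))  # 0.0
print(dispersion(spiky))   # 90.0
```

Both streams present identical average load to a capacity planner; only the dispersion reveals that the second one will overwhelm steady-state provisioning during its burst.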

Why it's a distinct capacity-planning problem

Spiky traffic breaks the two standard capacity-planning assumptions:

  1. Provisioning to peak wastes capacity at the trough. If the spike is 10× average QPS for 30 seconds at unpredictable times, provisioning at 10× leaves ~90% of capacity idle the rest of the time, which is expensive and environmentally wasteful.
  2. Autoscaling can't keep up. Container startup, GPU / model warm-up, load-balancer registration, and DNS propagation together put scaling latency at tens of seconds to minutes. For spikes shorter than that, every request hits the pre-spike capacity, and tail latency explodes during every burst regardless of scaling policy.
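The arithmetic behind both failure modes is simple enough to write down. A back-of-envelope sketch with hypothetical numbers (10× spikes, 30-second bursts, 60-second scale-out latency — all illustrative):

```python
# Failure mode 1: peak provisioning idles capacity at the trough.
avg_qps = 100             # steady-state load (illustrative)
peak_multiplier = 10      # spike height relative to average
provisioned_qps = avg_qps * peak_multiplier

steady_utilization = avg_qps / provisioned_qps
print(f"steady-state utilization: {steady_utilization:.0%}")  # 10%
print(f"idle at the trough: {1 - steady_utilization:.0%}")    # 90%

# Failure mode 2: the burst ends before new capacity arrives.
burst_s, scaleout_s = 30, 60
print("scale-out arrives in time:", scaleout_s < burst_s)     # False
```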

Voyage AI names this directly for query-embedding inference:

"Query traffic is pretty spiky, so autoscaling is too slow."

Causes

Common sources of spiky traffic in production systems:

  • Human-driven product workflows — search queries, chatbot prompts, page-load bursts triggered by announcements / news / notifications.
  • Scheduled clients — cron-triggered batch jobs, Monday-morning login surges, hourly ingestion pipelines.
  • Retry storms — downstream outage triggers synchronised client retries; the retry spike can exceed steady-state load.
  • Viral events — social-media-driven traffic to a specific URL / product / feature.
  • Upstream failover — primary region outage shifts all traffic to secondaries within seconds.

Embedding-inference workloads inherit spikiness from the upstream product (search / retrieval / recommendation query frontends).

Mechanisms for absorbing spiky traffic

Since horizontal scale-out can't keep up, practical responses focus on in-place absorption:

  • Batching — bursts of similar requests compose into single operations. Canonical instance: token-count-based batching for embedding inference — bursts of short queries merge into one GPU forward pass instead of serialising through the GPU at low MFU.
  • Queue-in-front — an explicit queue (Redis, SQS, Kafka) decouples burst absorption from processing. Bursts fill the queue; processing drains it at steady rate. Tail latency during bursts becomes queue-wait latency (bounded by queue depth + TTL) rather than timeout at the server.
  • Pre-warmed hot capacity — warm pool of replicas kept idle so the response to a burst is dispatch rather than startup. Pays baseline cost; eliminates startup penalty.
  • Admission control — explicit rejection / 429 on burst so clients see fast failure rather than slow tail; caller-side retry budget + backoff.
  • Backpressure (concepts/backpressure) — signal upstream to slow down when downstream queues fill; converts load-shedding problem into cooperation protocol.
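The first mechanism, token-count-based batching, can be sketched minimally. Assuming queries arrive as `(id, token_count)` pairs (a hypothetical shape, not Voyage AI's actual API), a greedy packer fills each batch up to a token budget:

```python
from collections import deque

def batch_by_token_count(queries, max_batch_tokens):
    """Greedy token-count-based batching: pack queued queries into
    batches whose total token count stays under a budget, so a burst
    of short queries becomes a few large GPU forward passes instead
    of many tiny ones. A sketch of the idea, not Voyage AI's
    implementation. `queries` is a list of (id, token_count) pairs.
    """
    pending = deque(queries)
    batches = []
    while pending:
        batch, tokens = [], 0
        while pending and tokens + pending[0][1] <= max_batch_tokens:
            qid, n = pending.popleft()
            batch.append(qid)
            tokens += n
        if not batch:  # a single query over budget goes out alone
            qid, _ = pending.popleft()
            batch.append(qid)
        batches.append(batch)
    return batches

# A burst of 8 short queries packs into 2 batches under a 100-token budget:
burst = [(i, 25) for i in range(8)]
print(batch_by_token_count(burst, max_batch_tokens=100))
# → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The batch boundary is a token budget rather than a request count, so short-query bursts pack densely while a single long document still fills a batch alone — which is what keeps per-pass GPU work roughly constant.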

Why batching is strictly better than autoscaling for spiky inference

Voyage AI's 2025-12-18 result makes the quantitative case. The production rollout of token-count batching on top of padding-removal vLLM achieves:

  • P90 end-to-end latency more stable during traffic spikes, even with fewer GPUs.
  • 50 % GPU-inference-latency reduction, 3× fewer GPUs for voyage-3-large serving.

The GPU that was under-utilised at steady state (memory-bound / low MFU on short queries) becomes the burst-absorption capacity because batching converts stranded memory-bound cycles into useful compute during bursts. Autoscaling was never going to deliver this — it can only add capacity slowly, not use existing capacity better.

Predictable vs. aperiodic spiky traffic

MongoDB's 2026-04-07 predictive-auto-scaling retrospective (Source: sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment) refines the taxonomy by naming a predictable-spiky subclass: bursts that recur on a daily or weekly cycle, or that follow a steady rising trend. For those, the structural force changes:

  • Aperiodic spikes (viral events, retry storms, unannounced batch jobs) → batching / queue-in-front / admission control / pre-warmed hot capacity are the only levers; autoscaling is structurally too slow.
  • Predictable spikes can be forecast ahead of time, so the system scales up before they arrive and new capacity is live when the spike hits. The autoscaling-too-slow force doesn't apply because the scaling action starts before the spike that would have triggered it is visible.

Both subclasses are "spiky traffic" in the burst-shorter-than-reaction-time sense; they're distinguished by whether the pattern is short-horizon predictable. Predictive auto-scaling converts a predictable-spiky workload from "tail latency blows up during every burst" to "the right capacity is already there." MongoDB reports most Atlas replica sets are daily-seasonal, ~25% weekly-seasonal — i.e., the majority of their predictable-spiky population is forecast-addressable.
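The forecast-then-scale idea can be sketched with a deliberately minimal seasonal-naive forecaster (repeat the last full period). This is a stand-in for MongoDB's MSTL + ARIMA pipeline, not a reproduction of it; all numbers are hypothetical:

```python
def seasonal_naive_forecast(history, period, horizon):
    """Forecast the next `horizon` points of a seasonal series by
    repeating the last full period. A minimal stand-in for an
    MSTL + ARIMA pipeline: enough to show why a daily cycle makes
    pre-emptive scaling possible."""
    last_period = history[-period:]
    return [last_period[i % period] for i in range(horizon)]

def capacity_plan(forecast, headroom=1.2):
    """Scale to the forecast peak plus headroom *before* the spike."""
    return max(forecast) * headroom

# Hypothetical daily-seasonal QPS sampled hourly over two days,
# with a 9-11 am spike:
day = [100] * 8 + [1000] * 2 + [100] * 14
history = day * 2
tomorrow = seasonal_naive_forecast(history, period=24, horizon=24)
print(capacity_plan(tomorrow))  # 1200.0 — target ready before 9 am
```

For an aperiodic spike this forecaster is useless (yesterday says nothing about a retry storm), which is exactly the subclass boundary the taxonomy draws.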

Seen in

  • 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — canonical wiki instance; short query-embedding traffic named as "pretty spiky, so autoscaling is too slow"; token-count batching is the design response (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

  • 2026-04-07 MongoDB — Predictive auto-scaling: an experiment — predictable-spiky subclass at the MongoDB Atlas replica-set capacity layer. Forecasting (MSTL + ARIMA over weeks of customer-driven metrics) converts daily / weekly seasonal spikes into pre-emptive scaling rather than in-place absorption. Contrast with the 2025-12-18 Voyage AI post: same architectural force (scaling latency > burst duration) resolved at a different layer (capacity-tier forecasting vs. inference-layer in-place batching), because the two workloads sit at different timescales (replica-set ~minutes, GPU-embedding-inference ~seconds). (sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment)
