

Late-arriving data

Definition

Late-arriving data is event data that reaches the analytics / streaming system after the "wall clock" bucket that the event belongs to has already started being processed or queried. Events can be delayed for many reasons — network partitions, client buffering, mobile offline → online syncs, upstream backpressure, processing lag in the ingestion pipeline. The result is that the value of a recent time bucket is not yet final — more events may still land in it.

Two forcing-function properties

  1. Recent buckets are unstable. A data point from 30 seconds ago might still change as late-arriving events trickle in.
  2. Old buckets are final. A data point from 30+ minutes ago is almost certainly settled.

This asymmetry is the forcing function behind several real-time-analytics design choices.

Design consequences

(a) Query-time watermarks

Many operational dashboards explicitly set the right boundary to now - 1m or now - 5s to avoid the "flappy tail" of currently-arriving data flickering across refreshes. Netflix's Druid-cache post names this pattern directly (Source: sources/2026-04-06-netflix-stop-answering-the-same-question-twice-interval-aware-caching-for-druid).
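The pattern can be sketched as a small helper that trims the right edge of the query interval before it is sent to the store. A minimal sketch; the function name, the default `trim` of one minute (the `now - 1m` pattern above), and the lookback parameter are illustrative assumptions, not an API from the source.

```python
from datetime import datetime, timedelta, timezone

def query_interval(lookback: timedelta,
                   trim: timedelta = timedelta(minutes=1)):
    """Return a (start, end) interval whose right boundary is pulled
    back by `trim`, so the flappy tail of still-arriving data is
    excluded from the dashboard query."""
    now = datetime.now(timezone.utc)
    end = now - trim           # exclude unstable trailing buckets
    start = end - lookback
    return start, end

start, end = query_interval(timedelta(hours=1))
```

Successive refreshes then see a stable right edge instead of a bucket that changes value on every reload.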

(b) Age-based exponential TTLs in caches

Because recent buckets are unstable and old buckets are final, a cache that serves time-series data can safely cache old buckets for a long time but must refresh recent buckets quickly. The canonical wiki instance is Netflix's Druid cache: a 5 s TTL for <2-min-old buckets, doubling per additional minute of age, capped at 1 hour.
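One plausible reading of that schedule, as a pure function from bucket age to TTL (a sketch under stated assumptions; the exact rounding and parameter names are illustrative, not taken from Netflix's implementation):

```python
def bucket_ttl_seconds(bucket_age_seconds: float,
                       base_ttl: int = 5,
                       threshold_minutes: int = 2,
                       cap: int = 3600) -> int:
    """Age-based exponential TTL: `base_ttl` for buckets younger than
    `threshold_minutes`, doubling for each additional minute of age,
    capped at `cap` seconds (1 hour)."""
    age_minutes = int(bucket_age_seconds // 60)
    doublings = max(0, age_minutes - (threshold_minutes - 1))
    return min(cap, base_ttl * (2 ** doublings))
```

Under this reading a 30-second-old bucket gets a 5 s TTL, a 2.5-minute-old bucket gets 10 s, and anything older than roughly 12 minutes hits the 1-hour cap.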

(c) Druid's cache conservatism

Apache Druid itself refuses to cache query results that involve realtime segments (still-being-indexed buckets) in its built-in full-result cache, precisely because late arrivals could invalidate the result. Druid prefers deterministic, stable cache results and query correctness over a higher hit rate (Source: sources/2026-04-06-netflix-stop-answering-the-same-question-twice-interval-aware-caching-for-druid).

(d) Trailing-bucket exception in negative caching

Netflix's Druid cache will negative-cache empty interior buckets (genuinely sparse metrics) but not empty trailing buckets — because an empty trailing bucket might just be data that hasn't arrived yet.
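The interior-vs-trailing distinction reduces to: an empty bucket is safe to negative-cache only if some later bucket in the same result is non-empty, proving data for that time range has already flowed past it. A minimal sketch of that rule (function name and the `(start, count)` representation are assumptions for illustration):

```python
def negative_cacheable_empties(buckets):
    """Given ordered (bucket_start, event_count) pairs from one query,
    return the starts of empty buckets that are safe to negative-cache:
    interior holes, i.e. empty buckets followed by a non-empty one.
    Empty *trailing* buckets are excluded, since their data may simply
    not have arrived yet."""
    last_nonempty = max(
        (i for i, (_, n) in enumerate(buckets) if n > 0), default=-1)
    return [start for i, (start, n) in enumerate(buckets)
            if n == 0 and i < last_nonempty]
```

For a result like `[(0, 3), (1, 0), (2, 5), (3, 0), (4, 0)]`, only the hole at bucket 1 is cacheable; the trailing empties at 3 and 4 are left uncached.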

(e) Streaming-system watermarks

The streaming-system world (Flink, Spark Structured Streaming, Dataflow / Beam) has the same problem at an earlier layer: watermarks and allowed-lateness are how stream processors bound how long to wait for late events before closing a window. Netflix's Druid cache is solving a downstream version of the same problem at the query / dashboard layer rather than the aggregation layer.
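The watermark / allowed-lateness mechanic can be illustrated with a toy event-time windower (a sketch of the general idea only; window size, lateness bound, and the counting logic are assumptions, not Flink or Beam APIs):

```python
class WindowedCounter:
    """Toy event-time window counter with a watermark and allowed
    lateness. A window [w, w + window_s) closes once the watermark
    passes w + window_s + allowed_lateness_s; events arriving for a
    closed window are dropped."""

    def __init__(self, window_s=60, allowed_lateness_s=30):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.watermark = 0      # max event time seen so far
        self.open = {}          # window start -> running count
        self.closed = {}        # finalized window counts
        self.dropped = 0        # events too late to be counted

    def on_event(self, event_time):
        self.watermark = max(self.watermark, event_time)
        w = event_time - event_time % self.window_s
        deadline = w + self.window_s + self.allowed_lateness_s
        if deadline <= self.watermark:
            self.dropped += 1   # window already closed: too late
        else:
            self.open[w] = self.open.get(w, 0) + 1
        # finalize any window the watermark has moved past
        for s in [s for s in self.open
                  if s + self.window_s + self.allowed_lateness_s
                  <= self.watermark]:
            self.closed[s] = self.open.pop(s)
```

The cache's age-based TTLs play the same role one layer up: instead of dropping late events, they bound how long a query result for a recent bucket is trusted.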

Netflix's pipeline latency number

Netflix's end-to-end data pipeline latency is <5 s at P90 (Source: sources/2026-04-06-netflix-stop-answering-the-same-question-twice-interval-aware-caching-for-druid). This number bounds how much late-arrival ambiguity is likely: after 5 s, most events that were ever going to arrive have arrived. That bound is what justifies the cache's 5 s minimum TTL on the freshest bucket — the cache adds no meaningful staleness on top of what the pipeline's P90 lag is already introducing.

Relationship to ordering

Late-arriving data is not the same as out-of-order events on the same key — but they co-occur. Late arrivals are usually also out-of-order with respect to wall-clock ingestion order, and event-time-ordered processing (rather than arrival-time-ordered) is the standard mitigation. The concepts/last-write-wins page notes "late-arriving older events don't clobber newer values — LWW is aligned with the application's ordering, not the arrival order" — which is the same trade-off seen from the write-side.
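That write-side trade-off fits in a few lines: key last-write-wins on event time, not arrival order, so a late-arriving older event cannot clobber a newer value. A minimal sketch (function name and the `(value, event_time)` store layout are illustrative assumptions):

```python
def lww_upsert(store: dict, key, value, event_time):
    """Last-write-wins keyed on *event time* rather than arrival order:
    an incoming write only replaces the stored value if its event time
    is at least as new as the stored one."""
    current = store.get(key)
    if current is None or event_time >= current[1]:
        store[key] = (value, event_time)
    return store[key][0]

store = {}
lww_upsert(store, "k", "v1", event_time=100)
lww_upsert(store, "k", "v2", event_time=200)
lww_upsert(store, "k", "stale", event_time=150)  # late, older: ignored
```

After the three writes above, `store["k"]` still holds `"v2"`, even though `"stale"` arrived last.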

Seen in
