
CONCEPT Cited by 1 source

Consul streaming vs. long-polling

Definition

A concrete case study in how a concurrency-primitive design choice, made as a performance optimization, regressed catastrophically under extreme load: Consul's streaming read model outperformed long-polling at moderate load but blocked KV writes at very high load. Documented in the October 2021 Roblox outage post-mortem and summarized in the High Scalability December 2022 roundup.

The two mechanisms

Long-polling (older Consul read model):

  • Each client holds a connection open to Consul, waiting for a KV change notification.
  • Internally implemented with multiple Go channels, one per subscription-like unit.
  • At high load, contention is spread across those channels rather than concentrated on one.

Streaming (newer Consul feature, in use at Roblox in 2021):

  • A more efficient overall read model: it shares data from one stream source with many subscribers.
  • Internally uses fewer Go channels as concurrency-control elements.
  • At moderate load, streaming is more efficient than long-polling.
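
The single-source shape can be sketched the same way. Again, this is an illustration under assumptions, not Consul's streaming code: `Publisher`, `Event`, `Subscribe`, and `Publish` are hypothetical names. The point is only that every write funnels through one channel (`in`) before being fanned out to subscribers.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical change notification.
type Event struct{ Key, Value string }

// Publisher fans events from ONE input channel out to many subscribers.
// It is a toy model, not Consul's streaming implementation.
type Publisher struct {
	in   chan Event // the single channel every write passes through
	mu   sync.Mutex
	subs []chan Event
	done chan struct{}
}

func NewPublisher(buf int) *Publisher {
	p := &Publisher{in: make(chan Event, buf), done: make(chan struct{})}
	go p.run()
	return p
}

// run drains the single input channel and copies each event to every
// subscriber. A subscriber whose buffer is full stalls the whole loop,
// which in turn backs up p.in and eventually blocks Publish.
func (p *Publisher) run() {
	defer close(p.done)
	for ev := range p.in {
		p.mu.Lock()
		for _, sub := range p.subs {
			sub <- ev
		}
		p.mu.Unlock()
	}
	p.mu.Lock()
	for _, sub := range p.subs {
		close(sub)
	}
	p.mu.Unlock()
}

// Subscribe registers a new subscriber with its own buffered channel.
func (p *Publisher) Subscribe(buf int) <-chan Event {
	ch := make(chan Event, buf)
	p.mu.Lock()
	p.subs = append(p.subs, ch)
	p.mu.Unlock()
	return ch
}

// Publish blocks when the single input channel is full.
func (p *Publisher) Publish(ev Event) { p.in <- ev }

// Close stops the publisher and closes all subscriber channels.
func (p *Publisher) Close() {
	close(p.in)
	<-p.done
}

func main() {
	p := NewPublisher(8)
	a, b := p.Subscribe(8), p.Subscribe(8)
	p.Publish(Event{Key: "routes/web", Value: "v1"})
	p.Publish(Event{Key: "routes/api", Value: "v1"})
	p.Close()
	for ev := range a {
		fmt.Println("a got", ev.Key)
	}
	for ev := range b {
		fmt.Println("b got", ev.Key)
	}
}
```

Compared with a per-key design, this uses far fewer channels and does less duplicated work per subscriber, but every writer now depends on the same channel staying drained.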

The failure mode

Under very high simultaneous read and write load, contention collapsed onto a single Go channel, which blocked Consul KV writes. Roblox hit this load profile after enabling streaming on its traffic-routing tier while also growing that tier's node count by 50%.

Outcome:

  • Streaming became less efficient than long-polling at that load point.
  • Consul KV writes blocked for tens of seconds.
  • Because Roblox ran all backend services on a single Consul cluster, the blocked writes cascaded into a 73-hour fleet-wide outage.

Rolling streaming back across all Consul systems dropped KV write P50 from "blocked for long periods" to 300 ms.

Lessons

  1. An optimization validated at moderate concurrency can regress at extreme concurrency, especially when it consolidates a concurrency-control primitive that was previously sharded.
  2. Enabling a feature and scaling up infrastructure at the same time obscures the root cause: was it the feature or the scale? Roblox had to disable the feature fleet-wide to be sure.
  3. Operational hygiene: when rolling out a new read model on a control-plane service, ramp slowly and watch the write path, not just the read path. Streaming was "more efficient" on reads, but the write path was what died.
  4. A single cluster is a single point of failure. Roblox's post-outage remediation was to add a second, geographically distinct datacenter and to run multi-AZ within each DC.

Seen in
