CONCEPT Cited by 1 source
Consul streaming vs. long-polling
Definition
A concrete case study in how a concurrency-primitive design choice, made as a performance optimization, regressed catastrophically under extreme load. It is documented in the October 2021 Roblox outage post-mortem and summarized in the High Scalability December 2022 roundup.
The two mechanisms
Long-polling (older Consul read model):
- Each client holds a connection open to Consul, waiting for a KV change notification.
- Internally implemented using multiple Go channels — one per subscription-like unit.
- At high load, contention is spread across the Go channels.
Streaming (newer Consul feature, as deployed at Roblox in 2021):
- A more efficient overall read model — shares data from one stream source to many subscribers.
- Internally uses fewer Go channels as concurrency-control elements.
- At moderate load, streaming is more efficient than long-polling.
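The structural difference between the two models can be sketched in Go. This is a deliberately simplified illustration, not Consul's actual implementation: the `Event` type and the `longPoll` and `streaming` functions are hypothetical names introduced here, and the per-watcher and shared channels only approximate the shapes described above.

```go
package main

import "fmt"

// Event is a hypothetical KV change notification (not a Consul type).
type Event struct {
	Key   string
	Index uint64
}

// longPoll models the older shape: every watcher owns its own channel,
// so the writer performs one send per watcher per event.
func longPoll(events []Event, watchers int) [][]Event {
	chans := make([]chan Event, watchers)
	for i := range chans {
		chans[i] = make(chan Event, len(events))
	}
	for _, ev := range events {
		for _, ch := range chans {
			ch <- ev // contention is spread across many channels
		}
	}
	out := make([][]Event, watchers)
	for i, ch := range chans {
		close(ch)
		for ev := range ch {
			out[i] = append(out[i], ev)
		}
	}
	return out
}

// streaming models the newer shape: the writer sends each event once to a
// shared channel, and a dispatcher copies it to every subscriber.
func streaming(events []Event, subscribers int) [][]Event {
	src := make(chan Event, len(events))
	for _, ev := range events {
		src <- ev // a single send, regardless of subscriber count
	}
	close(src)
	out := make([][]Event, subscribers)
	for ev := range src {
		for i := range out {
			out[i] = append(out[i], ev)
		}
	}
	return out
}

func main() {
	evs := []Event{{Key: "routes/a", Index: 1}, {Key: "routes/b", Index: 2}}
	fmt.Println(len(longPoll(evs, 3)), len(streaming(evs, 3)))
}
```

Both functions deliver the same events to every reader. The difference is where the writer synchronizes: on many channels in the first model, on one in the second.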
The failure mode
Under very high simultaneous read and write load, contention collapsed onto a single Go channel, which blocked Consul KV writes. Roblox hit this load profile after enabling streaming on its traffic-routing tier while also growing that tier's node count by 50%.
Outcome:
- Streaming became less efficient than long-polling at that load point.
- Consul KV writes blocked for tens of seconds.
- Because Roblox ran all backend services on a single Consul cluster, the blocked writes cascaded into a 73-hour fleet-wide outage.
Rolling streaming back across all Consul systems brought KV write P50 latency down to 300 ms, from a state in which writes had been blocking for long periods.
Lessons
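A toy model can make the failure mode concrete. The sketch below is an assumption-laden illustration rather than a reproduction of Consul's behavior: `blockedSends` is a hypothetical helper, every channel is assumed to have the same capacity, no reader drains during the burst, and a non-blocking `select` stands in for "the writer would have had to wait here".

```go
package main

import "fmt"

// blockedSends counts how many of `writes` sends would have blocked when
// distributed round-robin over `shards` channels of capacity `capacity`,
// with no reader draining during the burst (a modelling assumption).
func blockedSends(writes, shards, capacity int) int {
	chans := make([]chan int, shards)
	for i := range chans {
		chans[i] = make(chan int, capacity)
	}
	blocked := 0
	for w := 0; w < writes; w++ {
		select {
		case chans[w%shards] <- w:
			// The send fit in the channel buffer; the writer proceeds.
		default:
			// The buffer was full; a real writer would block here.
			blocked++
		}
	}
	return blocked
}

func main() {
	fmt.Println("moderate burst, single channel:", blockedSends(8, 1, 8))
	fmt.Println("large burst, 8 channels:       ", blockedSends(64, 8, 8))
	fmt.Println("large burst, single channel:   ", blockedSends(64, 1, 8))
}
```

In this model a moderate burst fits on a single channel without blocking, and a large burst spread over eight channels also fits. The same large burst aimed at one channel leaves most writers waiting, which is the shape of the regression described above.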
- An optimization validated at moderate concurrency can regress at extreme concurrency, especially when it consolidates a concurrency-control primitive that was previously sharded.
- Enabling a feature and scaling up infrastructure at the same time obscures the root cause: was it the feature or the scale? Roblox had to disable the feature fleet-wide to be sure.
- Operational hygiene: when rolling out a new read model on a control-plane service, ramp slowly and watch the write path, not just the read path. Streaming was "more efficient" on read — but the write path was what died.
- A single cluster is a single point of failure. Roblox's post-outage remediation was to add a second, geographically distinct datacenter and to run across multiple availability zones within each datacenter.
Seen in
- sources/2022-12-02-highscalability-stuff-the-internet-says-on-scalability-for-december-2nd-2022
- systems/roblox-hashistack