Roblox HashiStack (Consul + Nomad + Vault)¶
Overview¶
Roblox runs its core infrastructure on-prem, not on a public cloud. The 2021 Return-to-Service post-mortem (summarized in the High Scalability Dec-2022 roundup) disclosed the scale and the stack:
- 18,000 servers
- 170,000 containers
- Orchestration on the HashiCorp "HashiStack": Nomad (scheduler), Consul (service discovery + KV), Vault (secrets).
Roblox's economic argument for owning the infrastructure is explicit: by "building and managing our own data centers for backend and network edge services, we have been able to significantly control costs compared to public cloud. These savings directly influence the amount we are able to pay to creators on the platform. Furthermore, owning our own hardware and building our own edge infrastructure allows us to minimize performance variations and carefully manage the latency of our players."
The October 2021 73-hour outage¶
Canonical case study for Consul streaming vs. long-polling load behavior under high read+write concurrency.
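For context on the two read modes being contrasted, the sketch below models the shape of Consul's long-polling "blocking queries": a reader passes the last modify index it saw and blocks until the store's index advances. This is a toy in-memory model for illustration, not Consul's actual implementation; the `store`, `put`, and `longPoll` names are invented here.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy watchable KV cell. Each write bumps a modify index and
// wakes blocked readers -- the shape of a Consul blocking query.
type store struct {
	mu    sync.Mutex
	cond  *sync.Cond
	index uint64
	value string
}

func newStore() *store {
	s := &store{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

func (s *store) put(v string) {
	s.mu.Lock()
	s.index++
	s.value = v
	s.cond.Broadcast() // wake every long-poller on each write
	s.mu.Unlock()
}

// longPoll blocks until the index exceeds lastIndex, then returns the
// new value and index. Each caller re-issues the poll after every
// change -- cheap per request, but O(watchers) wakeups per write.
func (s *store) longPoll(lastIndex uint64) (string, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.index <= lastIndex {
		s.cond.Wait()
	}
	return s.value, s.index
}

func main() {
	s := newStore()
	done := make(chan string)
	go func() {
		v, _ := s.longPoll(0) // blocks until the first write lands
		done <- v
	}()
	s.put("routing-table-v2")
	fmt.Println("long-poller observed:", <-done) // prints: long-poller observed: routing-table-v2
}
```

Streaming replaces this request-per-change loop with a server-pushed subscription; the outage showed how the push path's internal fan-in can itself become the bottleneck.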
Timeline:
- 2021-10-27 14:00 — Roblox enabled Consul's (then new) streaming feature on a backend service responsible for traffic routing. Simultaneously, the node count of that traffic-routing tier was increased by 50% to handle expected end-of-year traffic.
- 2021-10-28 — Major outage begins. The engineering team's initial theory: increased traffic. Over the next ~10 hours the team works through Consul internals — debug logs, OS-level metrics. The data showed Consul KV writes being blocked for long periods ("contention").
- 2021-10-28 15:51 — Streaming disabled across all Consul systems. Consul KV write P50 drops to 300 ms.
- 73 hours later — Full return to service, progressive DNS-steering-based traffic ramp-up in ~10% increments.
Root cause (per HashiCorp)¶
Streaming was more efficient than long-polling on average because its implementation used fewer concurrency primitives (Go channels). Under very high simultaneous read and write load, however, the design concentrated contention onto a single Go channel, which blocked writes — making streaming less efficient than long-polling at exactly that load point.
This is a textbook example of:
- An optimization validated at moderate concurrency that regresses at extreme concurrency in a concurrency-primitive-dependent way.
- Feature enablement + infrastructure scale-up done simultaneously, making it ambiguous which change caused the regression.
- Single-cluster single-point-of-failure — running all Roblox backend services on one Consul cluster meant any Consul contention = fleet-wide outage.
Post-outage remediation¶
- Multi-cluster geographical distribution: "We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services."
- Multi-AZ within each DC: "We have efforts underway to move to multiple availability zones within these data centers."
- Progressive-ramp traffic-acceptance pattern (DNS-steered percentage-rollout) became the canonical restore playbook — see patterns/fast-rollback / patterns/staged-rollout.
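The progressive-ramp pattern can be sketched as a gated percentage loop. This is a hedged sketch, not Roblox's tooling: `setWeight` stands in for the real DNS-steering integration and `healthy` for whatever monitoring gate approves each step.

```go
package main

import (
	"fmt"
)

// progressiveRamp admits traffic in fixed percentage steps (Roblox used
// ~10% DNS-steered increments), gating each step on a health check and
// falling back to the last known-good percentage on failure.
// setWeight and healthy are placeholders for real integrations.
func progressiveRamp(setWeight func(pct int) error, healthy func() bool, step int) error {
	for pct := step; pct <= 100; pct += step {
		if err := setWeight(pct); err != nil {
			return err
		}
		if !healthy() {
			setWeight(pct - step) // roll back one step and halt the ramp
			return fmt.Errorf("ramp halted at %d%%: unhealthy", pct)
		}
	}
	return nil
}

func main() {
	var applied []int
	setWeight := func(pct int) error { applied = append(applied, pct); return nil }
	healthy := func() bool { return true }
	if err := progressiveRamp(setWeight, healthy, 10); err != nil {
		fmt.Println("ramp failed:", err)
		return
	}
	fmt.Println("ramp steps:", applied) // prints: ramp steps: [10 20 30 40 50 60 70 80 90 100]
}
```

The key property is that a regression discovered at, say, 30% only exposes a bounded slice of traffic, and the rollback target is always the previous known-good weight.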
Also cited¶
- "1 billion requests per second" handled by Roblox's caching system (from the same Return-to-Service post) — the cited Number Stuff datapoint in the roundup.