CONCEPT Cited by 3 sources

Power of Two Choices (P2C)¶

Definition¶

Power of Two Choices (P2C) is a randomised load-balancing primitive that picks two backends uniformly at random for each request and then deterministically routes to the one with lower observed load (typically active in-flight requests). It is the load-balancing concept; the P2C pattern page documents its productionised use.

The "concept" framing here is the balls-in-bins probabilistic result that motivates the algorithm; the "pattern" framing is the engineering recipe (where to embed it, what signal to use, when to fall back to consistent hashing).

The probabilistic result¶

Mitzenmacher's PhD thesis (Harvard, 1996) shows that with N backends, throwing balls into bins via:

Algorithm	Expected max load
Pure random (one bin per ball)	`Θ(log N / log log N)`
Power of Two Choices (sample two, pick lighter)	`Θ(log log N / log 2)` ≈ ~constant for practical `N`

Two random samples drop the worst-case max-load from log-over-log-log to log-log — an exponential improvement at near-zero overhead.

The intuition: with pure random, several consecutive bad-luck draws can pile load onto the same backend before the system notices. With P2C, every decision is made against a sample of two, so local imbalances get corrected continuously. No state is shared between deciders; the algorithm is fully stateless per request.

Why it works in practice¶

Asymptotically near-optimal load balance with O(1) sampling cost. No coordination, no shared state, no broadcasting load signals between clients.
Stateless per-request decision. A new client can join the pool and start making correct decisions immediately, given visibility into endpoint health (e.g. via EDS or xDS).
Robust to bursty traffic. Each request is sampled independently, so a thundering herd does not pile on a single unlucky backend the way pure-random would.
Defeats round-robin's hidden bias at high QPS: round-robin is correct only if requests have uniform cost and arrival inter-arrival times. Real workloads violate both, and round- robin's hotspots become visible at the tail.

When to use it¶

Default LB strategy for stateless or session-less services with no caller-shard requirement.
Services where the caller has a per-backend observable load signal — the standard one is active in-flight requests; can also be CPU utilisation, recent-latency EWMA, queue depth, etc.
High-QPS environments where round-robin starts to show tail-latency hotspots from non-uniform request cost.

When it's not enough¶

Stickiness required (session affinity, cache locality per key) → use concepts/consistent-hashing or bounded-load consistent hashing instead. Same fleet can run different LB algorithms per endpoint.
Shard-aware routing (the request must go to the owner of a particular key) → key-based dispatch.
Long-tailed request cost — if one in-flight request can equal 100 short ones, "active requests" is a misleading load signal. Mitigations: weight by expected cost, use a latency-EWMA signal, or move to a feedback-controlled scheduler (see concepts/feedback-control-load-balancing).

Canonical wiki disclosures¶

Databricks intra-cluster Armeria RPC (2025-10-01)¶

P2C is Databricks' default client-side LB algorithm, embedded in their Armeria RPC library. The Endpoint Discovery Service (EDS) streams endpoint state to the Armeria client, and P2C decides over that state per request:

"P2C strikes a strong balance between performance and implementation simplicity, consistently leading to uniform traffic distribution across endpoints." (Source: sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing)

More exotic strategies (CPU-skew-routed, multi-metric weighted) were tried and retracted in favour of P2C + direct health signals — a rare published "we tried fancy and went back to simple" datum.

Databricks Model Serving — 200K QPS production validation (2026-05-08)¶

The 2026-05-08 Databricks / Superhuman post promotes the same P2C algorithm out of intra-cluster RPC and into managed external inference at 200,000+ QPS sustained:

"While the default Kubernetes round robin load balancer is sufficient at low QPS, our tests revealed that this performance degrades at higher QPS, with uneven request distribution creating hotspots that spike tail latency."

"For each request, two candidate pods are sampled and traffic is routed to whichever has fewer active requests, preventing the hotspots that round-robin creates at high QPS."

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

The "active requests" signal is the per-pod request_concurrency counter — the same primitive the autoscaler tracks for capacity decisions. P2C reads it for routing; the autoscaler reads it averaged-across-pods for scaling. Same signal, two control loops.

This is the wiki's first canonical production-validation datum for P2C at the 200K-QPS-sub-1s-p99 envelope on a managed inference platform, with explicit framing that default Kubernetes round-robin fails at this scale.

Databricks LLM serving — P2C retired for LLM workloads (2026-05-27)¶

The 2026-05-27 Reliable LLM Inference at Scale post explicitly retires P2C-with-active-requests for LLM serving in favour of cost-based routing on model units:

"In general, load balancing tends to lean on statistical approaches like P2C (power of two choices), which estimate load based on queue size and leverage sampling to reduce the memory and latency overheads of understanding all the possible targets. However, LLM latencies tend to be high, server counts are lower than scaled out CPU systems, and the cost of misrouting is severe. Therefore, LLM serving necessitates a different approach. Today, we use Dicer, Databricks' auto-sharder, to dynamically route workloads across servers."

(Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale)

The structural argument (latency × server count × misrouting cost) is the wiki's first canonical regime-change datum: P2C's correctness depends on regime properties. P2C remains correct in classical CPU serving (high server count, low base latency, low misrouting cost) — but at LLM scale on small fleets of frontier-GPU pods with seconds-scale request latency, the balls-in-bins probabilistic guarantee no longer dominates the cost-of-misrouting penalty.

The replacement: load-aware routing on Dicer keyed on per-pod model-unit load rather than per-pod active-request count, in Axon, the LLM data-plane router. See patterns/cost-based-load-balancing-llm, concepts/non-uniform-llm-request-cost, concepts/model-units.

The Databricks corpus now hosts three canonical P2C disclosures at three regimes: intra-cluster RPC (use P2C), managed CPU-style serving at 200K QPS (use P2C), LLM serving on frontier GPUs (do not use P2C-with-active-requests). Same team, same platform, explicitly different choice per regime.

Composition¶

Above P2C in the stack: zone-affinity routing (patterns/zone-affinity-routing when present) — P2C within the preferred zone, spill over to other zones on health failure.
Below P2C: the endpoint state from EDS / xDS / client-side LB.
Composable with feedback control: P2C is reactive (per- request stateless), feedback-controlled scheduling is convergent (state across requests). Running P2C inside feedback-controlled per-pod weights is a plausible composition.

Canonical implementations¶

Envoy LEAST_REQUEST LB policy uses P2C under the hood.
Databricks Armeria RPC client — see the 2025-10-01 post.
Databricks Model Serving — see the 2026-05-08 post.
(Many service meshes / API gateways also implement P2C; the wiki pattern page tracks the production embeddings.)

Reference¶

M. Mitzenmacher, The Power of Two Random Choices: A Survey of Techniques and Results, in Handbook of Randomized Computing, S. Rajasekaran, P. Pardalos, J. Reif, J. Rolim (eds.), Kluwer Academic, 2001. (Harvard PDF)
M. Mitzenmacher, PhD thesis, Berkeley 1996. (Harvard PDF)

Seen in¶

sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing — canonical algorithmic write-up + the we-tried-fancy-and-went- back-to-P2C datum.
sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — canonical 200K-QPS production validation datum, with explicit Kubernetes-round-robin-degrades-at-high-QPS framing.
sources/2026-05-27-databricks-reliable-llm-inference-at-scale — regime-change retirement datum: P2C-with-active-requests is explicitly not the right primitive for LLM serving on frontier GPUs at 125T+ tokens/month. Replaced by Dicer-keyed-on-model-units in Axon. Argument: high LLM latency × low server count × heavy misrouting cost.

patterns/power-of-two-choices — the productionised pattern.
patterns/cost-based-load-balancing-llm — the LLM-regime replacement for P2C-with-active-requests.
concepts/model-units — the load currency that replaces active-request-count for LLM serving.
concepts/non-uniform-llm-request-cost — the structural property that breaks count-based load metrics at LLM scale.
concepts/token-aware-load-balancing — the structurally similar Cloudflare primitive.
concepts/client-side-load-balancing — the deployment model.
concepts/round-robin-load-balancing — the alternative P2C displaces under load.
concepts/feedback-control-load-balancing — the convergent alternative.
concepts/consistent-hashing / patterns/bounded-load-consistent-hashing — used when stickiness is required instead.
concepts/request-concurrency-as-autoscaling-signal — same primitive used for routing in P2C and for capacity in the autoscaler.
systems/databricks-endpoint-discovery-service — the LB control plane that streams endpoint state to P2C clients.
systems/databricks-axon — the LLM-specific router that replaces P2C-with-active-requests at LLM scale.
systems/dicer — the routing substrate Axon is built on.
systems/envoy — LEAST_REQUEST is Envoy's P2C implementation.