High Scalability — Consistent hashing algorithm¶
Summary¶
A canonical textbook-style explainer of consistent hashing as a cache / storage partitioning primitive, republished by High Scalability from systemdesign.one. The post walks through the partitioning problem space (random / single global cache / key-range / static hash / consistent hashing), shows why the first four don't scale under dynamic load, then derives consistent hashing and its operational properties: an average of k/N keys moved when N changes, O(log n) lookup via a self-balancing BST, virtual nodes to smooth load, and two popular optimisations — multi-probe (Google, 2015) and bounded-load (Google, 2017). Closes with real-world deployments: Memcached clients (Ketama), Amazon Dynamo / DynamoDB / Apache Cassandra / Riak, Vimeo load-balancing, Netflix Open Connect CDN, Discord server-to-node mapping, and HAProxy bounded-load LB.
Key takeaways¶
- The partitioning problem has five canonical candidates — four of them don't scale. Random assignment yields uniform distribution but clients can't locate keys; a single global cache collapses availability; key-range partitioning produces non-uniform loads; static `hash(k) mod N` is O(1) but "the addition or removal of a node breaks the mapping between keys and nodes … massive data movement when the number of nodes changes" — on failure most requests become cache misses, which "might swamp and degrade" the origin server. Consistent hashing is the one that does scale.
- The mechanism in four operations (Source: post's numbered list, reproduced verbatim): (i) hash-function output placed on a virtual ring (hash ring); (ii) hashed node IPs assign node positions on the ring; (iii) key hashed with the same function; (iv) ring traversed clockwise from the key's position until a node is found. "The first node with a position value greater than the position of the key stores the data object." Each node owns the ring arc between itself and its predecessor.
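A minimal sketch of those four operations, assuming SHA-256 as the shared hash function and Python's `bisect` over a sorted list as a stand-in for the self-balancing BST the post describes; IPs and key names are illustrative.

```python
import bisect
import hashlib

def ring_position(value: str) -> int:
    # (ii)/(iii): node IPs and keys go through the same hash function
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big")  # 64-bit ring position

class HashRing:
    def __init__(self, node_ips):
        # (i)/(ii): place every node on the virtual ring, kept sorted
        pairs = sorted((ring_position(ip), ip) for ip in node_ips)
        self._positions = [p for p, _ in pairs]
        self._nodes = [n for _, n in pairs]

    def node_for(self, key: str) -> str:
        # (iv): first node with a position greater than the key's position;
        # the modulo wraps the clockwise walk past the top of the ring
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._nodes[idx % len(self._nodes)]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
owner = ring.node_for("user:42")  # deterministic: same key, same node
```

The sorted-list-plus-bisect lookup is the same O(log n) search the post attributes to the BST; a real implementation would use a balanced tree so node add/remove is also O(log n).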
- The operational property that earns its name — k/N movement on membership change. "The deletion or addition of a node results in the movement of an average number of keys stored on a single node"; remaining nodes on the ring are unaffected. Average keys per node = k/N (k = total keys, N = nodes). This is the load-bearing property behind every consistent-hashing deployment the post names.
- Self-balancing BST as the canonical implementation, with a centralized vs gossip topology choice. Node positions live in a BST: O(log n) for lookup / insert / delete. "The BST data structure is stored on a centralized highly available service. As an alternative, the BST data structure is stored on each node, and the state information between the nodes is synchronized through the gossip protocol" — Dynamo's choice. Concurrency is handled with a readers-writer lock, "at the expense of a slight increase in latency."
- Hash-function choice is fast-and-uniform, not cryptographic. MD5 / SHA-1 / SHA-256 are "not relatively fast." MurmurHash, xxHash, MetroHash, and SipHash1-3 are preferred. The hash-ring address space must be large enough to prevent collisions ("reasonable size"); the worked example uses a 2¹⁰ = 1024-position ring.
- Virtual nodes fix hot-spot risk, not just load balance. "There is a chance that nodes are not uniformly distributed on the consistent hash ring. The nodes that receive a huge amount of traffic become hotspots resulting in cascading failure of the nodes." Virtual nodes — assigning each physical node multiple ring positions via distinct hash functions — smooth the key distribution. Capacity-proportional assignment handles heterogeneity: "the nodes with a higher capacity are assigned more positions on the hash ring."
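One way to sketch this: give each physical node `capacity × replicas` ring positions, derived here by suffixing a replica index to the IP before hashing (a simple approximation of the "distinct hash functions" the post mentions; the multiplier and addresses are illustrative, not from the post).

```python
import bisect
import hashlib
from collections import Counter

def ring_position(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class VirtualNodeRing:
    REPLICAS_PER_CAPACITY_UNIT = 100  # illustrative multiplier

    def __init__(self, capacities):
        pairs = []
        for ip, capacity in capacities.items():
            for i in range(capacity * self.REPLICAS_PER_CAPACITY_UNIT):
                # suffixing a replica index stands in for distinct hash functions
                pairs.append((ring_position(f"{ip}#vnode-{i}"), ip))
        pairs.sort()
        self._positions = [p for p, _ in pairs]
        self._nodes = [n for _, n in pairs]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._nodes[idx % len(self._nodes)]

# capacity-proportional assignment: the capacity-2 node gets twice the positions
ring = VirtualNodeRing({"10.0.0.1": 1, "10.0.0.2": 1, "10.0.0.3": 2})
load = Counter(ring.node_for(f"key-{i}") for i in range(10_000))
```

With 400 total virtual positions the capacity-2 node should absorb roughly half of the 10,000 keys and the capacity-1 nodes roughly a quarter each.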
- Virtual nodes have real costs. Drawbacks enumerated: "when a specific data object becomes extremely popular, consistent hashing will still send all the requests for the popular data object to the same subset of nodes"; capacity planning is trickier with virtual nodes; memory and operational complexity rise from BST maintenance; replication is harder — logic is needed to identify distinct physical nodes; downtime of a virtual node affects multiple nodes on the ring (because one physical failure removes many ring positions simultaneously).
- Multi-probe consistent hashing — linear space, no virtual nodes (Appleton + O'Reilly, Google, 2015). "Linear O(n) space complexity to store the positions of nodes on the hash ring. There are no virtual nodes but a node is assigned only a single position on the hash ring. The amortized time complexity for the addition and removal of nodes is constant O(1). However, the key (data object) lookups are relatively slower." Mechanism: hash the key multiple times via distinct hash functions on lookup; closest node clockwise wins.
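A hedged sketch of that lookup rule: each node keeps a single ring position, and the key is hashed several times at lookup, with the probe landing closest (clockwise) to a node winning. The probe count and the hash-by-suffix trick are illustrative choices, not the paper's exact construction.

```python
import bisect
import hashlib

RING = 1 << 64

def position(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class MultiProbeRing:
    def __init__(self, node_ips, probes=21):
        self.probes = probes  # illustrative; the paper tunes this for peak-to-average load
        pairs = sorted((position(ip), ip) for ip in node_ips)
        self._positions = [p for p, _ in pairs]
        self._nodes = [n for _, n in pairs]

    def node_for(self, key: str) -> str:
        best_node, best_gap = None, RING
        for i in range(self.probes):
            pos = position(f"{key}/probe-{i}")  # i-th "distinct hash function"
            idx = bisect.bisect_right(self._positions, pos) % len(self._positions)
            gap = (self._positions[idx] - pos) % RING  # clockwise distance to owner
            if gap < best_gap:
                best_node, best_gap = self._nodes[idx], gap
        return best_node

ring = MultiProbeRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
owner = ring.node_for("user:42")
```

This makes the space/time trade visible: n ring positions total (no virtual-node blow-up), but every lookup pays for `probes` hashes and binary searches.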
- Consistent hashing with bounded loads — fallback-node list on overload (Mirrokni + Thorup + Zadimoghaddam, Google, 2017). Caps load per node relative to cluster average. When a node is overloaded, requests spill to deterministic fallback nodes (same hash → same fallback list), preserving cache-warmth for popular keys. Solves the "popular data object sends all requests to the same subset of nodes" drawback the post named earlier. Deployed in HAProxy for load balancing.
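A sketch of the bounded-load rule under stated assumptions: the cap is `ceil((1 + ε) × average load)` and overflow walks clockwise to deterministic fallbacks. ε, the node names, and the in-memory load counter are all illustrative.

```python
import bisect
import hashlib
import math

def position(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class BoundedLoadRing:
    def __init__(self, node_ips, epsilon=0.25):
        self.epsilon = epsilon  # cap = ceil((1 + epsilon) * average load)
        pairs = sorted((position(ip), ip) for ip in node_ips)
        self._positions = [p for p, _ in pairs]
        self._nodes = [n for _, n in pairs]
        self.load = {n: 0 for n in self._nodes}

    def assign(self, key: str) -> str:
        total = sum(self.load.values()) + 1  # include the request being placed
        cap = math.ceil((1 + self.epsilon) * total / len(self._nodes))
        idx = bisect.bisect_right(self._positions, position(key)) % len(self._nodes)
        # Walk clockwise past full nodes. The walk is a pure function of the
        # key, so an overloaded hot key always spills to the SAME fallback
        # nodes, which is what keeps their caches warm.
        for step in range(len(self._nodes)):
            node = self._nodes[(idx + step) % len(self._nodes)]
            if self.load[node] < cap:
                self.load[node] += 1
                return node
        raise RuntimeError("unreachable: some node is always under the cap")

ring = BoundedLoadRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"], epsilon=0.25)
placements = [ring.assign("very-hot-key") for _ in range(100)]
```

Plain consistent hashing would send all 100 requests for the hot key to one node; here the cap forces spill-over, so the hot key ends up served by a small, stable set of nodes.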
- Production deployments, in the post's own words. The five named examples are broadly cross-cited elsewhere on this wiki: Discord server-to-node mapping (Elixir chat rooms); Amazon DynamoDB + Apache Cassandra + Riak (dynamic data-set partitioning, Dynamo paper lineage); Vimeo video load-balancing (Engineering Blog, 2016 — their bounded-load publication); Netflix Open Connect CDN content distribution; Memcached clients (Ketama) + HAProxy bounded-load LB + out-of-the-box client support.
Architectural details¶
Complexity table (copied verbatim):
| Operation | Time complexity | Description |
|---|---|---|
| Add a node | O(k/n + log n) | O(k/n) for redistribution of keys, O(log n) for BST traversal |
| Remove a node | O(k/n + log n) | O(k/n) for redistribution, O(log n) for BST traversal |
| Add a key | O(log n) | O(log n) for BST traversal |
| Remove a key | O(log n) | O(log n) for BST traversal |
Storage topology: BST on centralised HA service or BST replicated on each node with gossip sync. Dynamo's choice is the latter — peer-to-peer membership without a central coordinator.
Concurrency: readers-writer lock around the BST when multiple membership changes land at once.
Failure semantics:
- Node crash → data moves to the immediate clockwise neighbour. Other nodes untouched.
- Node add → the new node pulls keys from its immediate clockwise neighbour (those that fall in its new range).
- Replication to adjacent nodes (ring neighbours) minimises movement on crash and restores durability without full re-shuffle.
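The crash semantics above can be checked mechanically: removing one node should reassign only that node's keys, and all of them should land on its immediate clockwise neighbour. A small self-contained experiment (SHA-256 positions; node and key names are illustrative):

```python
import bisect
import hashlib

def position(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def owner(key: str, nodes: list[str]) -> str:
    pairs = sorted((position(n), n) for n in nodes)
    positions = [p for p, _ in pairs]
    idx = bisect.bisect_right(positions, position(key)) % len(pairs)
    return pairs[idx][1]

nodes = [f"node-{i}" for i in range(10)]
keys = [f"key-{i}" for i in range(10_000)]

before = {k: owner(k, nodes) for k in keys}
survivors = [n for n in nodes if n != "node-3"]  # node-3 crashes
after = {k: owner(k, survivors) for k in keys}

# only node-3's keys move, and they all go to one clockwise neighbour
moved = [k for k in keys if before[k] != after[k]]
```

Every key whose owner was not the crashed node keeps its owner, which is exactly the "other nodes untouched" property.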
Numbers disclosed¶
- Average keys per node: k/N where k = total keys, N = nodes.
- Worked example ring size: 2¹⁰ = 1024 positions.
- Multi-probe: linear O(n) space, amortized O(1) add/remove, slower lookup (multiple hashes).
- Consistent-hashing BST: O(log n) lookup / insert / delete.
- Static hash: O(1) lookup, but unbounded movement on resize.
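The contrast between the last two bullets can be demonstrated numerically: growing from 10 to 11 nodes remaps most keys under `hash(k) mod N` but only the new node's arc under ring lookup. A rough, illustrative measurement (not a benchmark):

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def ring_owner(key: str, n_nodes: int) -> int:
    pairs = sorted((h(f"node-{i}"), i) for i in range(n_nodes))
    positions = [p for p, _ in pairs]
    idx = bisect.bisect_right(positions, h(key)) % n_nodes
    return pairs[idx][1]

keys = [f"key-{i}" for i in range(10_000)]

# static hash(k) mod N: resizing 10 -> 11 remaps ~10/11 of all keys
moved_static = sum(1 for k in keys if h(k) % 10 != h(k) % 11)

# consistent hashing: the same resize moves only keys in the new node's arc
moved_ring = sum(1 for k in keys if ring_owner(k, 10) != ring_owner(k, 11))
```

`moved_static` comes out near 90% of the keys, while `moved_ring` stays near the k/N average the post derives.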
Numbers NOT disclosed¶
- No specific k/N → concrete movement number for any named production deployment.
- No cache hit-rate numbers under virtual-node vs no-virtual-node load distribution.
- No hotspot frequency data — only the qualitative "chance that nodes are not uniformly distributed."
- No fallback-depth numbers for bounded-load consistent hashing (how deep a fallback list is in practice).
- No virtual-node multiplier guidance (how many positions per physical node the named deployments use — Dynamo paper says ~100-500 but the post doesn't cite).
- No comparison of MurmurHash vs xxHash vs MetroHash vs SipHash1-3 throughput under consistent-hashing workloads.
Caveats¶
- Textbook-explainer voice, not production retrospective. The post is a primer republished from systemdesign.one; it's aimed at system-design interviewees, not SREs running the thing. Named deployments are cited but not quantified.
- Static-hash-partitioning rejection is load-bearing but incomplete. Static `hash(k) mod N` also loses its guarantees when N changes, true — but rendezvous hashing (HRW) and jump consistent hashing are alternative non-ring consistent-hashing families the post doesn't mention. Rendezvous is commonly cited alongside ring consistent hashing today; its omission is a notable gap.
- Virtual-node drawbacks list reads as "five reasons to be skeptical," but the industry-wide verdict is that virtual nodes are worth it in almost every deployment. The post is technically accurate; the rhetorical framing slightly undersells the payoff.
- No threat model / security-posture discussion. MurmurHash / xxHash / MetroHash are non-cryptographic hashes — they work fine for partitioning but are defensible-against-adversarial-skew only when the cluster boundary is trusted. The post doesn't warn that user-controlled keys + non-cryptographic hash = steerable hot-spotting.
- The Netflix Open Connect citation is thin. The linked 2017 post is about CDN content distribution, where consistent hashing is one of many primitives; describing Open Connect as "uses consistent hashing" is accurate but not the system's architectural centre of gravity.
- Ring-size guidance is missing. The worked 1024-position example is pedagogically useful but will collision-bomb a real production ring. Real deployments use 32-bit or 64-bit ring address spaces. Post doesn't surface this.
- Discord example is phrased as static server-to-node mapping. Discord's scaling story (ingested on this wiki via the [Fly.io corrosion post's BEAM/Elixir citation chain]) is about Guilded Pods / Actor-per-Guild. Consistent hashing is the routing primitive in front of the Actor layer. The post reduces this to a single figure.
Relationship to existing wiki¶
This post is the canonical distributed-systems primer for the concepts/consistent-hashing concept page. Prior to this ingest, concepts/consistent-hashing was skewed toward a narrow application — percentage rollouts in feature flags via the Cloudflare Flagship source (2026-04-17). This source provides the original sharding / caching framing from which the feature-flag application is derived: the same "adding a bucket keeps most keys where they were" property that makes a 5 % → 10 % rollout monotonic is what keeps a cache warm across a cluster resize.
Extends on the wiki:
- concepts/consistent-hashing — adds the canonical "here's why static hash mod N fails and here's how ring + BST + virtual-nodes composes to fix it" framing + the two post-Karger-1997 variants (multi-probe, bounded-load) + the full Dynamo / Cassandra / Riak / Memcached-Ketama / Netflix Open Connect / Discord / Vimeo / HAProxy production lineage.
- concepts/horizontal-sharding — canonical citation for "one way to assign keys to shards is consistent hashing"; the Figma `UserID / FileID / OrgID` colos are a shard-key-first variant that deliberately does not use consistent hashing at the application layer, because the colos are stable. Useful contrast.
- concepts/gossip-protocol — Dynamo's gossip-on-BST-membership is one of the canonical use-shapes for gossip; this post is the Dynamo-side citation that pairs with the Fly.io Corrosion gossip post from the link-state-routing angle.
- systems/dynamodb — DynamoDB inherits consistent hashing from its Dynamo paper ancestor; Seen-in note added.
Creates:
- companies/highscalability — first per-company page for High Scalability, a Tier-1 meta-aggregator blog (RSS) that republishes distilled system-design content from across the industry.
- concepts/hash-ring — the ring structure as a first-class wiki concept, pulled out of the consistent-hashing page for reuse.
- concepts/hotspot — performance-degraded node taking disproportionate traffic; named here but applicable across caching, DB partitioning, and LB.
- patterns/virtual-nodes-for-load-balancing — one physical node → many ring positions; the canonical corrective to raw consistent-hashing's non-uniformity.
- patterns/multi-probe-consistent-hashing — Google 2015 paper's O(n) space + no-virtual-nodes variant.
- patterns/bounded-load-consistent-hashing — Google 2017 paper's fallback-list spillover variant.
- systems/libketama — last.fm's Memcached-client consistent-hashing library (2007); one of the earliest open-source productisations.
- systems/amazon-dynamo — the 2007 Dynamo paper (DeCandia et al., SOSP 2007); distinct from the commercial product DynamoDB; canonical wiki anchor for the paper itself.
Source¶
- Original: https://highscalability.com/consistent-hashing-algorithm/
- Raw markdown:
raw/highscalability/2023-02-22-consistent-hashing-algorithm-b5f0a708.md
Related¶
- concepts/consistent-hashing — the concept page; this post is its canonical distributed-systems primer.
- concepts/hash-ring — the ring topology structure.
- concepts/hotspot — the performance-degradation mode virtual nodes + bounded-load solve.
- concepts/gossip-protocol — Dynamo's BST-membership sync mechanism.
- concepts/horizontal-sharding — the broader problem class.
- patterns/virtual-nodes-for-load-balancing — canonical corrective for non-uniform ring load.
- patterns/multi-probe-consistent-hashing — Google 2015 variant.
- patterns/bounded-load-consistent-hashing — Google 2017 variant.
- systems/libketama — last.fm's Memcached-client consistent-hashing library.
- systems/amazon-dynamo — the 2007 Dynamo paper; canonical production origin.
- systems/dynamodb — commercial descendant; inherits partitioning via consistent hashing.
- companies/highscalability — source publication.