
HIGHSCALABILITY 2023-02-22 Tier 1


High Scalability — Consistent hashing algorithm

Summary

A canonical textbook-style explainer of consistent hashing as a cache / storage partitioning primitive, republished by High Scalability from systemdesign.one. The post walks through the partitioning problem space (random / single global cache / key-range / static hash / consistent hashing), demonstrates why the first four don't scale under dynamic load, then derives consistent hashing and its operational properties: an average of k/N keys moved when N changes, O(log n) lookup via a self-balancing BST, virtual nodes to smooth load, and two popular optimisations — multi-probe (Google, 2015) and bounded-load (Google, 2017). Closes with real-world deployments: Memcached clients (Ketama), Amazon Dynamo / DynamoDB / Apache Cassandra / Riak, Vimeo load-balancing, Netflix Open Connect CDN, Discord server-to-node mapping, and HAProxy bounded-load LB.

Key takeaways

  1. The partitioning problem has five canonical candidates — four of them don't scale. Random assignment yields uniform distribution but clients can't locate keys; single global cache collapses availability; key-range partitioning produces non-uniform loads; static hash(k) mod N is O(1) but "the addition or removal of a node breaks the mapping between keys and nodes … massive data movement when the number of nodes changes" — on failure most requests become cache misses, which "might swamp and degrade" the origin server. Consistent hashing is the one that does scale.

  2. The mechanism in four operations (condensed from the post's numbered list): (i) the hash function's output range is mapped onto a circular space — the hash ring; (ii) each node's IP is hashed to fix the node's position on the ring; (iii) each key is hashed with the same function; (iv) the ring is traversed clockwise from the key's position until a node is found. "The first node with a position value greater than the position of the key stores the data object." Each node owns the ring arc between itself and its predecessor.

  3. The operational property that earns its name — k/N movement on membership change. "The deletion or addition of a node results in the movement of an average number of keys stored on a single node"; remaining nodes on the ring are unaffected. Average keys per node = k/N (k = total keys, N = nodes). This is the load-bearing property behind every consistent-hashing deployment the post names.

  4. Self-balancing BST as the canonical implementation, with centralized vs gossip topology choice. Node positions live in a BST: O(log n) for lookup / insert / delete. "The BST data structure is stored on a centralized highly available service. As an alternative, the BST data structure is stored on each node, and the state information between the nodes is synchronized through the gossip protocol" — Dynamo's choice. Concurrency handled with a readers-writer lock, "at the expense of a slight increase in latency."

  5. Hash-function choice is fast-and-uniform, not cryptographic. Cryptographic hashes such as MD5 / SHA-1 / SHA-256 are comparatively slow; MurmurHash, xxHash, MetroHash, and SipHash1-3 are preferred. The hash-ring address space must be large enough to keep collisions unlikely ("reasonable size"); the worked example uses a 2¹⁰ = 1024-position ring.

  6. Virtual nodes fix hot-spot risk, not just load balance. "There is a chance that nodes are not uniformly distributed on the consistent hash ring. The nodes that receive a huge amount of traffic become hotspots resulting in cascading failure of the nodes." Virtual nodes — assigning each physical node multiple ring positions via distinct hash functions — smooth key distribution. Capacity-proportional assignment handles heterogeneity: "the nodes with a higher capacity are assigned more positions on the hash ring."

  7. Virtual nodes have real costs. Drawbacks enumerated: "when a specific data object becomes extremely popular, consistent hashing will still send all the requests for the popular data object to the same subset of nodes"; capacity planning is trickier with virtual nodes; memory + operational complexity rises from BST maintenance; replication is harder — logic needed to identify distinct physical nodes; downtime of a virtual node affects multiple nodes on the ring (because one physical failure removes many ring positions simultaneously).

  8. Multi-probe consistent hashing — linear space, no virtual nodes (Appleton + O'Reilly, Google, 2015). "Linear O(n) space complexity to store the positions of nodes on the hash ring." There are no virtual nodes; each node is assigned exactly one position on the ring. "The amortized time complexity for the addition and removal of nodes is constant O(1). However, the key (data object) lookups are relatively slower." Mechanism: the key is hashed multiple times via distinct hash functions on lookup; of all probes, the one that lands closest clockwise to a node determines the owner.

  9. Consistent hashing with bounded loads — fallback-node list on overload (Mirrokni + Thorup + Zadimoghaddam, Google, 2017). Caps load per node relative to cluster average. When a node is overloaded, requests spill to deterministic fallback nodes (same hash → same fallback list), preserving cache-warmth for popular keys. Solves the "popular data object sends all requests to the same subset of nodes" drawback the post named earlier. Deployed in HAProxy for load balancing.

  10. Production deployments, in the post's own words. The five named examples are broadly cross-cited elsewhere on this wiki: Discord server-to-node mapping (Elixir chat rooms); Amazon DynamoDB + Apache Cassandra + Riak (dynamic data-set partitioning, Dynamo paper lineage); Vimeo video load-balancing (Engineering Blog, 2016 — their bounded-load publication); Netflix Open Connect CDN content distribution; Memcached clients (Ketama) + HAProxy bounded-load LB + out-of-the-box client support.
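
Takeaways 2 and 6 compose into a short sketch. A sorted list with `bisect` stands in for the post's self-balancing BST (same O(log n) lookup), suffixed labels stand in for the post's "distinct hash functions" for virtual nodes, and MD5 is used only because it ships with the standard library — the post would recommend a MurmurHash/xxHash-class function instead. All names are illustrative:

```python
import bisect
import hashlib

def ring_pos(label: str) -> int:
    # Stand-in hash: fast non-cryptographic hashes are preferred in
    # production; MD5 is used here only for a deterministic demo.
    d = hashlib.md5(label.encode()).digest()
    return int.from_bytes(d[:8], "big")

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Step (ii) plus takeaway 6: each physical node is hashed onto
        # the ring `vnodes` times via suffixed labels, smoothing load.
        self.ring = sorted(
            (ring_pos(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # Steps (iii)-(iv): hash the key with the same function, then
        # take the first node position clockwise (i.e. greater than the
        # key's position), wrapping past the top of the ring.
        i = bisect.bisect_right(self.points, ring_pos(key))
        return self.ring[i % len(self.ring)][1]
```

`HashRing(["10.0.0.1", "10.0.0.2"]).lookup("user:42")` returns the owning node, and the same key always resolves to the same node until ring membership changes.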

Architectural details

Complexity table (copied verbatim):

| Operation     | Time complexity | Description                                              |
|---------------|-----------------|----------------------------------------------------------|
| Add a node    | O(k/n + log n)  | O(k/n) for redistribution of keys, O(log n) for BST traversal |
| Remove a node | O(k/n + log n)  | O(k/n) for redistribution, O(log n) for BST traversal    |
| Add a key     | O(log n)        | O(log n) for BST traversal                               |
| Remove a key  | O(log n)        | O(log n) for BST traversal                               |
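
The add/remove rows can be sketched over the same sorted-ring idea (hypothetical names; note that `bisect.insort` into a Python list is O(n) on element shifts — a real implementation uses a balanced tree to get the table's O(log n)). The O(k/n) term is the rehoming of keys in the affected arc, which this sketch only notes in comments:

```python
import bisect
import hashlib

def pos(label: str) -> int:
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes):
        self.points = sorted(pos(n) for n in nodes)   # ring positions
        self.by_pos = {pos(n): n for n in nodes}

    def add_node(self, node: str) -> None:
        # Ordered insert; the new node would then pull ~k/n keys from
        # its clockwise successor (the O(k/n) term in the table).
        p = pos(node)
        bisect.insort(self.points, p)
        self.by_pos[p] = node

    def remove_node(self, node: str) -> None:
        # Locate + delete; the departed node's keys rehome to its
        # clockwise successor. Other nodes are untouched.
        p = pos(node)
        self.points.pop(bisect.bisect_left(self.points, p))
        del self.by_pos[p]

    def lookup(self, key: str) -> str:
        i = bisect.bisect_right(self.points, pos(key)) % len(self.points)
        return self.by_pos[self.points[i]]
```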

Storage topology: BST on centralised HA service or BST replicated on each node with gossip sync. Dynamo's choice is the latter — peer-to-peer membership without a central coordinator.

Concurrency: readers-writer lock around the BST when multiple membership changes land at once.

Failure semantics:

  • Node crash → data moves to the immediate clockwise neighbour. Other nodes untouched.
  • Node add → the new node pulls keys from its immediate clockwise neighbour (those that fall in its new range).
  • Replication to adjacent nodes (ring neighbours) minimises movement on crash and restores durability without full re-shuffle.
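
These failure semantics can be checked with a quick simulation (hypothetical node/key names; MD5 as a deterministic stand-in hash): adding one node rehomes only the keys in its new arc — on average k/N of the total — and every rehomed key lands on the new node, leaving all other arcs untouched.

```python
import hashlib

def pos(label: str) -> int:
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

def owner(key: str, nodes: list[str]) -> str:
    # First node clockwise from the key, wrapping to the smallest position.
    kp = pos(key)
    ordered = sorted(nodes, key=pos)
    return next((n for n in ordered if pos(n) > kp), ordered[0])

nodes = [f"node-{i}" for i in range(10)]           # N = 10
keys = [f"key-{i}" for i in range(10_000)]         # k = 10,000

before = {k: owner(k, nodes) for k in keys}
after = {k: owner(k, nodes + ["node-new"]) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every moved key lands on the new node; all other ownership is stable.
assert all(after[k] == "node-new" for k in moved)
print(f"{len(moved) / len(keys):.1%} of keys moved")
```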

Numbers disclosed

  • Average keys per node: k/N where k = total keys, N = nodes.
  • Worked example ring size: 2¹⁰ = 1024 positions.
  • Multi-probe: linear O(n) space, amortized O(1) add/remove, slower lookup (multiple hashes).
  • Consistent-hashing BST: O(log n) lookup / insert / delete.
  • Static hash: O(1) lookup, but unbounded movement on resize.
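
The multi-probe trade-off above (linear space, slower lookup) can be sketched as follows. Seeded MD5 stands in for the paper's distinct probe hash functions, and the probe count is illustrative: each node holds exactly one ring position, and lookup hashes the key several times, keeping the probe that lands closest clockwise to a node.

```python
import bisect
import hashlib

RING = 2 ** 64

def probe(label: str, seed: int) -> int:
    # Seeded stand-in for a family of distinct probe hash functions.
    d = hashlib.md5(f"{seed}:{label}".encode()).digest()
    return int.from_bytes(d[:8], "big") % RING

class MultiProbeRing:
    def __init__(self, nodes: list[str], probes: int = 21):
        # One position per node: O(n) space, no virtual nodes.
        self.ring = sorted((probe(n, 0), n) for n in nodes)
        self.points = [p for p, _ in self.ring]
        self.probes = probes

    def lookup(self, key: str) -> str:
        # Hash the key `probes` times; keep the probe whose clockwise
        # distance to the next node is smallest. More probes smooth
        # load at the cost of slower lookups.
        best_dist, best_node = None, None
        for seed in range(self.probes):
            p = probe(key, seed)
            i = bisect.bisect_right(self.points, p) % len(self.points)
            dist = (self.points[i] - p) % RING
            if best_dist is None or dist < best_dist:
                best_dist, best_node = dist, self.ring[i][1]
        return best_node
```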

Numbers NOT disclosed

  • No specific k/N → concrete movement number for any named production deployment.
  • No cache hit-rate numbers under virtual-node vs no-virtual-node load distribution.
  • No hotspot frequency data — only the qualitative "chance that nodes are not uniformly distributed."
  • No fallback-depth numbers for bounded-load consistent hashing (how deep a fallback list is in practice).
  • No virtual-node multiplier guidance (how many positions per physical node the named deployments use — Dynamo paper says ~100-500 but the post doesn't cite).
  • No comparison of MurmurHash vs xxHash vs MetroHash vs SipHash1-3 throughput under consistent-hashing workloads.
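
Though the post gives no fallback-depth numbers, the bounded-load mechanics from takeaway 9 can be sketched: cap each node at roughly (1 + ε) times the cluster-average load and spill clockwise past full nodes. The ε value, the names, and passing in the total key count up front are simplifications of the paper's scheme:

```python
import bisect
import hashlib
import math

def pos(label: str) -> int:
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

class BoundedLoadRing:
    def __init__(self, nodes: list[str], eps: float = 0.25):
        self.ring = sorted((pos(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]
        self.load = {n: 0 for n in nodes}
        self.eps = eps

    def assign(self, key: str, total_keys: int) -> str:
        # Cap each node at (1 + eps) times the cluster average.
        cap = math.ceil((1 + self.eps) * total_keys / len(self.load))
        i = bisect.bisect_right(self.points, pos(key)) % len(self.points)
        # The same key always walks the same clockwise fallback list,
        # so a spilled popular key still hits a stable, cache-warm node.
        while self.load[self.ring[i][1]] >= cap:
            i = (i + 1) % len(self.points)
        node = self.ring[i][1]
        self.load[node] += 1
        return node
```

Because the fallback walk is deterministic per key, overload relief does not scatter a hot key across the whole cluster — the property that makes this attractive for HAProxy-style load balancing.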

Caveats

  • Textbook-explainer voice, not production retrospective. The post is a primer republished from systemdesign.one; it's aimed at system-design interviewees, not SREs running the thing. Named deployments are cited but not quantified.
  • Static-hash-partitioning rejection is load-bearing but incomplete. Static hash(k) mod N also loses its guarantees when N changes, true — but rendezvous hashing (HRW) and jump consistent hashing are alternative non-ring consistent-hashing families the post doesn't mention. Rendezvous is commonly cited alongside ring consistent hashing today; its omission is a notable gap.
  • Virtual-node drawbacks list reads as "five reasons to be skeptical" but the industry-wide verdict is that virtual nodes are worth it in almost every deployment. The post is technically accurate; the rhetorical framing slightly undersells the payoff.
  • No threat model / security-posture discussion. MurmurHash / xxHash / MetroHash are non-cryptographic hashes — they work fine for partitioning but are defensible-against-adversarial-skew only when the cluster boundary is trusted. The post doesn't warn that user-controlled keys + non-cryptographic hash = steerable hot-spotting.
  • The Netflix Open Connect citation is thin. The linked 2017 post is about CDN content distribution, where consistent hashing is one of many primitives; describing Open Connect as "uses consistent hashing" is accurate but not the system's architectural centre of gravity.
  • Ring-size guidance is missing. The worked 1024-position example is pedagogically useful but will collision-bomb a real production ring. Real deployments use 32-bit or 64-bit ring address spaces. Post doesn't surface this.
  • Discord example is phrased as static server-to-node mapping. Discord's scaling story (ingested on this wiki via the [Fly.io corrosion post's BEAM/Elixir citation chain]) is about Guilded Pods / Actor-per-Guild. Consistent hashing is the routing primitive in front of the Actor layer. The post reduces this to a single figure.

Relationship to existing wiki

This post is the canonical distributed-systems primer for the concepts/consistent-hashing concept page. Prior to this ingest, concepts/consistent-hashing was skewed toward a narrow application — percentage rollouts in feature flags via the Cloudflare Flagship source (2026-04-17). This source provides the original sharding / caching framing from which the feature-flag application is derived: the same "adding a bucket keeps most keys where they were" property that makes a 5 % → 10 % rollout monotonic is what keeps a cache warm across a cluster resize.
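
The rollout-monotonicity property is easy to demonstrate with stable bucketing (hypothetical names; MD5 as a stand-in for a fast hash): a user inside the 5 % bucket stays inside the 10 % bucket, because only the threshold moves, never the user's hash.

```python
import hashlib

def bucket(user_id: str) -> int:
    # Stable 0-99 bucket per user — the same "positions don't move"
    # property that a hash ring gives cache keys.
    d = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(d[:4], "big") % 100

users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if bucket(u) < 5}
at_10 = {u for u in users if bucket(u) < 10}

# Widening the rollout only adds users; nobody flips back out.
assert at_5 <= at_10
```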

Extends on the wiki:

  • concepts/consistent-hashing — adds the canonical "here's why static hash mod N fails and here's how ring + BST + virtual-nodes composes to fix it" framing + the two post-Karger-1997 variants (multi-probe, bounded-load) + the full Dynamo / Cassandra / Riak / Memcached-Ketama / Netflix Open Connect / Discord / Vimeo / HAProxy production lineage.
  • concepts/horizontal-sharding — canonical citation for "one way to assign keys to shards is consistent hashing"; the Figma UserID / FileID / OrgID colos are a shard-key-first variant that deliberately does not use consistent hashing at the application layer, because the colos are stable. Useful contrast.
  • concepts/gossip-protocol — Dynamo's gossip-on-BST-membership is one of the canonical use-shapes for gossip; this post is the Dynamo-side citation that pairs with the Fly.io Corrosion gossip post from the link-state-routing angle.
  • systems/dynamodb — DynamoDB inherits consistent hashing from its Dynamo paper ancestor; Seen-in note added.

Creates:

Source
