Hash sharding

Hash sharding routes each row to a shard by hash(shard_key): a deterministic hash function maps every shard-key value into a fixed output range, and each shard owns a sub-range of hash outputs. It is the default production sharding strategy in most sharded relational systems (including the standard Vitess hash Vindex configuration) and one of the four sharding strategies enumerated by Ben Dicken (Source: sources/2026-04-21-planetscale-database-sharding).

Core property: even distribution from dissimilar outputs

Dicken's framing:

"The nice thing about hashes is that similar inputs can produce very different outputs. We might pass in the name joseph and get hash 45, but the name josephine produces 28. Similar names, completely different hashes. This means similar values may end up on totally different servers, a good property to help the data get evenly spread out." (Source: sources/2026-04-21-planetscale-database-sharding)

Hash sharding inherits distribution uniformity from the hash function — a property that holds regardless of whether the underlying shard-key values are uniformly distributed, monotonically increasing, or skewed. This is why hash sharding is the default production choice: it doesn't require the operator to know the value distribution in advance, and it stays balanced as the distribution drifts.
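A minimal sketch of the routing step, assuming a fixed shard count and modulo assignment (the function name `shard_for` is illustrative; production routers use a fast non-cryptographic mixer rather than SHA-256, as discussed below):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(shard_key: str) -> int:
    """Deterministically map a shard-key value to a shard.

    Hash the key into a fixed 64-bit output range, then assign the
    shard that owns that slice of outputs (modulo, for simplicity).
    """
    digest = hashlib.sha256(shard_key.encode()).digest()
    h = int.from_bytes(digest[:8], "big")  # take 64 bits of output
    return h % NUM_SHARDS

# Similar inputs can land on unrelated shards: the hash output for
# "joseph" says nothing about the output for "josephine".
print(shard_for("joseph"), shard_for("josephine"))
```

The same property means no coordinator state is needed: any router that knows the hash function and the shard count computes the same answer.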

Core cost: scatter-gather on shard-key range scans

The flip side: WHERE shard_key BETWEEN a AND b fans out to every shard under hash sharding, because sequential shard-key values hash to unrelated shards, so a range scan can't be narrowed to a subset. The query becomes scatter-gather / cross-shard. This is acceptable when shard-key range scans are rare (the common OLTP case) and problematic when they dominate (analytics, time-series, sequential batch iteration).
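To see why the router can't prune a shard-key range scan, map a run of sequential IDs through a routing function (a sketch; the shard count and SHA-256-based hash are illustrative):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: int) -> int:
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# WHERE id BETWEEN 1000 AND 1063: the 64 consecutive ids scatter
# across (almost certainly) every shard, so the query must fan out.
touched = {shard_for(i) for i in range(1000, 1064)}
print(sorted(touched))  # typically every shard appears
```

Under range sharding the same predicate would touch only the shards whose ranges overlap [a, b]; under hash sharding the set of touched shards is essentially the full fleet.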

Pairs with high-cardinality shard keys

Hash sharding distributes well only if the shard key has enough cardinality that the hash outputs cover the range evenly. Heavily skewed keys — e.g. a country_code column where 80% of rows are US — still concentrate: hash("US") is a single hash output, so every US row lands on one shard regardless of how many shards exist. Dicken's recommendation: "Often a column like user_id is a good choice because each value is unique. We also get the added benefit of hash speed. It's faster to hash a fixed-size integer as compared to a variable-width name string."
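The skew failure mode is easy to demonstrate: identical inputs always produce the identical output, so hashing cannot split a hot value (shard count and row mix below are illustrative):

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8

def shard_for(value: str) -> int:
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# 80% of rows share one country_code. hash("US") is one output,
# so all 8,000 US rows land on a single shard out of eight.
rows = ["US"] * 8000 + ["DE"] * 1000 + ["JP"] * 1000
load = Counter(shard_for(c) for c in rows)
print(load.most_common(1))  # the hottest shard carries all US rows
```

Swapping the shard key to a high-cardinality column like user_id dissolves the hot spot, because each row then contributes its own independent hash output.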

Relationship to consistent hashing

Consistent hashing is the specific family of hash-sharding algorithms where adding or removing a shard re-maps only about 1/N of the keys (rather than re-hashing the whole dataset). Production databases typically use consistent hashing as the hash-sharding implementation so that resharding is a small-delta operation, not a full rewrite. Range-sharding-on-hash-output (each shard owns a contiguous hash range) is a common implementation shape, because it composes with keyspace_id-style opaque shard addressing.
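A minimal consistent-hash ring using the stdlib, assuming hypothetical shard names and a virtual-node count chosen for illustration — the point is that growing the fleet from three shards to four re-maps only roughly a quarter of the keys:

```python
import bisect
import hashlib

def h(s: str) -> int:
    """64-bit position on the ring for a string."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    """Each shard owns the arcs of hash space ending at its ring points.

    Adding a shard inserts new points and steals only the keys that
    fall on the new arcs; all other keys keep their old owner.
    """
    def __init__(self, shards, vnodes=64):
        self.points = sorted(
            (h(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )

    def shard_for(self, key: str) -> str:
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]  # wrap around the ring

before = Ring(["s0", "s1", "s2"])
after = Ring(["s0", "s1", "s2", "s3"])
keys = [f"user:{i}" for i in range(10000)]
moved = sum(before.shard_for(k) != after.shard_for(k) for k in keys)
print(moved / len(keys))  # ≈ 1/4: only the new shard's share re-maps
```

With naive `hash % N` routing, changing N from 3 to 4 would instead re-map roughly 3/4 of all keys, which is why the modulo form is a teaching device rather than a production design.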

Cryptographic vs non-cryptographic hash

Dicken's primer uses "cryptographic hash" for pedagogical simplicity, but production routing generally uses fast non-cryptographic mixers — xxhash, CityHash, MurmurHash3, or the Vitess-specific hash Vindex producing a binary(8) keyspace_id. Cryptographic hashes (SHA-256 etc.) are overkill for routing: the router doesn't need collision-resistance against adversaries, just good avalanche + speed. A ~1 GB/s per-core hash beats a ~100 MB/s cryptographic one, and routing sits on the hot path of every query.
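For concreteness, a non-cryptographic routing hash can be as small as 64-bit FNV-1a — a few cheap operations per byte. This is a sketch of the genre, not the Vitess algorithm (the hash Vindex uses a different function); the `keyspace_id` helper is a hypothetical illustration of binary(8)-style opaque addressing:

```python
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: xor each byte in, then multiply by the prime.

    No cryptographic machinery — routing needs speed and good
    bit-mixing (avalanche), not collision resistance.
    """
    h = FNV64_OFFSET
    for b in data:
        h = ((h ^ b) * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def keyspace_id(user_id: int) -> bytes:
    # binary(8)-style opaque routing id from a fixed-size integer key
    return fnv1a_64(user_id.to_bytes(8, "big")).to_bytes(8, "big")
```

A fixed-size integer key hashes in a handful of iterations, which is Dicken's point about preferring user_id over a variable-width string.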

Trade-off vs the other strategies

| Property | Hash | Range | Lookup |
| --- | --- | --- | --- |
| Even distribution | Yes (property of the hash) | No (requires a distribution prior) | Yes (operator assigns) |
| Range scans on shard key | Scatter-gather | Efficient | Scatter-gather |
| Routing cost | 1 hash | 1 range-tree lookup | 1 lookup-table read (extra hop) |
| Handles monotonic IDs | Good | Bad (frontier hotspot) | Good (operator places) |
| Reshard cost | Low (consistent hash) | Medium (range splits) | High (rewrite mapping table) |
