CONCEPT Cited by 2 sources
Hash sharding¶
Hash sharding routes each row to a shard by hash(shard_key) — a deterministic hash function maps every shard-key value into a fixed output range, and each shard owns a sub-range of hash outputs. The default production sharding strategy in most sharded relational systems (including the standard Vitess hash Vindex configuration); one of the four sharding strategies enumerated by Ben Dicken (Source: sources/2026-04-21-planetscale-database-sharding).
Core property: even distribution from dissimilar outputs¶
Dicken's framing:
"The nice thing about hashes is that similar inputs can produce very different outputs. We might pass in the name
josephand get hash45, but the namejosephineproduces 28. Similar names, completely different hashes. This means similar values may end up on totally different servers, a good property to help the data get evenly spread out." (Source: sources/2026-04-21-planetscale-database-sharding)
Hash sharding inherits distribution uniformity from the hash function — a property that holds regardless of whether the underlying shard-key values are uniformly distributed, monotonically increasing, or skewed. This is why hash sharding is the default production choice: it doesn't require the operator to know the value distribution in advance, and it stays balanced as the distribution drifts.
Core cost: scatter-gather on shard-key range scans¶
The flip side. WHERE shard_key BETWEEN a AND b fans out to every shard under hash sharding — sequential shard-key values hash to independent shards, so a range scan can't be narrowed to a subset. The query becomes scatter-gather / cross-shard. Acceptable when shard-key range scans are rare (the common OLTP case); problematic when they dominate (analytics, time-series, sequential batch iteration).
Pairs with high-cardinality shard keys¶
Hash sharding distributes well only if the shard key has enough cardinality that the hash outputs cover the range evenly. Very-skewed keys — e.g. a country_code column where 80% of rows are US — still concentrate: hash("US") is a single hash output, so every US row lands on one shard regardless of how many shards exist. Dicken's recommendation: "Often a column like user_id is a good choice because each value is unique. We also get the added benefit of hash speed. It's faster to hash a fixed-size integer as compared to a variable-width name string."
Relationship to consistent hashing¶
Consistent hashing is the specific family of hash-sharding algorithms where adding or removing a shard only re-maps 1/N of the keys (rather than re-hashing the whole dataset). Production databases almost always use consistent hashing as the hash-sharding implementation so that resharding is a small-delta operation, not a full rewrite. Range-sharding-on-hash-output (each shard owns a contiguous hash range) is a common implementation shape, because it composes with keyspace_id-style opaque shard addressing.
Cryptographic vs non-cryptographic hash¶
Dicken's primer uses "cryptographic hash" for pedagogical simplicity, but production routing generally uses fast non-cryptographic mixers — xxhash, CityHash, MurmurHash3, or the Vitess-specific hash Vindex producing a binary(8) keyspace_id. Cryptographic hashes (SHA-256 etc.) are overkill for routing: the router doesn't need collision-resistance against adversaries, just good avalanche + speed. A ~1 GB/s per-core hash beats a ~100 MB/s cryptographic one, and routing sits on the hot path of every query.
Trade-off vs the other strategies¶
| Property | Hash | Range | Lookup |
|---|---|---|---|
| Even distribution | Yes — property of the hash | No — requires distribution prior | Yes — operator assigns |
| Range scans on shard key | Scatter-gather | Efficient | Scatter-gather |
| Routing cost | 1 hash | 1 range-tree lookup | 1 lookup-table read (extra hop) |
| Handles monotonic IDs | Good | Bad (frontier hotspot) | Good (operator places) |
| Reshard cost | Low (consistent hash) | Medium (range splits) | High (rewrite mapping table) |
Seen in¶
- sources/2026-04-21-planetscale-database-sharding — canonical pedagogical treatment;
joseph/josephineframing of the dissimilar-outputs property; recommendation thatuser_idis the canonical hash-shard-key choice (high cardinality + fixed-width integer hash speed). - sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale — Figma's production choice: hash the shard key as the mitigation for monotonic-ID hotspots; trade-off is shard-key range-scan inefficiency.
- — Vitess's hash Primary Vindex is the implementation mechanism; the
keyspace_idoutput is the routing address. - — Holly Guevara (PlanetScale, 2024-07-08) canonicalises the PlanetScale-voice customer-consultation default: "At PlanetScale, we typically steer customers toward hash-based sharding as the default. It generally gives you the most even distribution and isn't overly complicated." Short-form sibling of Dicken's later post; also names "you're taking human guesswork out of the equation" + easier-to-reshard property (consistent-hashing substrate implicit).
- sources/2026-04-21-planetscale-dealing-with-large-tables — Ben Dicken (PlanetScale, 2024-07-10) canonicalises hash sharding as the default strategy for a newly-sharded table in the three-rung scaling-ladder post. The
exercise_logworked example illustrates thelog_idvsuser_idshard-key choice under hashing:hash(log_id)distributes evenly but scatters per-user reads;hash(user_id)routes per-user queries to one shard. Canonical wiki datum: hash sharding's even-distribution property is necessary but not sufficient — the shard-key choice interacts with the dominant-query pattern. Also names Vitess's shard-range naming convention-40 / 40-80 / 80-c0 / c0-as the concrete implementation of hash-range-per-shard partitioning. -
— Savannah Longoria (PlanetScale, 2022-12-14) canonicalises
xxhashas the named Vitess hash Vindex function in a production VSchema: "we definedxxhashas the Vindex or specific sharding function." First wiki source naming a concrete non-cryptographic mixer used on the Vitess routing hot path — empirically consistent with the cryptographic-vs-non-cryptographic discussion above. Canonical production VSchema excerpt:"xxhash": {"type": "xxhash"}in the keyspace-levelvindexesblock + referenced by each sharded table'scolumnVindexes. Also the canonical wiki example of hash sharding applied to an externally-shard-addressed workload: Temporal pre-computesshard_id/range_hashfor every row before writing; Vitess'sxxhashVindex hashes that column (not raw tenant ID), preserving Temporal's own shard-ownership discipline via the storage layer. -
— Justin Gage (guest post, 2023-04-06) provides a broad-audience framing of hash sharding as "key-based sharding" ("Hash based sharding (also known as key based)") and names the canonical worked example of the Amazon-orders-no-meaningful-clustering case: "It might make sense to use hash based sharding, and use the order ID as the shard key." Strongest signal for the "hash is the general-purpose default when your load doesn't have natural clustering" recommendation. Gage flattens the distinction between hash sharding and consistent hashing — using a cryptographic hash as the pedagogical example where Vitess-era production would use xxhash — so readers need to cross-reference the other wiki sources to get the cryptographic-vs-non-cryptographic performance story.