PATTERN Cited by 3 sources
Shard key aligned with query pattern¶
Pattern¶
Choose the shard key so that the dominant query's predicate contains the shard key. The most common query then routes to exactly one shard; every other query becomes cross-shard (and the system either accepts that cost or compensates for it with lookup Vindexes, OLAP replicas, or application-level fan-out).
This is the third of the three canonical shard-key selection criteria — after cardinality and volatility. It's the one that connects the data model to the workload: sharding is not a data-modelling problem, it's a query-routing problem solved with a data-model lever.
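The routing consequence can be sketched in a few lines. This is a minimal illustration, not any system's actual router: the shard count, the md5-based hash, and the `user_id` / `step_count` column names are assumptions chosen to mirror the worked examples below.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str) -> int:
    """Route a shard-key value to exactly one shard via a stable hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS


def route(predicate: dict) -> list[int]:
    """A query whose predicate contains the shard key routes to one shard;
    any other query must scatter-gather across every shard."""
    if "user_id" in predicate:                    # shard key present
        return [shard_for(predicate["user_id"])]  # single-shard query
    return list(range(NUM_SHARDS))                # cross-shard fan-out


# The dominant query ("all rows for user X") hits one shard;
# a predicate without the shard key touches all of them.
single = route({"user_id": "u42"})
scatter = route({"step_count": 10_000})
```

The asymmetry is the whole pattern: the router can only prune shards when the predicate supplies the shard key.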
When to apply¶
Every horizontally-sharded table needs a shard key. The pattern applies to every one of those decisions. The default choice should be driven by the question "what's the dominant query against this table, and does its predicate include a column that also satisfies the cardinality + volatility criteria?" — not by "what column is the primary key" or "what column is the most unique."
Dicken's canonical formulation¶
Ben Dicken's two posts canonicalise the pattern through paired worked examples:
On a hypothetical steps table sharded by step_count (a low-cardinality column) — range-scan queries for "all steps for user X" scatter across every shard, because step_count has no correlation with user_id. Re-sharding the same table by user_id collapses the dominant query to a single shard (Source: sources/2026-04-21-planetscale-database-sharding).
On a hypothetical exercise_log table:
"We can solve this problem by instead using the user_id as the hashed shard key. Each user_id will produce the same hash, and thus get sent to the same server. This means that when a user adds log events or reads their log, it will all hit the same server." (Source: sources/2026-04-21-planetscale-dealing-with-large-tables)
The structural test is: what fraction of the workload is "all X for user Y"-style queries? If it's the dominant fraction, the shard key should be user_id. If the dominant fraction is "all X in time range T", the shard key is something else — probably time-bucketed — and you accept that per-user queries become cross-shard.
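The structural test lends itself to a small workload audit. The query log below is invented for illustration; the point is the mechanic of measuring what fraction of queries filter on each candidate column before choosing the key.

```python
from collections import Counter

# Hypothetical query log: each entry is the set of columns
# appearing in one query's predicate.
query_log = [
    {"user_id"}, {"user_id"}, {"user_id"}, {"user_id"},
    {"user_id", "logged_at"}, {"logged_at"},
]


def predicate_coverage(log):
    """Fraction of queries whose predicate contains each candidate column."""
    counts = Counter(col for query in log for col in query)
    return {col: n / len(log) for col, n in counts.items()}


coverage = predicate_coverage(query_log)
# user_id appears in 5 of 6 queries, so it is the
# query-pattern-aligned candidate for this workload.
best = max(coverage, key=coverage.get)
```

If the log instead showed time-range predicates dominating, the same audit would point at a time-bucketed key, and per-user queries would become the accepted cross-shard tax.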
Figma's production instance¶
Figma's production sharding design names three shard keys — UserID, FileID, OrgID — because no single key covers every workload. Tables sharing a shard key are grouped into colos; cross-colo queries are expected and tolerated. The explicit framing on the source page: "we can answer the dominant query on each colo with a single-shard query; cross-colo queries are rare and planned" (Source: sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale).
The Figma lesson is that a handful of query-pattern-aligned shard keys beats one synthetic key that aligns with nothing. The operational overhead of three colos is lower than the cross-shard tax of a single suboptimal key.
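The colo idea reduces to a table-to-shard-key mapping. The shard keys below are the three Figma names from the source; the table names and the grouping itself are hypothetical stand-ins for illustration.

```python
# Hypothetical colo layout: tables that share a shard key share
# physical layout, so joins among them can stay single-shard.
COLOS = {
    "UserID": {"users", "user_settings", "recent_files"},
    "FileID": {"files", "comments", "file_versions"},
    "OrgID":  {"orgs", "org_members", "billing"},
}


def colo_for(table: str) -> str:
    """Return the shard key of the colo that owns this table."""
    for shard_key, tables in COLOS.items():
        if table in tables:
            return shard_key
    raise KeyError(table)


def is_colo_local(*tables: str) -> bool:
    """A join is single-shard only when every table is in the same colo."""
    return len({colo_for(t) for t in tables}) == 1


local = is_colo_local("files", "comments")   # same colo: single-shard join
cross = is_colo_local("users", "files")      # cross-colo: rare and planned
```

Each colo answers its own dominant query with a single-shard query; anything spanning colos is the tolerated exception the source describes.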
Three criteria, not one¶
The pattern is necessary but not sufficient. The shard key must also satisfy:
- Cardinality — enough distinct values that hash sharding distributes rows evenly. A query-pattern-aligned key with only 10 distinct values creates 10 mega-hotspots.
- Volatility — the shard key should be immutable (or near-immutable). Mutating a row's shard key means physically migrating it to a different shard — an expensive, complex operation that most Vitess-style systems don't support automatically.
- Query-pattern alignment — this pattern. The dominant query's predicate must include the shard key.
A well-chosen shard key satisfies all three. When they conflict, the operational choice is typically: cardinality first (can't be worked around), volatility second (can sometimes be worked around with lookup vindexes), query-pattern third (can be partially worked around with cross-shard queries + caching).
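The three criteria can be expressed as a screening function. The thresholds and the diagnostic strings below are illustrative assumptions, not values from any source; the point is that a candidate must clear all three checks, in the priority order just described.

```python
def evaluate_shard_key(distinct_values: int,
                       is_mutable: bool,
                       dominant_query_coverage: float,
                       min_cardinality: int = 10_000,
                       min_coverage: float = 0.5) -> list[str]:
    """Return the list of criteria a candidate shard key fails
    (empty list = passes all three). Thresholds are illustrative."""
    issues = []
    if distinct_values < min_cardinality:
        issues.append("cardinality")   # too few values for even hashing
    if is_mutable:
        issues.append("volatility")    # mutation forces cross-shard row moves
    if dominant_query_coverage < min_coverage:
        issues.append("alignment")     # dominant query rarely supplies the key
    return issues


# An auto-increment primary key: high cardinality, immutable,
# but orthogonal to the dominant per-user query.
pk_issues = evaluate_shard_key(10_000_000, False, 0.02)
```

A key like `user_id` in the Dicken examples would return an empty list; the auto-increment primary key fails only on alignment, which is exactly the trap the next section names.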
Anti-patterns¶
- Shard by primary key because it's unique and high-cardinality. Auto-increment primary keys are the classic trap: high cardinality but often orthogonal to the dominant query. Produces even distribution but forces every per-entity query to scatter-gather.
- Shard by created_at because the table is time-series. Time-series tables' dominant query is usually "all X in a given time window", which range-sharding by time serves well — but it produces a write frontier, because every insert targets the current-time shard.
- Shard by the same key as the foreign-key column without checking alignment. Joining orders (sharded by order_id) to users (sharded by user_id) doesn't give you shard-local joins — the tables need to share a colo by shard key, not by foreign-key direction.
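The foreign-key trap is easy to see numerically. The routing below uses a plain modulo instead of a real hash so the shard assignments are readable; the IDs are invented, and real systems (e.g. Vitess vindexes) hash the key first.

```python
NUM_SHARDS = 4


def shard(value: int) -> int:
    # Illustrative routing; real systems hash the key before taking the modulo.
    return value % NUM_SHARDS


user_id = 7
order_ids = [1001, 1002, 1003, 1004]  # one user's orders

# Sharding orders by order_id spreads a single user's rows across
# shards, so joining to users (sharded by user_id) is never shard-local.
shards_by_order_id = {shard(o) for o in order_ids}   # 4 different shards

# Sharding orders by user_id co-locates every order with its owner's row.
shards_by_user_id = {shard(user_id) for _ in order_ids}  # one shard
```

Even distribution is not the goal by itself: by order_id the rows spread perfectly, and that is precisely what breaks the per-user join.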
When the dominant query changes¶
The pattern is stable but the workload isn't. If the dominant query shifts (a new feature becomes the hot path, a read-pattern changes after a product pivot), the shard-key choice may become misaligned and cross-shard queries start dominating. The architectural response is typically:
- Add a secondary lookup Vindex to map the new dominant query's predicate to shard IDs, at the cost of an extra hop per query.
- Re-shard the table onto a new shard key — a multi-week operation using MoveTables + Reshard, with a VDiff-verified cutover.
- Accept the cross-shard tax — if the new dominant query isn't actually that hot, scatter-gather may be cheaper than re-sharding.
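The first option, a secondary lookup Vindex, adds one indirection hop. In Vitess the lookup is itself a table; in this sketch a dict stands in for it, and the idea that email becomes the new dominant predicate is an invented scenario.

```python
import hashlib

NUM_SHARDS = 4


def shard_for(key: str) -> int:
    """Normal shard-key routing via a stable hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS


# Hypothetical lookup Vindex: maps the new dominant predicate (email)
# back to the real shard key (user_id). A dict stands in for what
# would be a separate, possibly sharded, lookup table.
email_to_user = {"a@example.com": "u1", "b@example.com": "u2"}


def route_by_email(email: str) -> int:
    user_id = email_to_user[email]  # hop 1: consult the lookup Vindex
    return shard_for(user_id)       # hop 2: normal shard-key routing
```

The extra hop is the cost named above: every email-predicate query now pays a lookup round-trip before it can be routed to a single shard.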
This is why shard-key selection is a high-stakes, hard-to-reverse decision that warrants up-front workload analysis (top 5-10 queries by frequency + latency budget), not just schema inspection.
Seen in¶
- sources/2026-04-21-planetscale-database-sharding — Ben Dicken's canonical pedagogical treatment. Three shard-key selection criteria (cardinality / volatility / query-pattern alignment) named explicitly. Worked examples: steps sharded by step_count (misaligned, scatter-gather) vs steps sharded by user_id (aligned, single-shard).
- sources/2026-04-21-planetscale-dealing-with-large-tables — Dicken's earlier post canonicalises the same principle via the exercise_log table: hashing by log_id is "terrible for performance" on per-user reads; hashing by user_id routes every "show me this user's logs" query to a single shard.
- sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale — Figma's production three-shard-key design (UserID, FileID, OrgID) as the alternative to a single synthetic key that aligns with nothing. Colo framing: shared shard key = shared physical layout = single-shard queries + full transaction support within the colo.