CONCEPT Cited by 1 source
Partition skew (data skew)¶
Definition¶
Partition skew — also known as data skew — is the streaming-broker failure mode where records are distributed unevenly across a topic's partitions, so some partitions see dramatically higher record rates, byte rates, or consumer lag than others. Under skew, adding more partitions stops helping: the scarce resource becomes the single hottest partition, not aggregate cluster capacity.
Amdahl's Law for streaming¶
Redpanda canonicalises the framing (Source: sources/2025-04-23-redpanda-need-for-speed-9-tips-to-supercharge-redpanda):
"Think of this as the data equivalent of Amdahl's Law: data skew is the enemy of parallelization, limiting the benefits of scaling out by using more partitions. If 90% of your data goes through a single partition, then whether you have 10 partitions or 50 won't really make a difference since that single overworked partition is your limiting factor."
Amdahl's Law (parallel computing): if fraction α of a workload is inherently serial, the maximum speedup is 1 / α regardless of core count. A workload that is 10% serial caps at 10× speedup no matter how many cores you add.

Applied to streaming: if fraction α of records hash to a single hot partition P, aggregate topic throughput is capped at throughput(P) / α. Scaling the partition count from N to 2N only helps the non-skewed 1 − α fraction; P (and therefore the bottleneck) doesn't split. More partitions don't help once the hot partition is already the bottleneck.
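The cap is easy to compute. A toy sketch (not from the source; the per-partition throughput figure and hot fraction are illustrative assumptions):

```python
# Toy model of the Amdahl's-Law-style throughput cap under partition skew.
# Assumption (not from the source): every partition sustains the same maximum
# per-partition throughput; only the fraction of records hitting the hot
# partition varies.

def max_topic_throughput(per_partition_tput: float, hot_fraction: float) -> float:
    """Aggregate cap when `hot_fraction` of records land on one partition.

    The hot partition saturates first, so topic throughput is capped at
    throughput(P) / alpha regardless of the total partition count.
    """
    return per_partition_tput / hot_fraction

# 10 MB/s per partition with 90% of records on one partition:
# the topic caps at ~11.1 MB/s whether it has 10 partitions or 50.
cap = max_topic_throughput(10.0, 0.9)
```

Doubling partitions never changes the result, because `hot_fraction` and the hot partition's ceiling are unchanged.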
Where skew comes from¶
The partition for a record is chosen by the client-side partitioner:
- Unkeyed records — partitioned by the sticky partitioner (Kafka default since 2.4) or classic round-robin. Neither produces skew — rotation or sticky-window batching spreads records uniformly.
- Keyed records — partitioned by `hash(key) mod N`. Skew arises when key frequencies are non-uniform — some keys appear in many more records than others. If one key dominates, all its records concentrate on one partition.
Common real-world key-skew sources:
- Tenant id in a multi-tenant system where one tenant is orders of magnitude larger (whale tenant).
- Sensor id / device id in IoT where most devices are low-traffic but a few are chatty.
- Customer id in B2B e-commerce where enterprise customers dwarf SMB.
- Constant or empty-string keys (legacy code paths) that send every record to the same partition. (A truly null key falls through to the unkeyed partitioner in modern Kafka clients; a hardcoded or empty key does not.)
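The whale-tenant case can be simulated in a few lines. A minimal sketch (illustrative key frequencies; CRC32 stands in for the client hash — Kafka's default partitioner actually uses murmur2):

```python
import zlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in for the client-side keyed partitioner: hash(key) mod N.
    CRC32 is used here for determinism; real clients use murmur2 or similar."""
    return zlib.crc32(key.encode()) % num_partitions

# Hypothetical whale-tenant workload: tenant-0 produces 90% of all records,
# 100 small tenants produce the remaining 10% between them.
records = ["tenant-0"] * 9_000 + [
    f"tenant-{i}" for i in range(1, 101) for _ in range(10)
]

per_partition = Counter(partition_for(k, 16) for k in records)
# The partition tenant-0 hashes to holds >= 90% of all 10,000 records,
# no matter how many partitions the topic has.
hot = max(per_partition.values())
```

Raising the partition count only re-shuffles the small tenants; the whale's 9,000 records still land on a single partition.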
Three-pronged mitigation¶
Redpanda's post enumerates three operator responses:
1. Use the sticky partitioner whenever possible. For workloads that don't require keyed partitioning, sticky distributes records across partitions in batch-sized chunks, giving an effectively uniform long-term distribution.
2. Only use keyed partitioners when strictly required. The post names CDC as the canonical justification — per-row ordering matters, so records for the same row must land on the same partition. Most application workloads don't actually need ordering at the key level.
3. When keys are required, pick high-cardinality keys. Verbatim:
"If keys are essential, and you have the option, try to pick keys that have the highest cardinality (those with the highest number of distinct values) for a good chance of distributing the keys well over the number of partitions." Canonicalised as the pattern patterns/high-cardinality-partition-key.
Detection¶
Partition skew is invisible at the aggregate-cluster-metrics altitude. You need per-partition visibility — specifically:
- Per-partition incoming record rate (from `vectorized_kafka_partition_leader_msgs` or equivalent).
- Per-partition byte rate (`vectorized_kafka_partition_leader_bytes`).
- Per-partition consumer lag (from the consumer-group offsets).
A topic whose 90th-percentile partition throughput is ≫ its 50th-percentile is skewed. A topic where one partition's lag grows steadily while the others stay caught up has consumer-side skew even if producer-side distribution was even (a slow consumer on one partition).
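The p90 ≫ p50 check is mechanical once you have the per-partition rates. A sketch (the function name is hypothetical; input is a list of per-partition record or byte rates scraped from the metrics above):

```python
from statistics import quantiles

def skew_ratio(partition_rates: list[float]) -> float:
    """p90 / p50 of per-partition rates. A ratio near 1 means an even
    distribution; a ratio >> 1 flags a skewed topic."""
    qs = quantiles(partition_rates, n=10, method="inclusive")
    return qs[8] / qs[4]  # qs holds the 10th..90th percentile cut points

# 16-partition topic, perfectly even:
even = skew_ratio([100.0] * 16)                            # -> 1.0
# Same topic with three hot partitions:
skewed = skew_ratio([100.0] * 13 + [1500.0, 1800.0, 2000.0])
```

Note that a single hot partition in a large topic sits above p90, so for small hotspot counts you may want max / p50 instead of p90 / p50.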
Why consumer skew compounds¶
The post notes:
"If your application doesn't use partitions evenly, it could start to bottleneck on skewed partitions — if not at the producers, then perhaps at the consumers, which can then lead to processing imbalances at the broker level."
The compounding chain: uneven produce ⇒ uneven partition sizes ⇒ uneven consumer work ⇒ uneven broker fetch-request patterns ⇒ noisy-neighbor-style contention on whichever broker leads the hot partitions.
Relationship to other skew concepts¶
- concepts/hot-key — key-level hotspot; partition skew is key-hotspot's partition-level manifestation.
- Sharding has the same problem at a different granularity — skewed shard keys produce hot shards. The mitigation (high-cardinality shard keys) is the same primitive.
- concepts/shard-key-cardinality — the cardinality threshold heuristic (10× to 100× more key values than shards) applies equally to partitioning: a key column with 50 distinct values can touch at most 50 of 128 partitions, leaving the rest permanently empty.
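The threshold heuristic reduces to a one-line check. A sketch (function name hypothetical; the 10× lower bound is the heuristic's floor, with 100× as the comfortable end of the range):

```python
def key_cardinality_ok(distinct_keys: int, num_partitions: int,
                       factor: int = 10) -> bool:
    """Lower-bound check from the shard-key-cardinality heuristic:
    want at least `factor` (10x, ideally up to 100x) more distinct
    key values than partitions for a reasonable spread."""
    return distinct_keys >= factor * num_partitions

key_cardinality_ok(50, 128)     # 50 << 1280  -> False
key_cardinality_ok(5_000, 128)  # 5000 >= 1280 -> True
```

Passing the check is necessary but not sufficient — a high-cardinality key with one whale value still skews, so the frequency distribution matters as much as the distinct count.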
Seen in¶
- sources/2025-04-23-redpanda-need-for-speed-9-tips-to-supercharge-redpanda — canonical wiki source. Amdahl's-Law framing; three-pronged mitigation (sticky partitioner / keyed-only-when-required / high-cardinality keys); consumer-compounding argument.
Related¶
- systems/kafka, systems/redpanda — Kafka-API brokers where the partition is the unit of parallelism.
- concepts/kafka-partition — the unit being skewed.
- concepts/sticky-partitioner — first-line mitigation for unkeyed workloads.
- concepts/keyed-partitioner — when keyed is required.
- concepts/shard-key-cardinality — the cardinality threshold heuristic at the sharding granularity.
- concepts/hot-key — finer-grain hotspot framing.
- concepts/horizontal-sharding — same problem, different granularity.
- patterns/high-cardinality-partition-key — the operational pattern for mitigation 3.