CONCEPT Cited by 2 sources
Wide partition problem¶
A wide partition is a storage anti-pattern in Cassandra (and, more generally, in any partition-key-oriented store) where a single partition grows so large that reads against it become slow, memory-intensive, or fail outright. In Cassandra specifically, a partition lives on one replica set, and range reads must scan the partition's rows, tombstones, and metadata, so "too wide" translates into memory-heavy reads, high GC pressure, and latency regressions.
Why the two-level-map data model exposes this¶
Netflix's KV DAL uses a
two-level-map shape
(HashMap<String, SortedMap<Bytes, Bytes>>) where everything
under a single id maps to one Cassandra partition. The data
model is ergonomic — one id bundles "all facts about this thing"
— but it puts the onus on application authors (or the DAL) to
prevent any one id from growing unboundedly.
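As a rough illustration of that shape, the sketch below models the two-level map in memory and reports how many bytes sit under one id, which is the quantity that becomes "partition width" once the id is used as the partition key. The class and method names (TwoLevelMap, put, scan, partition_bytes) are hypothetical and are not the KV DAL's API.

```python
from collections import defaultdict


class TwoLevelMap:
    """Toy in-memory stand-in for HashMap<String, SortedMap<Bytes, Bytes>>.

    Hypothetical illustration only; not the KV DAL's API.
    """

    def __init__(self):
        # Outer level: id -> inner map. Everything under one id would
        # land in a single Cassandra partition.
        self._data = defaultdict(dict)

    def put(self, record_id: str, key: bytes, value: bytes) -> None:
        self._data[record_id][key] = value

    def scan(self, record_id: str):
        """Return the items under one id in key order (the sorted inner level)."""
        return sorted(self._data.get(record_id, {}).items())

    def partition_bytes(self, record_id: str) -> int:
        """Approximate width of one id: total bytes of its keys and values."""
        inner = self._data.get(record_id, {})
        return sum(len(k) + len(v) for k, v in inner.items())
```

Nothing in this shape limits how many items one id can accumulate, so the bound has to come from the caller or from the layer above.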
Netflix names this explicitly as one of the failure modes the DAL was built to address:
"developers had to constantly re-learn new data modeling practices and common yet critical data access patterns. These include challenges with ... managing 'wide' partitions with many rows, handling single large 'fat' columns, and slow response pagination." (Source: sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer)
Two related failure shapes (Netflix framing)¶
- Wide partition: too many rows under one partition key. Scans and maintenance (repair, compaction) operate on the entire partition; in practice Cassandra becomes unstable once a partition reaches tens to low hundreds of MB.
- Fat column: a single row whose value blob is gigantic. Similar memory-pressure failure, but localized to one cell rather than spread across many rows.
Both are partition-scale anti-patterns on the same physical axis (how much must a single node hold for this logical unit?).
What the KV DAL does about it¶
KV's transparent
chunking addresses the fat-column case directly: items > 1
MiB have their bodies split into separately-partitioned chunks,
with only id / key / metadata staying in the main table. The
large value no longer widens the partition.
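A minimal sketch of the chunking idea, assuming a fixed chunk size and a simple (id, key, chunk index) chunk address. Only the 1 MiB threshold comes from the source; CHUNK_SIZE, the row layout, and the function names are assumptions made for illustration.

```python
CHUNK_THRESHOLD = 1 << 20   # 1 MiB threshold, as stated in the source
CHUNK_SIZE = 64 * 1024      # assumed chunk size, not from the source


def split_value(item_id: str, key: bytes, value: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (main_row, chunk_rows) for one item.

    Small values stay inline in the main row. Large values are cut into
    chunks addressed by (id, key, chunk index), so their bytes live
    outside the main partition.
    """
    if len(value) <= CHUNK_THRESHOLD:
        return {"id": item_id, "key": key, "value": value, "chunks": 0}, []
    pieces = [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
    chunk_rows = [((item_id, key, n), piece) for n, piece in enumerate(pieces)]
    main_row = {"id": item_id, "key": key, "value": None, "chunks": len(pieces)}
    return main_row, chunk_rows


def reassemble(main_row, chunk_rows) -> bytes:
    """Rebuild the original value from the main row and its chunks."""
    if main_row["chunks"] == 0:
        return main_row["value"]
    ordered = sorted(chunk_rows, key=lambda row: row[0][2])
    return b"".join(piece for _, piece in ordered)
```

In this sketch the main row keeps only the id, key, and a chunk count, so the main partition grows by a small constant per item regardless of value size.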
The post hints at, but does not fully describe, remedies for the many-rows-under-one-id case. Its future-work list mentions:
"Summarization: Techniques to improve retrieval efficiency by summarizing records with many items into fewer backing rows."
— which reads as the DAL's planned response to the many-rows side of the problem.
Classic Cassandra-side symptoms¶
- nodetool cfstats reporting a very high Max partition size on a table.
- Coordinator-level timeouts on queries that scan whole partitions.
- GC storms on the replica owning the partition.
- Repair failures because the partition doesn't fit in the configured stream / anti-entropy buffer size.
- "Tombstone threshold exceeded" warnings if many deletes pile up in a single wide partition (see concepts/tombstone + concepts/ttl-based-deletion-with-jitter).
Common application-level remedies (outside KV DAL)¶
- Bucket the partition key — encode time / hash-prefix / rolling-index into the partition key so natural growth stays bounded per partition.
- Cap the clustering dimension — only keep the "n newest by key" under one partition; use range deletes (single-tombstone, see systems/netflix-kv-dal) to evict the rest.
- Promote to a different data model — sometimes the workload really is row-count unbounded and doesn't fit a single-partition shape; a different data model or engine is the right move.
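The first remedy, bucketing the partition key, can be sketched as follows. This is a minimal example assuming fixed-width time buckets and an "id#bucket_start" key format; the function names and the key format are hypothetical.

```python
from datetime import datetime, timezone


def bucketed_partition_key(entity_id: str, event_time: datetime, bucket_seconds: int) -> str:
    """Build a partition key that includes the start of the event's time bucket.

    Rows for one entity are spread over one partition per bucket, so a
    single partition only grows for bucket_seconds of wall-clock time.
    """
    epoch = int(event_time.timestamp())
    bucket_start = epoch - (epoch % bucket_seconds)
    return f"{entity_id}#{bucket_start}"


def buckets_for_range(entity_id: str, start: datetime, end: datetime, bucket_seconds: int):
    """List every partition key a reader must visit to cover [start, end]."""
    first = int(start.timestamp()) // bucket_seconds
    last = int(end.timestamp()) // bucket_seconds
    return [f"{entity_id}#{b * bucket_seconds}" for b in range(first, last + 1)]


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 0, 0, 30, tzinfo=timezone.utc)
    print(bucketed_partition_key("user-1", t, 3600))
```

The second function shows the cost side: a range read now fans out to one partition per bucket instead of one partition in total.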
Trade-offs¶
- Addressing wide partitions pushes complexity up the stack: bucketed keys mean cross-bucket scans, and summarisation means a compute tier that aggregates before storage. The DAL can centralize this, but the cost doesn't disappear.
- Caps are hard to enforce: the DAL can warn on wide-partition symptoms but can't unilaterally reject adversarial growth without breaking callers.
Seen in¶
- sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer: named alongside fat-column as one of the recurring data-modeling footguns the DAL exists to address; summarisation as planned future work.
- sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads: the wiki's first canonical, full account of how an operator team fights wide partitions on Cassandra at petabyte scale, disclosed by the Netflix TimeSeries Abstraction team. It gives a complete failure-mode enumeration on Cassandra 4.x:
"In extreme cases, if most of the reads target wide partitions, we can see Garbage Collection pauses, high CPU utilization and thread queueing."
Beyond GC pauses + CPU + thread queueing, the post documents seconds-scale tail latency and read timeouts. The team's response is a two-mechanism architecture rather than a single fix:
- Table-level auto-tuning control loop: DynamicTimeSliceConfigWorker polls partition-size histograms via Cassandra virtual tables mirroring nodetool tablehistograms, detects observed-vs-target drift (target window 2–10 MiB), and rewrites the partition strategy used for future Time Slices. Live example fixes over-partitioning (60-second time buckets producing < 10 KB partitions) by widening the time bucket to 7 days. Past slices are not rewritten. See patterns/auto-tuning-control-loop-on-storage-histograms.
- Per-ID dynamic partition splitting: async pipeline that detects wide partitions on the read path (concepts/read-side-detection-of-storage-pathology), splits immutable partitions only, validates with pre/post checksums, and serves reads via a Bloom-filter gate that routes to a separate split table while keeping the original as fallback (patterns/keep-original-partition-as-fallback-during-split). See concepts/dynamic-partition-splitting / patterns/dynamic-partition-split-async-pipeline.
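The drift-detection step of the table-level loop might look roughly like the sketch below. Only the 2–10 MiB target window comes from the source. The proportional-scaling heuristic, the midpoint target, the clamp bounds, and the name propose_bucket_seconds are assumptions and do not describe how DynamicTimeSliceConfigWorker actually decides.

```python
MIB = 1 << 20
TARGET_MIN = 2 * MIB    # lower edge of the target window (from the source)
TARGET_MAX = 10 * MIB   # upper edge of the target window (from the source)


def propose_bucket_seconds(observed_bytes: int,
                           current_bucket_seconds: int,
                           min_bucket: int = 60,
                           max_bucket: int = 7 * 24 * 3600) -> int:
    """Suggest a time-bucket width for FUTURE time slices.

    Assumed heuristic: partition size scales linearly with bucket width,
    so rescale the width to aim at the middle of the target window.
    """
    if TARGET_MIN <= observed_bytes <= TARGET_MAX:
        return current_bucket_seconds          # within target, no drift
    target = (TARGET_MIN + TARGET_MAX) // 2
    # Integer arithmetic avoids float rounding surprises.
    proposed = current_bucket_seconds * target // max(observed_bytes, 1)
    return max(min_bucket, min(max_bucket, proposed))
```

The function returns a width for future slices only, matching the note above that past slices are not rewritten; a real loop would presumably also damp changes between runs.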
Mid-stack remedies for cases that don't qualify for dynamic
splitting: partial-return
on SLO breach for latency-prioritising clients, manual ID
block-listing (dgwts.config.<dataset>.block.Ids) for adversarial
IDs, and "do nothing" if app metrics aren't impacted.
Operational outcomes: average wide-partition read latency dropped "from seconds … to low double-digit milliseconds"; tail latency dropped "from several seconds … to around 200 ms or better"; near-zero read timeouts; 500 MB+ partitions paginated successfully while available. Trade-off: storage overhead from the never-deleted original partitions, plus detection lag during which the first reads on a newly-pathological partition still see wide-partition latency.
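The per-ID split path (checksum before and after, Bloom-filter gate on reads, original kept as fallback) can be sketched in memory as follows. This is a toy model under stated assumptions: the tables are plain dicts, and TinyBloom, checksum, split_partition, and read are hypothetical names, not the TimeSeries Abstraction's implementation.

```python
import hashlib


class TinyBloom:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.array |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.array >> pos) & 1 for pos in self._positions(item))


def checksum(rows) -> str:
    """Order-independent digest over (key, value) byte pairs."""
    digest = hashlib.sha256()
    for key, value in sorted(rows):
        for part in (key, value):
            digest.update(len(part).to_bytes(4, "big"))
            digest.update(part)
    return digest.hexdigest()


def split_partition(pid, original, split_table, bloom, rows_per_piece=2) -> bool:
    """Copy one immutable partition into smaller pieces; publish only if checksums match."""
    rows = sorted(original[pid])
    before = checksum(rows)
    pieces = [rows[i:i + rows_per_piece] for i in range(0, len(rows), rows_per_piece)]
    split_table[pid] = pieces
    written = [row for piece in split_table[pid] for row in piece]
    if checksum(written) != before:
        del split_table[pid]      # discard the bad copy; original is untouched
        return False
    bloom.add(pid)                # reads start routing to the split table
    return True


def read(pid, original, split_table, bloom):
    """Serve from the split table when the gate says so, else fall back."""
    if bloom.might_contain(pid) and pid in split_table:
        return [row for piece in split_table[pid] for row in piece]
    return sorted(original.get(pid, []))
```

The original partition is never deleted in this sketch, which mirrors the storage-overhead trade-off noted above; a Bloom false positive simply falls through to the original.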
Future work named: splitting mutable wide partitions (deferred under reduce-surface-area discipline; the post acknowledges "although splitting mutable partitions is possible, it is inherently more complex").
Related¶
- systems/apache-cassandra — the canonical engine where this failure mode bites.
- systems/netflix-kv-dal — the abstraction layer that treats wide partitions as a platform-level concern.
- systems/netflix-timeseries-abstraction — the canonical disclosure for how a TimeSeries-shaped workload fights wide partitions in production.
- concepts/two-level-map-kv-model: the data model whose id level is the partition-key axis.
- concepts/dynamic-partition-splitting: the runtime per-ID remediation.
- concepts/over-partitioning — the opposite failure mode (table-level too-small partitions).
- concepts/read-side-detection-of-storage-pathology — the detection-stage design choice.
- concepts/immutable-partition — load-bearing precondition for safe splitting.
- concepts/checksum-validated-data-migration — the correctness gate during splits.
- concepts/read-amplification · concepts/garbage-collection · concepts/tail-latency-at-scale — adjacent failure-shape concepts.
- patterns/transparent-chunking-large-values — solves the fat-column sibling failure.
- patterns/bucketed-event-time-partitioning — the provisioning-time prevention.
- patterns/dynamic-partition-split-async-pipeline · patterns/auto-tuning-control-loop-on-storage-histograms · patterns/keep-original-partition-as-fallback-during-split · patterns/bloom-filter-redirect-to-split-partition · patterns/partial-return-on-slo-breach — the 2026-06-03 wide-partition-fight pattern set.
- concepts/noisy-neighbor — a wide partition on a shared node is a noisy-neighbor source at the storage layer.