
Wide partition problem

A wide partition is a storage anti-pattern in Cassandra (or, more generally, any partition-key-oriented store) where a single partition grows so large that reads against it become slow, memory-intensive, or fail outright. In Cassandra specifically, a partition lives entirely on one replica set, and a range read must scan the whole partition's rows, tombstones, and metadata, so "too wide" translates into unbounded memory use on reads, high GC pressure, and latency regressions.

Why the two-level-map data model exposes this

Netflix's KV DAL uses a two-level-map shape (HashMap<String, SortedMap<Bytes, Bytes>>) where everything under a single id maps to one Cassandra partition. The data model is ergonomic — one id bundles "all facts about this thing" — but it puts the onus on application authors (or the DAL) to prevent any one id from growing unboundedly.
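A minimal sketch of that two-level-map shape, in Python rather than the DAL's actual Java types (the `put`/`get_range` names and the in-memory dict are illustrative assumptions, not the KV DAL API):

```python
# Hypothetical sketch: everything under one record id lands in one inner map,
# i.e. one Cassandra partition. Nothing here bounds the inner map's growth.
store: dict[str, dict[bytes, bytes]] = {}

def put(record_id: str, key: bytes, value: bytes) -> None:
    # All writes for a given id accumulate in the same partition.
    store.setdefault(record_id, {})[key] = value

def get_range(record_id: str, start: bytes, end: bytes) -> list[tuple[bytes, bytes]]:
    # A range read works over the partition's full sorted key set, which is
    # exactly why an unbounded id becomes a wide-partition problem.
    inner = store.get(record_id, {})
    return [(k, inner[k]) for k in sorted(inner) if start <= k < end]
```

The ergonomics are clear from the sketch: one `record_id` bundles everything, and nothing in the shape itself pushes back when that bundle grows without limit.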

Netflix names this explicitly as one of the failure modes the DAL was built to address:

"developers had to constantly re-learn new data modeling practices and common yet critical data access patterns. These include challenges with ... managing 'wide' partitions with many rows, handling single large 'fat' columns, and slow response pagination." (Source: sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer)

  • Wide partition — too many rows under one partition key. Scans and maintenance operations (repair, compaction) touch the entire partition; in practice Cassandra becomes unstable once partitions grow past tens to low hundreds of MB.
  • Fat column — a single row whose value blob is gigantic. Similar memory-pressure failure, but localized to one cell rather than spread across many rows.

Both are partition-scale anti-patterns on the same physical axis (how much must a single node hold for this logical unit?).
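A back-of-envelope check makes the shared axis concrete. The ~100 MB ceiling reflects the practical limit described above; the row counts and sizes below are illustrative assumptions, not measured values:

```python
# Rough partition-size arithmetic against a ~100 MB practical ceiling.
PRACTICAL_LIMIT_BYTES = 100 * 1024 * 1024

def partition_size_bytes(row_count: int, avg_row_bytes: int) -> int:
    # Both anti-patterns reduce to the same product: how many bytes must
    # a single replica hold for this one logical unit?
    return row_count * avg_row_bytes

# Wide partition: millions of small rows under one key.
wide = partition_size_bytes(row_count=5_000_000, avg_row_bytes=100)
# Fat column: a single row whose one value blob is enormous.
fat = partition_size_bytes(row_count=1, avg_row_bytes=500 * 1024 * 1024)
```

Either shape blows past the ceiling; only the distribution of the bytes differs.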

What the KV DAL does about it

KV's transparent chunking addresses the fat-column case directly: items > 1 MiB have their bodies split into separately-partitioned chunks, with only id / key / metadata staying in the main table. The large value no longer widens the partition.
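A sketch of how such chunking can work. The 1 MiB threshold and the id/key/metadata split follow the post; the `chunk_value` function, the `#`-joined key scheme, and the dict row layout are assumptions for illustration, not the real KV DAL storage format:

```python
# Hypothetical transparent-chunking sketch: values over the threshold are
# split into fixed-size chunks, each under its own partition key, so the
# main partition keeps only id / key / metadata.
CHUNK_SIZE = 1 * 1024 * 1024  # 1 MiB, per the post's threshold

def chunk_value(item_id: str, key: str, value: bytes):
    """Return (main_table_row, chunk_table_rows)."""
    if len(value) <= CHUNK_SIZE:
        # Small values stay inline; no chunking needed.
        return {"id": item_id, "key": key, "value": value}, []
    n_chunks = (len(value) + CHUNK_SIZE - 1) // CHUNK_SIZE
    chunks = [
        # A distinct partition key per chunk means no single partition
        # ever holds the whole blob.
        {"chunk_pk": f"{item_id}#{key}#{i}",
         "body": value[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]}
        for i in range(n_chunks)
    ]
    meta = {"id": item_id, "key": key, "value": None, "num_chunks": n_chunks}
    return meta, chunks
```

Reads then reverse the process: fetch the metadata row, fan out to the chunk partitions, and reassemble.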

The post hints at, but does not fully describe, remedies for the many-rows-under-one-id case. Its future-work list mentions:

"Summarization: Techniques to improve retrieval efficiency by summarizing records with many items into fewer backing rows."

— which reads as the DAL's planned response to the many-rows side of the problem.

Classic Cassandra-side symptoms

  • nodetool tablestats (formerly cfstats) reporting a very high maximum partition size for a table.
  • Coordinator-level timeouts on queries that scan whole partitions.
  • GC storms on the replica owning the partition.
  • Repair failures because the partition doesn't fit in the configured stream / anti-entropy buffer size.
  • "Tombstone threshold exceeded" warnings if many deletes pile up in a single wide partition (see concepts/tombstone + concepts/ttl-based-deletion-with-jitter).

Common application-level remedies (outside KV DAL)

  • Bucket the partition key — encode time / hash-prefix / rolling-index into the partition key so natural growth stays bounded per partition.
  • Cap the clustering dimension — only keep the "n newest by key" under one partition; use range deletes (single-tombstone, see systems/netflix-kv-dal) to evict the rest.
  • Promote to a different data model — sometimes the workload really is row-count unbounded and doesn't fit a single-partition shape; a different data model or engine is the right move.
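The first remedy, bucketing, can be sketched as follows. This is an illustration of the general technique, not KV DAL code; the daily bucket granularity and the `#` key separator are arbitrary assumptions:

```python
# Illustrative partition-key bucketing: fold a time bucket into the partition
# key so natural growth stays bounded per partition.
from datetime import date, datetime, timedelta

def bucketed_partition_key(record_id: str, ts: datetime) -> str:
    # One partition per (id, day): a year of writes spreads over ~365
    # partitions instead of one ever-widening partition.
    return f"{record_id}#{ts.strftime('%Y-%m-%d')}"

def buckets_for_range(record_id: str, start: date, end: date) -> list[str]:
    # The trade-off surfaces here: a range read must now fan out across
    # every bucket the range touches.
    keys = []
    d = start
    while d <= end:
        keys.append(f"{record_id}#{d.isoformat()}")
        d += timedelta(days=1)
    return keys
```

Choosing the bucket granularity is the design decision: too coarse and partitions still grow wide; too fine and every read fans out across many partitions.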

Trade-offs

  • Addressing wide partitions pushes complexity up the stack: bucketed keys mean cross-bucket scans; summarization means a compute tier that aggregates before storage. The DAL can centralize this complexity, but the cost doesn't disappear.
  • Enforcement is hard to cap: the DAL can warn on wide-partition symptoms but can't unilaterally reject adversarial growth without breaking callers.
