Skip to content

SYSTEM Cited by 6 sources

Apache Cassandra

Apache Cassandra is a distributed, wide-column NoSQL database originating at Facebook in 2008 (Lakshman + Malik) and donated to the Apache Software Foundation in 2009. Design lineage: Amazon Dynamo (for replication + partitioning via consistent hashing) + Google Bigtable (for the wide-column data model). Eventually-consistent by default, with tunable per-query consistency levels.

Within this wiki, Cassandra is the canonical on-the-record example of a gossip-driven cluster membership layer running underneath a user-facing distributed database. Unlike the Fly.io Corrosion write-up, the wiki so far covers Cassandra only via third-party explainers — any page citing Cassandra's specific production numbers should trace to the canonical source (the Cassandra wiki's Architecture Gossip page) or to a Cassandra-operator post.

Why it uses gossip

Cassandra uses gossip for three distinct purposes (Source: sources/2023-07-16-highscalability-gossip-protocol-explained):

  1. Cluster membership — which nodes are in the cluster, what tokens do they own, what's their schema version?
  2. Token-assignment metadata transfer — the consistent-hash ring token layout is propagated via gossip, so every node can route a key to its owning nodes without a coordinator.
  3. Failure detection — Cassandra uses a phi-accrual detector over gossip heartbeats, not a fixed timeout.

Additionally, Cassandra uses Merkle trees for anti-entropy read-repair (nodetool repair) — a distinct channel from the gossip layer, though architecturally adjacent.

Gossip message shape

Cassandra's gossip round is a three-message SYN → ACK → ACK2 exchange (push-pull, see patterns/push-pull-gossip). Each message carries a list of EndPointState records of the form described by sources/2023-07-16-highscalability-gossip-protocol-explained:

EndPointState: 10.0.1.42
HeartBeatState: generation: 1259904231, version: 761
ApplicationState: "average-load": 2.4, generation: 1659909691, version: 42
ApplicationState: "bootstrapping": pxLpassF9XD8Kymj, generation: 1259909615, version: 90

(generation, version) is the partial-order key — see concepts/heartbeat-counter.

Seen in

(1) Cassandra introspection used as a control-loop signal: the team exposes nodetool tablehistograms percentile distributions through a Cassandra virtual table, polled by a DynamicTimeSliceConfigWorker that detects partition-size drift outside a configured 2–10 MiB density window and rewrites the partition strategy used for future Time Slices. Live example fixes over-partitioning on Cassandra (60-second time buckets producing < 10 KB partitions, generating high read amplification + thread queueing) by widening the time bucket to 7 days (time_bucket interval: 60s -> 604800s). Past slices are not rewritten — only future ones get the new config. Canonicalised as patterns/auto-tuning-control-loop-on-storage-histograms.

(2) Per-partition dynamic splitting that targets individual wide partitions by TimeSeries ID rather than the table as a whole. Three-stage async pipeline (patterns/dynamic-partition-split-async-pipeline): Detection on the read path emits a Kafka event; a Planner reads the entire partition once (resumable from checkpoint) and stores a pre-split checksum; a Splitter writes the split partitions to a separately-named time-slice table (e.g. wide_data_20260328_0) using EventBucketPartitionSplitStrategy (capped on ultra-wide to control read amplification) and matches a post-split checksum (concepts/checksum-validated-data-migration) before flipping status to COMPLETED. Read path consults an in-memory Bloom filter (concepts/bloom-filter — single-digit-microsecond check) before doing a wide_row metadata-table lookup (read-through cached) that returns both pre-split and post-split locations. The same schema is reused for the split table, allowing the existing PartitionReader to be delegated to. Original partition is never deleted — preserved as fallback for partial-failure / eventual-consistency / rollback safety (patterns/keep-original-partition-as-fallback-during-split).

Cassandra failure modes named explicitly: wide partitions cause seconds-scale tail latency, GC pauses, high CPU, thread queueing, and read timeouts; over-partitioning causes high read amplification and thread queueing; both fall under the umbrella of wide-partition problem / over-partitioning. Operational outcomes: on TimeSeries datasets, average wide-partition read latency dropped from seconds to low double-digit ms; tail latency dropped from several seconds to ~200 ms; near-zero read timeouts; 500 MB+ partitions paginated successfully while available. Future work named: splitting mutable wide partitions (deferred under reduce-surface-area discipline). Canonicalised on concepts/dynamic-partition-splitting / 7+ new pattern pages.

Last updated · 542 distilled / 1,571 read