SYSTEM Cited by 6 sources
Apache Cassandra¶
Apache Cassandra is a distributed, wide-column NoSQL database originating at Facebook in 2008 (Lakshman + Malik) and donated to the Apache Software Foundation in 2009. Design lineage: Amazon Dynamo (for replication + partitioning via consistent hashing) + Google Bigtable (for the wide-column data model). Eventually-consistent by default, with tunable per-query consistency levels.
Within this wiki, Cassandra is the canonical on-the-record example of a gossip-driven cluster membership layer running underneath a user-facing distributed database. Unlike the Fly.io Corrosion write-up, the wiki so far covers Cassandra only via third-party explainers — any page citing Cassandra's specific production numbers should trace to the canonical source (the Cassandra wiki's Architecture Gossip page) or to a Cassandra-operator post.
Why it uses gossip¶
Cassandra uses gossip for three distinct purposes (Source: sources/2023-07-16-highscalability-gossip-protocol-explained):
- Cluster membership — which nodes are in the cluster, what tokens do they own, what's their schema version?
- Token-assignment metadata transfer — the consistent-hash ring token layout is propagated via gossip, so every node can route a key to its owning nodes without a coordinator.
- Failure detection — Cassandra uses a phi-accrual detector over gossip heartbeats, not a fixed timeout.
Additionally, Cassandra uses Merkle trees for anti-entropy read-repair (nodetool repair) — a distinct channel from the gossip layer, though architecturally adjacent.
Gossip message shape¶
Cassandra's gossip round is a three-message SYN → ACK → ACK2 exchange (push-pull, see patterns/push-pull-gossip). Each message carries a list of EndPointState records of the form described by sources/2023-07-16-highscalability-gossip-protocol-explained:
EndPointState: 10.0.1.42
HeartBeatState: generation: 1259904231, version: 761
ApplicationState: "average-load": 2.4, generation: 1659909691, version: 42
ApplicationState: "bootstrapping": pxLpassF9XD8Kymj, generation: 1259909615, version: 90
(generation, version) is the partial-order key — see concepts/heartbeat-counter.
Seen in¶
- sources/2023-07-16-highscalability-gossip-protocol-explained — named as the canonical gossip deployment; EndPointState shape reproduced; three gossip duties (membership, token, failure detection) enumerated.
- sources/2023-02-22-highscalability-consistent-hashing-algorithm — named among canonical consistent-hashing deployments.
- sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer — Cassandra as the canonical backing engine for Netflix's KV Data Abstraction Layer. KV's uniform two-level map maps directly onto Cassandra's partition-key + clustering-column model; the DDL is given explicitly (
PRIMARY KEY (id, key) WITH CLUSTERING ORDER BY (key)). The post anchors three Cassandra-specific production disciplines Netflix layered into KV DAL: (1) Client-generated monotonic idempotency tokens make hedged/retried writes safe on Cassandra's last-write-wins merge, with the empirical safety claim that "our tests on EC2 Nitro instances show drift is minimal (under 1 millisecond)." KV servers reject writes with large drift to prevent both silent discards (past) and immutable doomstones (future). (2) Tombstone-cost discipline: record-level and range deletes emit one tombstone; item-level deletes fall back to TTL-with-jitter to stagger compaction load — explicit mitigation of Cassandra's well-known high-item-delete pathology. (3) Wide-partition + fat-column management via transparent chunking > 1 MiB — only id/key/metadata stays in the main table, large values split into chunks in a separately-partitioned chunk store (which can itself be Cassandra with a different partition scheme), atomicity bound by one idempotency token. - sources/2026-04-04-netflix-powering-multimodal-intelligence-for-video-search — Cassandra serves two roles in Netflix's multimodal video-search pipeline: (1) the transactional persistence layer underneath Marken, capturing raw per-model annotations (character recognition, scene detection, embeddings) from high-availability ingestion pipelines with "data integrity and high-speed write throughput" as the design posture; (2) the target store for enriched temporal-bucket records written back by the offline-fusion stage — "written back to Cassandra as distinct entities, creating a highly optimized, second-by-second index of multi-modal intersections." Canonical wiki instance of Cassandra as both raw-ingest + fused-state substrate in one pipeline; see patterns/three-stage-ingest-fusion-index and concepts/multimodal-annotation-intersection. Schema details (partition keys, clustering, TTL) not disclosed in this post — linked to a 2021 "Scalable Annotation Service: Marken" post not yet on the wiki.
- sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade
— canonical wiki first-party operator retrospective on a
production Cassandra major-version upgrade. Yelp Database
Reliability Engineering upgraded > 1,000 Cassandra nodes
from 3.11 to 4.1 on Kubernetes with
zero downtime and zero client-code changes. Canonicalises:
the in-place vs new-DC choice (in-place wins at fleet
scale on time, cost, and
EACH_QUORUMpreservation grounds — see concepts/in-place-vs-new-dc-upgrade); the init-container IP-gossip pre-migration trick (CASSANDRA-19244) to sequence IP and version changes when pods get new IPs on restart (see concepts/init-container-ip-gossip-pre-migration); the dual-run version-specific proxy topology for Stargate around Cassandra 4.1'sMigrationCoordinatorschema-fetch behaviour change (see patterns/dual-run-version-specific-proxies); the CDC commit-log write-point change (flush → mutation,CASSANDRA-12148) that breaks any CDC consumer written against 3.x (see concepts/cassandra-cdc-commit-log and systems/cassandra-source-connector); post-upgrade schema disagreement on CDC-enabled clusters remediated by dummy multi-node schema changes (see concepts/schema-disagreement); transient mixed-version latency that self-resolves on homogenisation (see concepts/mixed-version-cluster and concepts/performance-regression-from-mid-upgrade-state). Named Cassandra 4.1 wins: non-disruptive seed-list reload (CASSANDRA-14190) → faster gossip convergence + node restart; usable incremental repairs (CASSANDRA-9143fix); Java 8 → 11; hot-reloadable SSL certs (CEP-9); guardrails framework; denylisted partitions for noisy-neighbour mitigation; full query logging + audit trails; path to Cassandra 5 (ACID + vector search). Production-measured: up to 58% p99 latency reduction on key clusters; own-environment benchmark first measured 4% p99 + 11% mean-latency + 11% throughput improvements. Canonicalised as the wiki's operator-side Cassandra reference — every other Cassandra Seen-in to date is third-party. - sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads — First wiki canonical disclosure of how an operator team fights wide partitions on Cassandra 4.x at petabyte scale without rewriting the table. Netflix TimeSeries Abstraction team disclosure of two complementary mechanisms.
(1) Cassandra introspection used as a control-loop signal:
the team exposes nodetool tablehistograms percentile
distributions through a Cassandra virtual table, polled by
a DynamicTimeSliceConfigWorker that detects partition-size
drift outside a configured 2–10 MiB density window and rewrites
the partition strategy used for future Time Slices. Live
example fixes over-partitioning
on Cassandra (60-second time buckets producing < 10 KB
partitions, generating high read amplification + thread
queueing) by widening the time bucket to 7 days
(time_bucket interval: 60s -> 604800s). Past slices are not
rewritten — only future ones get the new config. Canonicalised
as patterns/auto-tuning-control-loop-on-storage-histograms.
(2) Per-partition dynamic splitting that targets individual
wide partitions by TimeSeries ID rather than the table as a whole.
Three-stage async pipeline (patterns/dynamic-partition-split-async-pipeline):
Detection on the read path emits a Kafka event; a Planner reads
the entire partition once (resumable from checkpoint) and stores
a pre-split checksum; a Splitter writes the split partitions to
a separately-named time-slice table (e.g. wide_data_20260328_0)
using EventBucketPartitionSplitStrategy (capped on ultra-wide
to control read amplification)
and matches a post-split checksum
(concepts/checksum-validated-data-migration) before flipping
status to COMPLETED. Read path consults an in-memory Bloom filter
(concepts/bloom-filter — single-digit-microsecond check)
before doing a wide_row metadata-table lookup (read-through cached)
that returns both pre-split and post-split locations. The same
schema is reused for the split table, allowing the existing
PartitionReader to be delegated to. Original partition is
never deleted — preserved as fallback for partial-failure /
eventual-consistency / rollback safety
(patterns/keep-original-partition-as-fallback-during-split).
Cassandra failure modes named explicitly: wide partitions cause seconds-scale tail latency, GC pauses, high CPU, thread queueing, and read timeouts; over-partitioning causes high read amplification and thread queueing; both fall under the umbrella of wide-partition problem / over-partitioning. Operational outcomes: on TimeSeries datasets, average wide-partition read latency dropped from seconds to low double-digit ms; tail latency dropped from several seconds to ~200 ms; near-zero read timeouts; 500 MB+ partitions paginated successfully while available. Future work named: splitting mutable wide partitions (deferred under reduce-surface-area discipline). Canonicalised on concepts/dynamic-partition-splitting / 7+ new pattern pages.
Related¶
- systems/amazon-dynamo — design ancestor (replication + consistent hashing).
- systems/netflix-kv-dal — canonical Netflix abstraction service fronting Cassandra + EVCache + DynamoDB + RocksDB.
- systems/netflix-marken — Netflix annotation service; Cassandra-backed transactional gate for multimodal video-search ingest.
- concepts/gossip-protocol
- concepts/anti-entropy
- concepts/heartbeat-counter
- concepts/consistent-hashing
- concepts/merkle-tree
- concepts/tombstone · concepts/ttl-based-deletion-with-jitter · concepts/wide-partition-problem · concepts/idempotency-token · concepts/dynamic-partition-splitting · concepts/over-partitioning · concepts/read-side-detection-of-storage-pathology · concepts/immutable-partition · concepts/checksum-validated-data-migration · concepts/bloom-filter · concepts/read-amplification
- patterns/dynamic-partition-split-async-pipeline · patterns/auto-tuning-control-loop-on-storage-histograms · patterns/keep-original-partition-as-fallback-during-split · patterns/bloom-filter-redirect-to-split-partition · patterns/shadow-mode-bytes-comparison · patterns/phased-rollout-of-read-mode · patterns/partial-return-on-slo-breach — TimeSeries 2026-06-03 wide-partition-fight patterns.
- systems/swim-protocol
- systems/stargate-cassandra-proxy · systems/cassandra-source-connector · systems/kubernetes-init-containers — ecosystem components from Yelp's upgrade.
- concepts/rolling-upgrade · concepts/mixed-version-cluster · concepts/in-place-vs-new-dc-upgrade · concepts/cassandra-cdc-commit-log · concepts/schema-disagreement · concepts/init-container-ip-gossip-pre-migration · concepts/anti-entropy-repair-pause · concepts/performance-regression-from-mid-upgrade-state — upgrade-time concepts.
- patterns/version-specific-images-per-git-branch · patterns/dual-run-version-specific-proxies · patterns/pre-flight-flight-post-flight-upgrade-stages · patterns/benchmark-in-own-environment-before-upgrade · patterns/production-qualification-criteria-upfront — upgrade patterns.