NETFLIX 2026-06-19

The Evolution of Cassandra Data Movement at Netflix¶

Summary¶

Netflix describes the complete replacement of Casspactor — their legacy Cassandra-to-Iceberg data-movement engine — with a new layered architecture built on Apache Cassandra Analytics and the internal "Move Data" framework. The legacy system processed ~1,200 data movements per day, transferring ~3 PB from Cassandra into Iceberg tables. The new architecture eliminates fragile multi-service metadata dependencies by reading directly from S3 backups, handles wide/skewed partitions without OOM errors, removes costly intermediate Iceberg tables, and provides time-travel capabilities. The migration was executed using a three-pillar strategy (Validation, Visibility, Safety) anchored by a Decider Pattern implemented via Maestro that enabled zero-impact fallback to the legacy system during rollout.

Key Takeaways¶

Backup-as-source-of-truth: Casspactor assembled backup metadata from multiple independent services (each with its own failure modes), causing silent data divergence. The new system reads backup existence and completeness metadata directly from S3, collapsing the dependency chain to a single source of truth.
Layered connector architecture: The new stack separates concerns into (a) a Cassandra Analytics Wrapper that reads SSTables from S3 backups and produces standard Spark DataFrames, and (b) a Connector Factory where each data abstraction (Key Value, Time Series, etc.) builds model-aware connectors over those DataFrames — eliminating expensive post-processing stages.
Wide-partition handling: By moving mutation compaction to the Spark executor level, the new engine handles highly skewed partitions without excessive data shuffling, resolving OOM failures that plagued Casspactor on Netflix's largest datasets.
Elimination of intermediate tables: Casspactor wrote to intermediate Iceberg tables; higher-level connectors (KV, etc.) added more intermediates. The new architecture directly produces Spark DataFrames → final Iceberg output, saving millions of USD in storage costs.
Time travel: The new system processes schema, cluster topology, and data as a cohesive unit at a specific point in time, enabling audit, debugging, and disaster recovery of prior states — impossible with Casspactor's multi-service metadata composition.
Auto-sizing: Jobs dynamically adjust resource consumption (executors, memory) based on source table characteristics, removing manual tuning burden.
Like-for-like migration contract: The migration maintained absolute consistency across user-facing interfaces, output contracts, and final data artifacts — same schema, metadata, and data in destination Iceberg tables. This transformed a distributed multi-team effort into an internal platform implementation detail.
Shadow validation (Pillar 1): New connector ran in parallel ("shadow" mode) with production Casspactor jobs. Trust metric: prove C = M (row-by-row set equality between legacy and new output). Any difference triggered immediate investigation.
Decider Pattern via Maestro (Pillar 3): A Decider step in the Maestro workflow invokes a Connector Controller registry to route jobs dynamically between legacy and new connectors. On new-connector failure, the workflow immediately falls back to Casspactor — zero user impact, slightly longer runtime at worst.
Scale numbers: ~1,200 data movements/day; ~3 PB transferred daily from Cassandra to Iceberg; cost savings in the order of USD millions from eliminating intermediates.

Systems & Concepts Extracted¶

Systems¶

Casspactor (legacy) — Netflix's previous Cassandra-to-Iceberg engine
Netflix Data Bridge — unified management plane for batch data movement
Cassandra Analytics Wrapper — new core S3-reading layer built on open-source Cassandra Analytics
Move Data connector — new Cassandra-to-Iceberg connector atop the wrapper
Connector Controller — dynamic registry/control plane for connector routing
Maestro — Netflix's workflow orchestrator implementing the Decider Pattern
Apache Cassandra, Apache Iceberg, Apache Spark, Amazon S3

Concepts¶

SSTable backup-based data movement (sidecar uploads to S3)
Metadata dependency fragility
S3 as single source of truth for backup completeness
Connector Factory / layered connector architecture
Auto-sizing based on source table characteristics
Time travel (schema + topology + data as cohesive unit)

Patterns¶

Decider Pattern — workflow routing via control-plane registry + automatic fallback
Shadow validation — parallel execution for row-by-row equality proof
Like-for-like migration — preserve external contracts, change internals only
Connector Factory — shared DataFrame foundation, model-aware connectors on top
Fallback on new connector failure — conditional Maestro step for safety

Operational Numbers¶

Metric	Value
Data movements/day	~1,200
Daily transfer volume	~3 PB
Cost savings	USD millions (from intermediate table elimination)
Validation target	100% row-level similarity (C = M)

Caveats¶

The post does not disclose specific latency numbers for the new vs. old connector.
No quantified reliability improvement (error-rate reduction) is given.
The Cassandra Analytics open-source project is referenced but architectural details of Netflix's wrapper extensions are not fully disclosed.
Exact auto-sizing heuristics are not described.

Source¶

systems/netflix-data-bridge — the unified management plane this connector operates within
systems/apache-cassandra — the source database
systems/apache-iceberg — the target table format
systems/netflix-maestro — the workflow orchestrator implementing the Decider Pattern
systems/netflix-kv-dal — one of the data abstractions consuming this connector
systems/netflix-timeseries-abstraction — another consuming abstraction
patterns/shadow-migration — related shadow-mode pattern in the wiki
concepts/wide-partition-problem — the problem the new engine solves