Skip to content

NETFLIX 2026-06-19

Read original ↗

The Evolution of Cassandra Data Movement at Netflix

Summary

Netflix describes the complete replacement of Casspactor — their legacy Cassandra-to-Iceberg data-movement engine — with a new layered architecture built on Apache Cassandra Analytics and the internal "Move Data" framework. The legacy system processed ~1,200 data movements per day, transferring ~3 PB from Cassandra into Iceberg tables. The new architecture eliminates fragile multi-service metadata dependencies by reading directly from S3 backups, handles wide/skewed partitions without OOM errors, removes costly intermediate Iceberg tables, and provides time-travel capabilities. The migration was executed using a three-pillar strategy (Validation, Visibility, Safety) anchored by a Decider Pattern implemented via Maestro that enabled zero-impact fallback to the legacy system during rollout.

Key Takeaways

  1. Backup-as-source-of-truth: Casspactor assembled backup metadata from multiple independent services (each with its own failure modes), causing silent data divergence. The new system reads backup existence and completeness metadata directly from S3, collapsing the dependency chain to a single source of truth.

  2. Layered connector architecture: The new stack separates concerns into (a) a Cassandra Analytics Wrapper that reads SSTables from S3 backups and produces standard Spark DataFrames, and (b) a Connector Factory where each data abstraction (Key Value, Time Series, etc.) builds model-aware connectors over those DataFrames — eliminating expensive post-processing stages.

  3. Wide-partition handling: By moving mutation compaction to the Spark executor level, the new engine handles highly skewed partitions without excessive data shuffling, resolving OOM failures that plagued Casspactor on Netflix's largest datasets.

  4. Elimination of intermediate tables: Casspactor wrote to intermediate Iceberg tables; higher-level connectors (KV, etc.) added more intermediates. The new architecture directly produces Spark DataFrames → final Iceberg output, saving millions of USD in storage costs.

  5. Time travel: The new system processes schema, cluster topology, and data as a cohesive unit at a specific point in time, enabling audit, debugging, and disaster recovery of prior states — impossible with Casspactor's multi-service metadata composition.

  6. Auto-sizing: Jobs dynamically adjust resource consumption (executors, memory) based on source table characteristics, removing manual tuning burden.

  7. Like-for-like migration contract: The migration maintained absolute consistency across user-facing interfaces, output contracts, and final data artifacts — same schema, metadata, and data in destination Iceberg tables. This transformed a distributed multi-team effort into an internal platform implementation detail.

  8. Shadow validation (Pillar 1): New connector ran in parallel ("shadow" mode) with production Casspactor jobs. Trust metric: prove C = M (row-by-row set equality between legacy and new output). Any difference triggered immediate investigation.

  9. Decider Pattern via Maestro (Pillar 3): A Decider step in the Maestro workflow invokes a Connector Controller registry to route jobs dynamically between legacy and new connectors. On new-connector failure, the workflow immediately falls back to Casspactor — zero user impact, slightly longer runtime at worst.

  10. Scale numbers: ~1,200 data movements/day; ~3 PB transferred daily from Cassandra to Iceberg; cost savings in the order of USD millions from eliminating intermediates.

Systems & Concepts Extracted

Systems

  • Casspactor (legacy) — Netflix's previous Cassandra-to-Iceberg engine
  • Netflix Data Bridge — unified management plane for batch data movement
  • Cassandra Analytics Wrapper — new core S3-reading layer built on open-source Cassandra Analytics
  • Move Data connector — new Cassandra-to-Iceberg connector atop the wrapper
  • Connector Controller — dynamic registry/control plane for connector routing
  • Maestro — Netflix's workflow orchestrator implementing the Decider Pattern
  • Apache Cassandra, Apache Iceberg, Apache Spark, Amazon S3

Concepts

  • SSTable backup-based data movement (sidecar uploads to S3)
  • Metadata dependency fragility
  • S3 as single source of truth for backup completeness
  • Connector Factory / layered connector architecture
  • Auto-sizing based on source table characteristics
  • Time travel (schema + topology + data as cohesive unit)

Patterns

  • Decider Pattern — workflow routing via control-plane registry + automatic fallback
  • Shadow validation — parallel execution for row-by-row equality proof
  • Like-for-like migration — preserve external contracts, change internals only
  • Connector Factory — shared DataFrame foundation, model-aware connectors on top
  • Fallback on new connector failure — conditional Maestro step for safety

Operational Numbers

Metric Value
Data movements/day ~1,200
Daily transfer volume ~3 PB
Cost savings USD millions (from intermediate table elimination)
Validation target 100% row-level similarity (C = M)

Caveats

  • The post does not disclose specific latency numbers for the new vs. old connector.
  • No quantified reliability improvement (error-rate reduction) is given.
  • The Cassandra Analytics open-source project is referenced but architectural details of Netflix's wrapper extensions are not fully disclosed.
  • Exact auto-sizing heuristics are not described.

Source

Last updated · 546 distilled / 1,578 read