Apache Spark¶
Apache Spark is the de facto general-purpose distributed data-processing engine of the 2010s: a Scala/Java dataflow runtime with rich, safe, higher-level abstractions (Spark SQL, the DataFrame/Dataset APIs, MLlib, Structured Streaming), commonly run on managed cloud services like systems/amazon-emr or systems/aws-glue. Its defining property for this wiki is that it was the "default generalist" for large-scale data processing on cloud object stores for a decade, and, at the upper end of that curve, the substrate that teams migrate off once their workload becomes painful enough to justify a specialist engine.
Role as the generalist¶
- Rich, safe, high-level abstractions for distributed data processing.
- Runs well on HDFS or cloud object stores like systems/aws-s3.
- Hosts open table formats: systems/apache-iceberg, systems/apache-hudi, and systems/delta-lake all treat their Spark readers/writers as first-class.
- Dominant engine for batch analytics, CDC merge (concepts/copy-on-write-merge), and ELT over cloud warehouses.
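The copy-on-write merge referenced above (concepts/copy-on-write-merge) is the strategy that later dominates the compaction story, so it is worth sketching. The snippet below is a minimal plain-Python illustration of CoW semantics, not any real Spark or table-format API; all names are invented for the example. The key property: applying even a one-row CDC delta forces a full rewrite of every data file that contains an affected key.

```python
# Illustrative sketch of copy-on-write (CoW) merge semantics, as used by
# open table formats when an engine like Spark applies CDC deltas.
# All names here are hypothetical; this is not a real API.

def copy_on_write_merge(data_files, deltas):
    """data_files: dict of file_id -> {primary_key: row}.
    deltas: dict of primary_key -> new row, or None for a delete.
    Returns a new data_files dict: any file holding an affected key
    is rewritten in full; untouched files are carried over as-is."""
    new_files = {}
    remaining = dict(deltas)
    for file_id, rows in data_files.items():
        if any(pk in remaining for pk in rows):
            # CoW: rewrite the whole file even if only one row changed.
            merged = dict(rows)
            for pk in list(remaining):
                if pk in rows:
                    new_row = remaining.pop(pk)
                    if new_row is None:
                        merged.pop(pk)       # delete
                    else:
                        merged[pk] = new_row  # update
            new_files[file_id + "_v2"] = merged
        else:
            new_files[file_id] = rows
    # Keys not found in any existing file are inserts: write a fresh file.
    inserts = {pk: row for pk, row in remaining.items() if row is not None}
    if inserts:
        new_files["new_file"] = inserts
    return new_files
```

Under this model, write amplification grows with how widely deltas scatter across files, which is exactly why exabyte-scale CoW compaction becomes its own engineering problem.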
Where Spark stops scaling (per Amazon BDT)¶
Quoting from the Spark → Ray migration post:
- "Compacting all in-scope tables in their catalog was becoming too expensive."
- "Manual job tuning was required to successfully compact their largest tables."
- "Compaction jobs were exceeding their expected completion times."
- "Limited options to resolve performance issues due to Apache Spark successfully (and unfortunately in this case) abstracting away most of the low-level data processing details."
In other words: the generality that made Spark attractive becomes a ceiling when the workload has specialist structure (CDC compaction with copy-on-write over exabyte-scale Iceberg-like tables) and specialist cost (tens of millions of dollars per year).
Reliability as the upside¶
The other side of the "don't migrate off Spark just because you can" coin: Spark is extremely reliable for this workload. For 2024, Amazon BDT reported Spark's first-time compaction job success rate as 99.91%, vs. Ray's 99.15%, and Spark's lead was even wider as recently as 2023, when Ray trailed by up to 15%.
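The 2024 success rates look nearly identical at a glance, but the gap is clearer when restated as failure rates (simple arithmetic on the figures above):

```python
# Restating the reported 2024 first-time success rates as failure rates.
spark_success = 0.9991   # Spark first-time compaction success rate (2024)
ray_success = 0.9915     # Ray first-time compaction success rate (2024)

spark_fail = 1 - spark_success   # ~0.09% of jobs fail on the first attempt
ray_fail = 1 - ray_success       # ~0.85% of jobs fail on the first attempt
ratio = ray_fail / spark_fail    # Ray fails roughly 9x as often first-time
```

At exabyte scale a first-attempt failure means a retry of a very large job, so a roughly 9x difference in failure rate is material even when both success rates round to "99%+".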
The blog's own cautionary framing:
"do these results imply that you should also start migrating all of your data processing jobs from Apache Spark to Ray? Well, probably not – not yet, at least. Data processing frameworks like Apache Spark continue to offer feature-rich abstractions for data processing that will likely continue to work well enough for day-to-day use-cases. There's also no paved road today to automatically translate your Apache Spark applications over to Ray-native equivalents."
(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
Typical deployment¶
- systems/amazon-emr — AWS's managed Hadoop/Spark runtime; the substrate for Amazon BDT's original compactor.
- systems/aws-glue — AWS's serverless Spark + catalog service.
- Databricks' own runtime (Databricks was founded by the Spark creators; see companies/databricks).
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Spark on systems/amazon-emr as the "compactor" for Amazon Retail's exabyte-scale tables (2019+); eventually displaced by systems/ray for the largest ~1% of tables that accounted for ~40% of Spark compaction cost and the majority of failures; Ray delivered 82% better cost efficiency with Spark still ahead on reliability (99.91% vs 99.15% in 2024).
Related¶
- systems/ray — the specialist replacement for Spark in the BDT compactor migration.
- systems/amazon-emr — Spark's canonical managed AWS substrate.
- systems/apache-iceberg, systems/apache-hudi, systems/delta-lake — open table formats Spark writes to natively.