
Apache Spark

Apache Spark is the de facto general-purpose distributed data-processing engine of the 2010s — a Scala/Java dataflow runtime with rich, safe higher-level abstractions (Spark SQL, DataFrame/Dataset APIs, MLlib, Structured Streaming) commonly run on cloud services like systems/amazon-emr or systems/aws-glue. Its defining property for this wiki is that it was the "default generalist" for large-scale data processing on cloud object stores for a decade — and, at the upper end of that curve, the substrate teams migrate off once their workload becomes painful enough to justify a specialist engine.

Role as the generalist

Where Spark stops scaling (per Amazon BDT)

Quoting from the Spark → Ray migration post:

  • "Compacting all in-scope tables in their catalog was becoming too expensive."
  • "Manual job tuning was required to successfully compact their largest tables."
  • "Compaction jobs were exceeding their expected completion times."
  • "Limited options to resolve performance issues due to Apache Spark successfully (and unfortunately in this case) abstracting away most of the low-level data processing details."

In other words: the generality that made Spark attractive becomes a ceiling when the workload has specialist structure (CDC compaction with copy-on-write over exabyte-scale Iceberg-like tables) and specialist cost (tens of millions of dollars per year).
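The cost structure of that specialist workload can be sketched in plain Python (a toy model, not BDT's actual compactor): under copy-on-write, every data file containing any row touched by a CDC delta must be rewritten in full, so work scales with the size of the files hit rather than the size of the deltas.

```python
# Toy copy-on-write CDC compaction. All names and shapes are hypothetical:
# a "file" is a dict of primary_key -> row; a delta is an (op, key, row) tuple.

def compact(files, deltas):
    """Apply CDC deltas to a list of files, copy-on-write style.

    Returns (new_files, files_rewritten). Any file containing a touched
    key is rewritten in full; untouched files pass through unchanged.
    """
    upserts = {k: row for op, k, row in deltas if op == "upsert"}
    deletes = {k for op, k, _ in deltas if op == "delete"}
    touched = upserts.keys() | deletes

    out, rewritten = [], 0
    for f in files:
        if f.keys() & touched:
            # Copy-on-write: one changed row forces a rewrite of the whole file.
            new_f = {k: upserts.get(k, row) for k, row in f.items()
                     if k not in deletes}
            rewritten += 1
        else:
            new_f = f  # untouched file is carried over as-is
        out.append(new_f)

    # Upserted keys not present in any existing file become a new insert file.
    existing = set().union(*(f.keys() for f in files)) if files else set()
    inserts = {k: v for k, v in upserts.items() if k not in existing}
    if inserts:
        out.append(inserts)
    return out, rewritten
```

Even this toy makes the scaling problem visible: a delta touching one row in each of N files forces N full-file rewrites, which is exactly where generic Spark abstractions give operators few levers to pull.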

Reliability as the upside

The other side of the "don't migrate Spark just because you can" coin: Spark is extremely reliable for this workload. In 2024, Amazon BDT reported Spark's first-time compaction job success rate at 99.91%, vs Ray's 99.15% — and Spark's lead was even larger as recently as 2023, when Ray trailed by up to 15%.

The blog's own cautionary framing:

"do these results imply that you should also start migrating all of your data processing jobs from Apache Spark to Ray? Well, probably not – not yet, at least. Data processing frameworks like Apache Spark continue to offer feature-rich abstractions for data processing that will likely continue to work well enough for day-to-day use-cases. There's also no paved road today to automatically translate your Apache Spark applications over to Ray-native equivalents."

(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Typical deployment

  • systems/amazon-emr — AWS's managed Hadoop/Spark runtime; the substrate for Amazon BDT's original compactor.
  • systems/aws-glue — AWS's serverless Spark + catalog service.
  • Databricks' own runtime (Databricks was founded by the Spark creators; see companies/databricks).
