SYSTEM Cited by 10 sources
Apache Spark¶
Apache Spark is the de-facto general-purpose distributed data-processing engine of the 2010s — a Scala/Java dataflow runtime with rich, safe higher-level abstractions (Spark SQL, DataFrame/Dataset APIs, MLlib, Structured Streaming) commonly run on cloud services like systems/amazon-emr or AWS Glue. Its defining property for this wiki is that it was the "default generalist" for large-scale data processing on cloud object stores for a decade — and, at the upper end of that curve, the substrate teams migrate off once their workload becomes painful enough to justify a specialist engine.
Role as the generalist¶
- Rich, safe, high-level abstractions for distributed data processing.
- Runs well on HDFS or cloud object stores like systems/aws-s3.
- Hosts open table formats: systems/apache-iceberg / systems/apache-hudi / systems/delta-lake all have Spark writers/readers as first-class.
- Dominant engine for batch analytics, CDC merge (concepts/copy-on-write-merge), and ELT over cloud warehouses.
Where Spark stops scaling (per Amazon BDT)¶
Quoting from the Spark → Ray migration post:
- "Compacting all in-scope tables in their catalog was becoming too expensive."
- "Manual job tuning was required to successfully compact their largest tables."
- "Compaction jobs were exceeding their expected completion times."
- "Limited options to resolve performance issues due to Apache Spark successfully (and unfortunately in this case) abstracting away most of the low-level data processing details."
In other words: the generality that made Spark attractive becomes a ceiling when the workload has specialist structure (CDC compaction with copy-on-write over exabyte-scale Iceberg-like tables) and specialist cost (tens of millions of dollars per year).
Reliability as the upside¶
The other side of the "don't migrate Spark just because you can" coin: Spark is extremely reliable for this workload. In 2024, Amazon BDT reports Spark's first-time compaction job success rate at 99.91%, vs Ray's 99.15% — and Spark's reliability was even more ahead as recently as 2023 (Ray trailed by up to 15%).
The blog's own cautionary framing:
"do these results imply that you should also start migrating all of your data processing jobs from Apache Spark to Ray? Well, probably not – not yet, at least. Data processing frameworks like Apache Spark continue to offer feature-rich abstractions for data processing that will likely continue to work well enough for day-to-day use-cases. There's also no paved road today to automatically translate your Apache Spark applications over to Ray-native equivalents."
(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
Typical deployment¶
- systems/amazon-emr — AWS's managed Hadoop/Spark runtime; the substrate for Amazon BDT's original compactor.
- systems/aws-glue — AWS's serverless Spark + catalog service.
- Databricks' own runtime (Databricks was founded by the Spark creators; see companies/databricks).
Seen in¶
- sources/2026-05-29-databricks-databricks-at-sigmod-2026 —
Spark's academic-publication face. Databricks attends SIGMOD
2026 as Platinum Sponsor with two first-author Spark-substrate
papers: SIGMOD 2026 honorable-mention "Enzyme: Incremental View
Maintenance for Data Engineering" (arXiv:2603.27775,
the IVM engine behind SDP's
@dp.materialized_view) and VLDB 2026 "A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs" (Structured Streaming decade-retrospective). First wiki disclosure that modern Databricks Spark contributions are being published as named academic artefacts at the top systems venues — distinct from the prior on-blog-only mode. Establishes the two-track incremental-processing architecture beneath SDP: Enzyme on the MV track + Structured Streaming on the explicit-streaming track, mix-and-match in one pipeline. -
sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction — Trust-the-optimiser-as-an-architectural-move face. The Octopus Energy MHHS margin-pipeline rebuild names Spark Adaptive Query Execution (AQE) as the optimiser that outperformed hand-tuned logic — "In several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job." The team's action item was to delete code, not write more. First wiki canonicalisation of AQE as a system actor on its own page (previously only appeared in tag lists). Pairs with concepts/remove-before-add-optimization as the generalised measurement-required principle. Other Spark/Delta optimisations in the same rebuild: patterns/broadcast-join-for-small-reference-tables (under 500 MB threshold disclosed) and systems/liquid-clustering on filter/join columns. Embedded in systems/octopus-margin-data-pipeline which is itself the canonical instance of patterns/grain-aligned-stream-split + patterns/cdf-incremental-replacing-full-rescan.
-
sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance — Spark Connect as driver-architectural rearchitecture face. Databricks frames Spark Connect as "the most significant architectural transformation in Spark's history" — the gRPC client-server split that replaces the classical monolithic driver model, decoupling user application code from the Spark driver process and changing "the unit of execution from application processes to queries". This is the substrate for Databricks Serverless Compute — the platform-managed Spark operating model that delivers "more than 25 major Spark runtime upgrades per year with a 99.998% success rate across more than 2 billion workloads". First wiki canonicalisation of Spark's client-server decoupling shift. Paired with workload-aware gateway routing (systems/databricks-serverless-gateway) and adaptive autoscaling (systems/databricks-serverless-autoscaler) with OOM-aware VM restart, this is Spark's post-monolithic production architecture at Databricks scale.
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — Structured Streaming as observability-ingestion substrate face. Hydra uses Spark Structured Streaming
- Databricks Auto Loader as the continuous-ingestion layer that writes 20 billion unaggregated active timeseries into Delta Lake. Structured Streaming's "streaming computations the same way you write batch jobs ... with continuous, incremental processing and exactly-once semantics for reliable ingestion" property is load-bearing — observability ingestion must be lossless to be trusted for incident triage. Ingestion is per-region-partitioned — "independent streaming jobs across geographies [which] enables each pipeline to autoscale independently, minimizes cross-region latency, and reduces blast radius in case of failures." Canonical instance of concepts/lakehouse-native-observability.
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Spark on systems/amazon-emr as the "compactor" for Amazon Retail's exabyte-scale tables (2019+); eventually displaced by systems/ray for the largest ~1% of tables that accounted for ~40% of Spark compaction cost and the majority of failures; Ray delivered 82% better cost efficiency with Spark still ahead on reliability (99.91% vs 99.15% in 2024).
-
sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution — SparkSQL as the feature-transformation language of Lyft's Feature Store. Customers define each batch feature with a SparkSQL query + JSON config; a Python cron service auto-generates an Airflow DAG per config that executes the SparkSQL over Hive tables and writes output to both offline (Hive) and online (
dsfeatures) paths. Concrete instance of Spark-as-feature-engineering-engine in an ML platform. Justification named in the post: "our core personas are particularly proficient in SQL and place a high value on quick iteration." -
sources/2024-08-31-meta-enforces-purpose-limitation-via-privacy-aware-infrastructure — Spark is named as a batch-processing integration target for Meta's Policy Zones IFC runtime, alongside Presto: "Batch-processing systems that process data rows in batch (mainly via SQL). Examples include real-time and data warehouse systems that power Meta's AI and analytics workloads." A Policy Zone is created per Spark job and annotations are evaluated at table/column/row granularity; violations follow the logging-mode → enforcement-mode rollout. First wiki instance of Spark-as-IFC-enforcement-substrate.
-
— Spark as the A/B-test analysis-system substrate at Zalando. The original Octopus analysis system (systems/octopus-zalando-experimentation-platform) hit scalability limits at concurrent-A/B-test load from architectural constraints; Zalando spent ~2 years rebuilding the analysis system on Spark. First wiki instance of Spark-as-experimentation- platform-backbone; complements the Lyft Spark-as-feature-store altitude with an analytics-pipeline-at-the-end-of-an-experiment altitude.
-
sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive — Spark / PySpark as the horizontal tier of Zalando ZEOS's inventory-optimisation feature engineering. Runs on Databricks transient job clusters writing to Delta Lake for the pre-processing layer of both the demand forecaster and the replenishment recommender pipelines. Canonical instance of the two-tier PySpark→SageMaker-Processing-Job feature-engineering split (see concepts/data-preprocessing-vs-data-transformation-split and patterns/pyspark-preprocessing-to-python-transformation-split). Zalando explicitly names Spark's scaling property verbatim: "PySpark enables horizontal scalability in the number of worker nodes as data volume grows."
-
sources/2026-05-14-databricks-expanded-interoperability-with-unity-catalog-open-apis — External-engine writer to UC Managed Tables face. Spark named (alongside Flink and DuckDB) as one of three external engines that "can create and write to UC managed Delta tables with centralized governance and automatic optimizations" in the 2026-05-14 Beta. Integrates via Delta Kernel; auth via UC Credential Vending (M2M OAuth + auto-refresh); commits via UC catalog commits. Version pinning: Delta-Spark 4.2 + Unity Catalog 0.4.1. Streaming- source-and-sink shape disclosed: "managed tables as both a streaming source and sink, enabling end-to-end real-time pipelines on Apache Spark."
-
sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — LLM-extraction-pipeline + entity-resolution face. VF Match Foundational Data Refresh runs Spark across two distinct altitudes in one pipeline: (1) LLM-driven information extraction over 25M+ web pages orchestrated by Lakeflow Jobs, processing classification / org-type / specialty extraction across 15+ interdependent tasks; (2) Splink probabilistic record linkage over ~thousands of facility / NGO records, where Spark's pairwise- comparison core hit the canonical curse-of-the-last-reducer straggler (30 minutes worst-case partition vs 52-second median — ~35× ratio). Mitigation was vectorisation rather than redistribution: enabling Photon cut the worst-case partition by 15× to ~2 minutes. The numbers canonicalise Photon at non-OLAP altitude (entity resolution, not BI queries) — first wiki-quantified instance.
Related¶
- systems/ray — the specialist replacement for Spark in the BDT compactor migration.
- systems/amazon-emr — Spark's canonical managed AWS substrate.
- systems/apache-iceberg, systems/apache-hudi, systems/delta-lake — open table formats Spark writes to natively.
- systems/delta-kernel — protocol-abstraction library for external-engine UC integration.
- systems/uc-managed-tables — managed Delta tables Spark can now externally write to (2026-05-14 Beta).
- systems/uc-credential-vending — auth substrate for external Spark writes.
- concepts/external-engine-write-to-managed-table — architectural shape Spark participates in.