Photon¶
Photon is Databricks' C++-native vectorised query engine for Apache Spark. It transparently accelerates eligible Spark SQL and DataFrame operations by replacing the JVM execution path with a SIMD-vectorised, columnar execution path written in C++. The engine is switched on or off via a runtime flag, and the API surface is unchanged.
Stub page. Photon is mentioned in passing across many Databricks ingests; this first dedicated wiki page was created from the 2026-05-20 Virtue Foundation source, where Photon is named as the load-bearing remediation for an entity-resolution straggler problem (30 min → 2 min worst-case partition, 15× improvement).
Architectural shape¶
- Vectorised execution. Photon operates on batches of rows (typically thousands at a time) using SIMD instructions, so a single instruction processes several values of a column at once within a core instead of one row at a time. This maps cleanly onto a columnar in-memory representation: each batch is a column-major slab of values, processed by tight inner loops without per-row method-dispatch overhead.
- C++-native. Bypasses the JVM's per-row object materialisation cost. The Photon execution path is native C++ code that exchanges data with the Spark JVM through off-heap memory and native interop.
- Transparent integration with Spark. Photon registers as an alternative physical operator implementation; queries are planned by Spark's Catalyst optimiser, then Photon-eligible operators are swapped for the C++ implementation. Photon-ineligible operators fall back to JVM Spark — no all-or-nothing decision.
- Runtime-toggleable. The engine is on/off via cluster configuration; customers can A/B compare on the same workload without code changes.
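The operator-level swap and fallback described above can be illustrated with a toy planner. This is a minimal sketch, not Photon's actual planning code: `PhysicalOp`, `swap_eligible`, and the `NATIVE_ELIGIBLE` set are hypothetical names invented for illustration, and the eligibility list is made up.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical set of operators that have a native implementation.
# This is NOT Photon's real eligibility list.
NATIVE_ELIGIBLE = {"scan", "filter", "hash_join", "hash_aggregate"}


@dataclass(frozen=True)
class PhysicalOp:
    name: str             # operator kind chosen by the optimiser
    engine: str = "jvm"   # which implementation will execute it


def swap_eligible(plan: List[PhysicalOp], native_enabled: bool) -> List[PhysicalOp]:
    """Move each eligible operator to the native engine; leave the rest on the JVM."""
    if not native_enabled:
        # Engine toggled off: the plan is returned unchanged.
        return list(plan)
    return [
        PhysicalOp(op.name, "native") if op.name in NATIVE_ELIGIBLE else op
        for op in plan
    ]


plan = [PhysicalOp("scan"), PhysicalOp("python_udf"), PhysicalOp("hash_aggregate")]
engines_on = [op.engine for op in swap_eligible(plan, native_enabled=True)]
engines_off = [op.engine for op in swap_eligible(plan, native_enabled=False)]
```

With the toggle on, the hypothetical `python_udf` operator stays on the JVM while its neighbours move to the native engine; with the toggle off, the same plan runs entirely on the JVM. A real planner also has to account for the cost of converting data between the two engines at each transition, which this toy ignores.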
Why it matters for skew-bound workloads¶
The 2026-05-20 Virtue Foundation source canonicalises a non-OLAP use of Photon: absorbing pairwise-comparison partition skew in Splink entity resolution. The straggler shape ("one Spark partition running for 30 minutes while the median completed in 52 seconds — a textbook case of stragglers (the 'curse of the last reducer')") collapses by 15× with Photon enabled because:
- Pairwise string / numeric comparisons (the inner loop of record-linkage scoring) are branch-predictable, SIMD-friendly, and cache-friendly when operating on column-major batches. The JVM's per-row object dispatch overhead is the dominant cost on the hot partition; eliminating it is a step-change.
- Match-weight aggregation sums the per-column log-Bayes-factors for each candidate pair. Over a column-major slab this becomes element-wise addition of column vectors across thousands of pairs at a time, the textbook vectorised-engine win.
- Photon doesn't make the partition smaller — it makes the partition process faster. A 35× tail-amplification ratio becomes manageable not because the skew goes away but because the per-record work cost drops.
This is a cleaner-than-OLAP demonstration of vectorised execution's value: ER-shape pairwise scoring is a stereotypical batch-of-rows inner loop where every per-row call has SIMD-amenable arithmetic inside it.
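The difference in access pattern can be sketched in a few lines. This example assumes the usual Fellegi-Sunter form, in which a candidate pair's match weight is the sum of per-column log2 Bayes factors; the Bayes-factor values and function names are hypothetical. Pure Python performs no SIMD, so the sketch only shows the loop shape a vectorised engine exploits, not the speed-up itself.

```python
import math
from typing import Dict, List

# Hypothetical Bayes factors per comparison column: (if agree, if disagree).
BAYES_FACTORS = {"name": (8.0, 0.25), "dob": (16.0, 0.5)}


def score_row_at_a_time(pairs: List[Dict[str, bool]]) -> List[float]:
    """Row-oriented: one candidate pair per iteration, one lookup per field per pair."""
    weights = []
    for pair in pairs:
        w = 0.0
        for col, agrees in pair.items():
            agree_bf, disagree_bf = BAYES_FACTORS[col]
            w += math.log2(agree_bf if agrees else disagree_bf)
        weights.append(w)
    return weights


def score_column_at_a_time(batch: Dict[str, List[bool]], n: int) -> List[float]:
    """Column-oriented: per-column constants are hoisted, then one tight pass per column."""
    weights = [0.0] * n
    for col, agreements in batch.items():
        agree_w, disagree_w = (math.log2(bf) for bf in BAYES_FACTORS[col])
        for i, agrees in enumerate(agreements):
            weights[i] += agree_w if agrees else disagree_w
    return weights


rows = [
    {"name": True, "dob": True},
    {"name": True, "dob": False},
    {"name": False, "dob": False},
]
# Pivot the same data into a column-major batch.
batch = {col: [r[col] for r in rows] for col in BAYES_FACTORS}

row_scores = score_row_at_a_time(rows)
col_scores = score_column_at_a_time(batch, len(rows))
```

Both functions produce the same match weights. The column-oriented version keeps the inner loop free of per-pair lookups and branches on field names, which is the shape a native engine can compile down to SIMD instructions over contiguous memory.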
Operational notes¶
- Not all operators are Photon-eligible. UDFs (especially Python UDFs), some complex types, and certain join shapes fall back to JVM Spark. Workloads dominated by non-Photon-eligible operators see less benefit.
- Cost / billing implication. Photon-enabled clusters bill at a higher DBU rate, so the trade-off is the higher rate against the wall-clock reduction. For skew-bound workloads, where the wall-clock floor is the worst-case partition, Photon is often net-cheaper despite the rate.
- Memory shape differs. Photon's columnar memory layout has different memory-pressure characteristics than JVM Spark; very wide schemas with many sparse columns can show a different memory footprint than they do on JVM Spark.
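The cost trade-off reduces to simple arithmetic if one assumes that cost is proportional to DBU rate times wall-clock time and that the stage's wall-clock time is set by its worst-case partition. The sketch below uses the 30-minute and 2-minute figures from the Virtue Foundation source; the rate multiplier is a hypothetical value chosen for illustration, not a Databricks price.

```python
def relative_cost(baseline_seconds: float, photon_seconds: float,
                  rate_multiplier: float) -> float:
    """Photon cost divided by baseline cost, assuming cost = DBU rate x wall-clock time."""
    return rate_multiplier * photon_seconds / baseline_seconds


# From the source: the worst-case partition went from 30 minutes to about 2 minutes.
baseline_s = 30 * 60
photon_s = 2 * 60
speedup = baseline_s / photon_s

# Hypothetical DBU rate multiplier, for illustration only; not a Databricks price.
ASSUMED_RATE_MULTIPLIER = 2.0

ratio = relative_cost(baseline_s, photon_s, ASSUMED_RATE_MULTIPLIER)
# Photon is cheaper exactly when the speed-up exceeds the rate multiplier.
photon_is_cheaper = ratio < 1.0
```

Under these assumptions Photon is net-cheaper whenever the wall-clock speed-up exceeds the rate multiplier. Workloads that are not bound by a single straggler will see a smaller end-to-end speed-up than the worst-case partition figure, so the break-even should be checked against measured job duration.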
Seen in¶
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — Canonical wiki source. Photon as remediation for ER partition-skew on Splink: "Enabling Photon, Databricks' vectorized query engine, reduced worst-case data partitions from 30 minutes to approximately 2 minutes: a 15x improvement." First wiki-quantified instance of Photon at non-OLAP altitude.
Related¶
- concepts/vectorized-query-engine — the engine class Photon belongs to (DuckDB, Velox, ClickHouse, Photon).
- systems/apache-spark — the host engine Photon accelerates.
- systems/databricks — vendor / packaging.
- systems/delta-lake — the table format Photon most commonly reads / writes.
- concepts/partition-skew-data-skew / concepts/curse-of-the-last-reducer: the failure modes whose cost Photon mitigates at reduce-stage altitude; it speeds up the skewed partition rather than removing the skew.
- concepts/columnar-storage — the in-memory representation vectorised execution exploits.
- systems/splink — the ER framework whose pairwise-comparison inner loop benefits 15× from Photon vectorisation in the VF Match FDR pipeline.
- systems/duckdb — sibling vectorised engine, single-machine scope rather than cluster.