CONCEPT Cited by 1 source
Vectorized query engine¶
Definition¶
A vectorized query engine is a query-execution architecture that operates on batches of rows at a time using SIMD / cache-friendly column-major loops, instead of one row at a time with per-row method dispatch (the row-at-a-time Volcano model). Vectorisation collapses the per-row overhead — function call dispatch, type coercion, branch prediction misses — to a per-batch cost amortised across thousands of rows.
Volcano vs vectorised — the model¶
Volcano (row-at-a-time)¶
Each operator's process() is a virtual method call per row.
Branch prediction is poor (operators see one row at a time, no
context). The CPU pipeline stalls on indirect branches.
Vectorised (batch-of-rows)¶
for each batch in source: # batch size ~1024 rows
for each operator in pipeline:
operator.process_batch(batch) # tight inner loop on column slabs
Each process_batch is a tight loop over a column-major slab of
values. SIMD instructions (AVX2 / AVX-512 / NEON) operate on
multiple values per CPU cycle. Branch prediction is good (the
inner loop is predictable). The CPU pipeline runs hot.
Why it matters¶
- 2-10× single-thread throughput on common analytical workloads vs Volcano-shape engines.
- Cache-coherent. Column-major batches are contiguous in memory; L1 / L2 caches are used effectively.
- Compiler-friendly. Loop bodies are simple enough that modern compilers auto-vectorise them; explicit SIMD intrinsics are an optimisation, not a requirement.
- Composes with columnar storage. Parquet / Delta / Iceberg files are already column-major; vectorised execution reads native column slabs without row-major materialisation.
Canonical instances on the wiki¶
- systems/photon — Databricks' C++-native vectorised engine for Apache Spark. The VF Match FDR pipeline disclosed Photon collapsing a Spark-on-Splink partition-skew straggler from 30 minutes to ~2 minutes — 15× improvement at the worst-case partition.
- systems/duckdb — single-machine vectorised analytical database; a reference implementation of the vectorised engine shape (originally Hyper / MonetDB lineage).
- Velox — Meta's open-source vectorised execution library; used as the Photon-equivalent in Presto.
- ClickHouse — vectorised columnar OLAP database from the ground up.
Why ER / record-linkage workloads benefit specifically¶
The VF Match canonicalisation shows vectorised execution helping a non-OLAP workload (entity resolution via Splink pairwise comparison). Two structural reasons:
- Pairwise string / numeric comparisons are branch-predictable, SIMD-friendly inner loops when operating on column-major batches of candidate pairs.
- Match-weight aggregation (Fellegi-Sunter:
W = Σ_f w_f) is a column-major reduce — sum per-column log-Bayes-factor across rows. A textbook SIMD reduce.
The 30 min → 2 min observation is a step-change because eliminating JVM per-row dispatch overhead on the hot partition was the dominant cost. Vectorisation doesn't change the partition's size; it changes the per-record work cost so that even a skewed-fat partition finishes in tractable time.
What vectorised engines don't fix¶
- Skew itself. Vectorised execution makes each partition faster, but does not redistribute data. A partition with 35× the median work still has 35× the work — just at a faster constant.
- UDF performance. Python / arbitrary-language UDFs typically fall back to row-at-a-time execution, breaking the vectorised pipeline. Common engineering smell: a Photon-eligible plan silently switches to JVM mode at a Python UDF.
- I/O-bound workloads. When the bottleneck is reading from object storage, vectorisation barely helps.
- Small-data workloads. The batch overhead is not free; for small queries (<1k rows) Volcano is comparable.
Failure modes¶
- Hidden Volcano fallback. Operators not yet ported to the vectorised path silently fall back; the user expects vectorised speed and gets Volcano speed. Mitigation: explainplan inspection.
- Memory shape divergence. Vectorised batches have different memory characteristics; very wide schemas with many sparse columns can hit memory limits where Volcano didn't.
- Cost / billing surprise. Photon-enabled clusters bill at a higher DBU rate; for non-Photon-amenable workloads the higher rate is paid without speedup.
Seen in¶
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — canonical wiki source. VF Match FDR canonicalises vectorised execution at non-OLAP altitude (entity-resolution pairwise comparison on Splink), with a quantified 15× improvement on worst-case partition latency from enabling Photon.
Related¶
- systems/photon — Databricks' instance.
- systems/duckdb — single-machine reference instance.
- systems/apache-spark — the host engine Photon vectorises.
- concepts/columnar-storage — the in-memory representation vectorised execution exploits.
- concepts/curse-of-the-last-reducer — the failure mode vectorised execution canonically remediates by collapsing per-record work cost.
- concepts/partition-skew-data-skew — adjacent failure mode at the data-distribution altitude.