Skip to content

CONCEPT Cited by 1 source

Vectorized query engine

Definition

A vectorized query engine is a query-execution architecture that operates on batches of rows at a time using SIMD / cache-friendly column-major loops, instead of one row at a time with per-row method dispatch (the row-at-a-time Volcano model). Vectorisation collapses the per-row overhead — function call dispatch, type coercion, branch prediction misses — to a per-batch cost amortised across thousands of rows.

Volcano vs vectorised — the model

Volcano (row-at-a-time)

for each row in source:
    for each operator in pipeline:
        operator.process(row)

Each operator's process() is a virtual method call per row. Branch prediction is poor (operators see one row at a time, no context). The CPU pipeline stalls on indirect branches.

Vectorised (batch-of-rows)

for each batch in source:                # batch size ~1024 rows
    for each operator in pipeline:
        operator.process_batch(batch)    # tight inner loop on column slabs

Each process_batch is a tight loop over a column-major slab of values. SIMD instructions (AVX2 / AVX-512 / NEON) operate on multiple values per CPU cycle. Branch prediction is good (the inner loop is predictable). The CPU pipeline runs hot.

Why it matters

  • 2-10× single-thread throughput on common analytical workloads vs Volcano-shape engines.
  • Cache-coherent. Column-major batches are contiguous in memory; L1 / L2 caches are used effectively.
  • Compiler-friendly. Loop bodies are simple enough that modern compilers auto-vectorise them; explicit SIMD intrinsics are an optimisation, not a requirement.
  • Composes with columnar storage. Parquet / Delta / Iceberg files are already column-major; vectorised execution reads native column slabs without row-major materialisation.

Canonical instances on the wiki

  • systems/photon — Databricks' C++-native vectorised engine for Apache Spark. The VF Match FDR pipeline disclosed Photon collapsing a Spark-on-Splink partition-skew straggler from 30 minutes to ~2 minutes — 15× improvement at the worst-case partition.
  • systems/duckdb — single-machine vectorised analytical database; a reference implementation of the vectorised engine shape (originally Hyper / MonetDB lineage).
  • Velox — Meta's open-source vectorised execution library; used as the Photon-equivalent in Presto.
  • ClickHouse — vectorised columnar OLAP database from the ground up.

Why ER / record-linkage workloads benefit specifically

The VF Match canonicalisation shows vectorised execution helping a non-OLAP workload (entity resolution via Splink pairwise comparison). Two structural reasons:

  1. Pairwise string / numeric comparisons are branch-predictable, SIMD-friendly inner loops when operating on column-major batches of candidate pairs.
  2. Match-weight aggregation (Fellegi-Sunter: W = Σ_f w_f) is a column-major reduce — sum per-column log-Bayes-factor across rows. A textbook SIMD reduce.

The 30 min → 2 min observation is a step-change because eliminating JVM per-row dispatch overhead on the hot partition was the dominant cost. Vectorisation doesn't change the partition's size; it changes the per-record work cost so that even a skewed-fat partition finishes in tractable time.

What vectorised engines don't fix

  • Skew itself. Vectorised execution makes each partition faster, but does not redistribute data. A partition with 35× the median work still has 35× the work — just at a faster constant.
  • UDF performance. Python / arbitrary-language UDFs typically fall back to row-at-a-time execution, breaking the vectorised pipeline. Common engineering smell: a Photon-eligible plan silently switches to JVM mode at a Python UDF.
  • I/O-bound workloads. When the bottleneck is reading from object storage, vectorisation barely helps.
  • Small-data workloads. The batch overhead is not free; for small queries (<1k rows) Volcano is comparable.

Failure modes

  • Hidden Volcano fallback. Operators not yet ported to the vectorised path silently fall back; the user expects vectorised speed and gets Volcano speed. Mitigation: explainplan inspection.
  • Memory shape divergence. Vectorised batches have different memory characteristics; very wide schemas with many sparse columns can hit memory limits where Volcano didn't.
  • Cost / billing surprise. Photon-enabled clusters bill at a higher DBU rate; for non-Photon-amenable workloads the higher rate is paid without speedup.

Seen in

Last updated · 542 distilled / 1,571 read