CONCEPT

Nested-loop join¶

Definition¶

A nested-loop join (NL join) is the join algorithm that iterates over the left-hand side (LHS) rows one at a time and, for each LHS row, issues a lookup against the right-hand side (RHS) using the row's join-key value as a bound parameter. It is the conceptually simplest join algorithm:

for each row L in LHS:
    for each row R in RHS where R.join_key = L.join_key:
        emit (L, R)

When the inner loop is supported by an index on R.join_key, each lookup is O(log N) on a B-tree, making the total NL join cost O(|LHS| × log |RHS|). Without such an index, it degrades to O(|LHS| × |RHS|) — the full cartesian product.

NL join vs other join algorithms¶

Single-node database engines have three classical join algorithms:

Nested-loop join — iterative; best when LHS is small and RHS has an index on the join key.
Sort-merge join — sort both sides by join key + merge; best when both sides are already sorted or can be sorted cheaply.
Hash join — build a hash table on the smaller side, probe with the larger side; best when both sides are large and no join-key index exists.

MySQL's single-node planner picks among these based on cardinalities and index availability. The planner prefers NL joins when there's a usable index on the inner side's join key.

NL join in sharded / distributed systems¶

The VTGate query planner uses cross-shard NL joins as the default strategy when the two sides of a join cannot be colocated on the same shard:

"The join is a nested loop join, which means that we'll execute the query on the left-hand side (LHS) of the Join, and using that result we'll issue queries on the right-hand side (RHS), one query per row." (Source: )

Each LHS row triggers one RHS query over the network to the appropriate shard. The N+1 queries shape is the canonical cost: for an LHS of size N, the join issues N RHS queries.

Why NL is the canonical cross-shard join¶

Alternatives are harder:

Cross-shard sort-merge requires ordered scans of both sides, which for a sharded table means scatter + per-shard sort + merge at the coordinator — expensive for large tables, and unavailable for queries with arbitrary WHERE clauses.
Cross-shard hash join requires either (a) materialising one side's build table at the coordinator (memory pressure) or (b) a distributed hash-build + partitioned probe protocol (complex, expensive network shuffle).
NL join requires only per-row RHS queries, each of which is a normal single-shard lookup. The network cost is linear in LHS size × per-query RTT.

For workloads where LHS is small (e.g. point-lookup-style), cross-shard NL join is efficient. For workloads where LHS is large, cross-shard NL join's N+1-query shape becomes a latency and QPS bottleneck — this is where planner rewrites like aggregation-pushdown-under-join pay off: pre-aggregation reduces LHS cardinality by the group-collapsing factor, reducing the RHS-query count proportionally.

The N+1 problem reframed¶

Application-tier ORM anti-patterns often produce the same shape: a parent-table query followed by one child-table query per parent row. The database community has deprecated this as the N+1 problem. Cross-shard NL joins are structurally identical — one parent query plus N child queries — but happen inside the query engine rather than at the application tier. The mitigation is different (planner rewrites that batch the RHS queries, colocate the tables, or pre-aggregate to reduce N) but the cost structure is the same.

Cost dominants¶

Per-RHS-query network RTT — cross-shard: ~1-10 ms; single-node: microseconds.
Per-RHS-query parse + plan + execute — cross-shard: an LHS with 1M rows triggers 1M MySQL query-parse events.
Total wall-clock time — serial NL: N × per-query latency; parallel NL: N × (per-query latency / concurrency) bounded by RHS shard capacity.

Vitess supports parallel RHS query issuance within the NL join (up to a configured concurrency), which helps latency but not total QPS on the RHS shards.

When NL is the right shape¶

Cross-shard NL is the right answer when:

LHS is small (e.g. thousands of rows, not millions).
RHS has an index on the join key (cheap per-lookup).
The two sides cannot be colocated (shard keys don't align).
Pre-aggregation via concepts/push-aggregation-under-join can collapse LHS sufficiently.

Cross-shard NL is the wrong answer when:

LHS is large and cannot be reduced via pre-aggregation.
RHS lacks a join-key index.
The join is many-to-many (cartesian-product risk).

In those cases, the query either needs to be rewritten (denormalise, colocate, batch application-side) or accepted as expensive.

Seen in¶

— Andres Taylor (2022-06-24) canonicalises the VTGate NL join as the cross-shard-join default: "we'll execute the query on the left-hand side (LHS) of the Join, and using that result we'll issue queries on the right-hand side (RHS), one query per row." The post's aggregation-pushdown-under-join rewrite is specifically motivated by reducing the LHS cardinality that drives NL join cost.

concepts/cross-shard-query — the regime where NL joins become expensive.
concepts/scatter-gather-query — the alternative query shape when no join is needed.
concepts/vtgate-query-planner — the component that picks NL as the cross-shard join shape.
concepts/push-aggregation-under-join — the rewrite that reduces NL join cost.
patterns/aggregation-pushdown-under-join — the production pattern using this rewrite.
systems/vitess
systems/mysql