CONCEPT Cited by 1 source
Curse of the last reducer¶
Definition¶
The curse of the last reducer is the canonical name for the failure mode in batch (MapReduce / Spark / Flink batch) jobs where the slowest partition / reducer dominates wall-clock job latency. In a job where partitions complete in highly variable times, the job finishes only when the slowest partition finishes — every other worker sits idle waiting.
The VF Match FDR pipeline canonicalises the framing verbatim:
"The core of record linkage is pairwise comparison, which creates inherently skewed workloads: common comparisons produce massive partitions while most others remain much smaller. Early runs made this painfully clear, with one Spark partition running for 30 minutes while the median completed in 52 seconds — a textbook case of stragglers (the 'curse of the last reducer') degrading job performance."
The math¶
If partitions complete in times T_1 ≤ T_2 ≤ ... ≤ T_N, then:
- Total work =
Σ T_i. - Wall clock =
T_N(the maximum). - Useful CPU utilisation =
(Σ T_i) / (N · T_N).
When skew is high (a 35× ratio between max and median, as in the VF Match observation), useful CPU utilisation drops dramatically — the rest of the cluster is idle while the straggler runs.
A median-completing-in-52s vs max-30min job has roughly:
- N · 52 seconds of evenly-distributed work.
- 30 · 60 = 1800 seconds wall clock.
- (N · 52) / (N · 1800) = 52/1800 ≈ 2.9% cluster CPU utilisation
during the straggler tail.
The cluster's effective parallelism collapses; you've paid for N workers and you're using one.
Where it comes from¶
- Skewed data distributions. Some keys (cities, surnames, null values) appear in disproportionately many records; their partitions become disproportionately heavy.
- Pairwise comparison workloads — entity resolution, joins, deduplication. The cardinality of pairs within a block can be quadratic in the block size; one disproportionately-large block produces a disproportionately-long partition.
- Self-joins / fan-out joins. A join key with one highly-frequent value generates a Cartesian product partition.
- Skewed partitioner functions.
hash(key) mod Nwith non-uniform key distribution produces hot partitions.
Distinguishing from sibling concepts¶
- concepts/partition-skew-data-skew — the broader concept of uneven distribution across partitions; curse of the last reducer is the specific operational consequence at the reduce / wrap-up stage of a batch job.
- concepts/tail-latency-at-scale — the online-serving equivalent: per-request tail-latency degrades user experience. Curse-of-the-last-reducer is its batch-job equivalent: per-job tail-partition-latency degrades cluster utilisation.
- concepts/hot-key — the per-key version: one key receives disproportionate traffic. A hot key produces a straggler partition.
Mitigation playbook¶
The remediation depends on what's causing the skew:
Vectorise the per-record work¶
Don't reduce the partition; reduce the cost per record. The VF Match FDR pipeline canonicalises this approach: enabling Photon (a vectorised query engine) cut the worst-case partition from 30 minutes to ~2 minutes — a 15× improvement without changing the partitioning. This works when the per-record operation is SIMD-amenable (string / numeric comparisons, column-major reduces).
Salt / split the hot key¶
Split the hot key into K artificial sub-keys
(key_001, key_002, …); rebalance pairs evenly across
sub-partitions; reaggregate. Costs an extra shuffle but
collapses the straggler.
Skew join optimisation¶
Modern query engines (Spark AQE, Photon AQE) detect skewed join keys at runtime and split the skewed partition into multiple tasks. AQE-skew-join is the automatic version of manual salting.
Block the workload differently¶
For pairwise-comparison workloads (ER), tighten blocking rules so no block is huge. Trade-off: tight blocking risks missing matches; loose blocking produces skew.
Speculative execution¶
Run a duplicate task on the suspected straggler; commit whichever finishes first. Costs CPU; reduces tail. Common in MapReduce.
Rewrite to streaming / per-record-emit¶
Some workloads fundamentally don't have to be batch — streaming emits per record, no reduce-tail.
Failure modes of the mitigations¶
- Salting + reaggregation increases shuffle cost. Net win
depends on
(time_saved_on_straggler - extra_shuffle_cost). - AQE-skew-join requires runtime statistics. When the engine doesn't have row-count estimates, AQE can't trigger.
- Tighter blocking misses true matches. ER recall degrades.
- Vectorisation has Photon-eligibility constraints. UDFs, some join shapes, etc. fall back to non-vectorised execution.
Seen in¶
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — canonical wiki source. Verbatim "textbook case of stragglers (the 'curse of the last reducer')" on Splink pairwise-comparison Spark partitions: 30 minutes max vs 52 seconds median — ~35× skew ratio. Photon vectorisation collapsed the straggler 15× to ~2 minutes.
Related¶
- concepts/partition-skew-data-skew — the broader concept; curse-of-the-last-reducer is the wall-clock consequence.
- concepts/tail-latency-at-scale — online-serving sibling.
- concepts/hot-key — per-key version.
- concepts/vectorized-query-engine — the reduce per-record cost mitigation.
- systems/photon — the canonical vectorised-engine instance remediation.
- systems/apache-spark — the canonical batch-job substrate exhibiting this failure mode.
- systems/splink — the ER framework whose pairwise-comparison inner loop canonically generates this skew shape.
- patterns/high-cardinality-partition-key — partition-side mitigation when the cause is skewed key cardinality.