Skip to content

CONCEPT Cited by 1 source

Curse of the last reducer

Definition

The curse of the last reducer is the canonical name for the failure mode in batch (MapReduce / Spark / Flink batch) jobs where the slowest partition / reducer dominates wall-clock job latency. In a job where partitions complete in highly variable times, the job finishes only when the slowest partition finishes — every other worker sits idle waiting.

The VF Match FDR pipeline canonicalises the framing verbatim:

"The core of record linkage is pairwise comparison, which creates inherently skewed workloads: common comparisons produce massive partitions while most others remain much smaller. Early runs made this painfully clear, with one Spark partition running for 30 minutes while the median completed in 52 seconds — a textbook case of stragglers (the 'curse of the last reducer') degrading job performance."

The math

If partitions complete in times T_1 ≤ T_2 ≤ ... ≤ T_N, then:

  • Total work = Σ T_i.
  • Wall clock = T_N (the maximum).
  • Useful CPU utilisation = (Σ T_i) / (N · T_N).

When skew is high (a 35× ratio between max and median, as in the VF Match observation), useful CPU utilisation drops dramatically — the rest of the cluster is idle while the straggler runs.

A median-completing-in-52s vs max-30min job has roughly: - N · 52 seconds of evenly-distributed work. - 30 · 60 = 1800 seconds wall clock. - (N · 52) / (N · 1800) = 52/1800 ≈ 2.9% cluster CPU utilisation during the straggler tail.

The cluster's effective parallelism collapses; you've paid for N workers and you're using one.

Where it comes from

  • Skewed data distributions. Some keys (cities, surnames, null values) appear in disproportionately many records; their partitions become disproportionately heavy.
  • Pairwise comparison workloads — entity resolution, joins, deduplication. The cardinality of pairs within a block can be quadratic in the block size; one disproportionately-large block produces a disproportionately-long partition.
  • Self-joins / fan-out joins. A join key with one highly-frequent value generates a Cartesian product partition.
  • Skewed partitioner functions. hash(key) mod N with non-uniform key distribution produces hot partitions.

Distinguishing from sibling concepts

  • concepts/partition-skew-data-skew — the broader concept of uneven distribution across partitions; curse of the last reducer is the specific operational consequence at the reduce / wrap-up stage of a batch job.
  • concepts/tail-latency-at-scale — the online-serving equivalent: per-request tail-latency degrades user experience. Curse-of-the-last-reducer is its batch-job equivalent: per-job tail-partition-latency degrades cluster utilisation.
  • concepts/hot-key — the per-key version: one key receives disproportionate traffic. A hot key produces a straggler partition.

Mitigation playbook

The remediation depends on what's causing the skew:

Vectorise the per-record work

Don't reduce the partition; reduce the cost per record. The VF Match FDR pipeline canonicalises this approach: enabling Photon (a vectorised query engine) cut the worst-case partition from 30 minutes to ~2 minutes — a 15× improvement without changing the partitioning. This works when the per-record operation is SIMD-amenable (string / numeric comparisons, column-major reduces).

Salt / split the hot key

Split the hot key into K artificial sub-keys (key_001, key_002, …); rebalance pairs evenly across sub-partitions; reaggregate. Costs an extra shuffle but collapses the straggler.

Skew join optimisation

Modern query engines (Spark AQE, Photon AQE) detect skewed join keys at runtime and split the skewed partition into multiple tasks. AQE-skew-join is the automatic version of manual salting.

Block the workload differently

For pairwise-comparison workloads (ER), tighten blocking rules so no block is huge. Trade-off: tight blocking risks missing matches; loose blocking produces skew.

Speculative execution

Run a duplicate task on the suspected straggler; commit whichever finishes first. Costs CPU; reduces tail. Common in MapReduce.

Rewrite to streaming / per-record-emit

Some workloads fundamentally don't have to be batch — streaming emits per record, no reduce-tail.

Failure modes of the mitigations

  • Salting + reaggregation increases shuffle cost. Net win depends on (time_saved_on_straggler - extra_shuffle_cost).
  • AQE-skew-join requires runtime statistics. When the engine doesn't have row-count estimates, AQE can't trigger.
  • Tighter blocking misses true matches. ER recall degrades.
  • Vectorisation has Photon-eligibility constraints. UDFs, some join shapes, etc. fall back to non-vectorised execution.

Seen in

Last updated · 542 distilled / 1,571 read