
PATTERN

Unified multi-task over multi-head

Pattern

Move from a multi-head architecture to a unified single-head multi-task architecture. In the multi-head setup, multiple task-specific heads sit on top of shared encoders and are trained with distinct losses, but only one head's output is used at serving time. In the unified setup, task-specific supervision is still applied during training, and the single served representation directly benefits from the multi-task optimisation.

The two architectures are structurally distinct even though both train on multiple tasks:

Multi-head (older):                      Unified multi-task (newer):

    shared encoders                          shared encoders
         │                                         │
     ┌───┴───┐                              single unified head
     ▼       ▼                                     │
 head A   head B                      multi-task optimisation over all tasks
   │         │                                     │
 loss A   loss B                         served embeddings directly benefit
   │         │                           from the joint task supervision
   └──weighted loss──┘

  At serving: one head's
  embeddings chosen;
  other head's work discarded.

Multi-head's value comes from each task head having its own fitted subspace. Unified multi-task's value comes from the served representation directly embedding multi-task supervision. In sparse-primary-task regimes, the unified approach tends to win because per-head conversion embeddings become unstable in low-coverage regions — merging the heads forces the shared representation to carry the multi-task signal, which is more stable than any per-head slice.
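A minimal sketch of the structural difference, assuming a PyTorch two-tower-style query encoder; the class names, dimensions, and layer choices below are illustrative assumptions, not Pinterest's implementation:

  import torch
  import torch.nn as nn

  class MultiHeadModel(nn.Module):
      # Older layout: shared trunk, one projection head per task.
      # Both heads are trained, but only one head's embedding is served.
      def __init__(self, in_dim=512, trunk_dim=256, emb_dim=64):
          super().__init__()
          self.trunk = nn.Sequential(nn.Linear(in_dim, trunk_dim), nn.ReLU())
          self.engagement_head = nn.Linear(trunk_dim, emb_dim)  # trained, then discarded at serving
          self.conversion_head = nn.Linear(trunk_dim, emb_dim)  # the only head that is served

      def forward(self, x):
          h = self.trunk(x)
          return self.engagement_head(h), self.conversion_head(h)

  class UnifiedModel(nn.Module):
      # Newer layout: shared trunk, a single head. Every task loss is computed
      # against the same embedding, and that same embedding is what gets served.
      def __init__(self, in_dim=512, trunk_dim=256, emb_dim=64):
          super().__init__()
          self.trunk = nn.Sequential(nn.Linear(in_dim, trunk_dim), nn.ReLU())
          self.unified_head = nn.Linear(trunk_dim, emb_dim)

      def forward(self, x):
          return self.unified_head(self.trunk(x))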

Problem

In a multi-head architecture with a sparse primary task:

  • Each head's embedding is only as stable as its own task's gradient signal.
  • The primary (sparse) task's head is trained with high gradient variance in low-coverage regions, so its embedding quality degrades where data is thin.
  • The auxiliary (dense) task's head is stable, but its output isn't used at serving — so the auxiliary's stabilisation effect is wasted at inference time (only indirectly transferred via the shared trunk).
  • The conversion head and engagement head end up representing the same items in subtly misaligned subspaces.

Pinterest identified exactly this in their shopping conversion candidate generator (CG):

"Through in-depth data analysis and several online experiments, we identified sparsity and noise in the conversion labels as one of the main bottlenecks of the previous model performance. To better stabilize query embeddings in regions of low conversion coverage, we moved from a multi-head architecture to a unified single-head multi-task architecture."

Solution

Replace N task-specific heads with a single unified head, then apply multi-task supervision as:

  1. Loss combination at the unified head's output. Multiple task losses are computed against the same embedding set and combined with tuned task weights (see the sketch after this list).
  2. Auxiliary-granularity losses layered at the unified head. Pinterest adds an advertiser-level loss as a parallel training objective on the same embedding set.
  3. Single served representation. At inference the model produces one embedding set per input; no head-selection decision.
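A minimal sketch of steps 1 and 2, in PyTorch. The in-batch softmax stands in for Pinterest's sampled softmax, and the task names, weights, and batch layout are illustrative assumptions:

  import torch
  import torch.nn.functional as F

  def in_batch_softmax_loss(query_emb, item_emb, temperature=0.05):
      # Stand-in retrieval loss: each query's positive is the item at the same
      # batch index; every other in-batch item acts as a negative.
      logits = query_emb @ item_emb.t() / temperature
      labels = torch.arange(logits.size(0), device=logits.device)
      return F.cross_entropy(logits, labels)

  def unified_multitask_loss(query_emb, engaged_pins, converted_pins, converted_advertisers,
                             w_eng=1.0, w_conv=1.0, w_adv=0.5):
      # Every task loss is computed against the SAME query embedding; only the
      # positives (and the tuned task weights) differ. In practice each task's
      # loss would typically be restricted to the examples carrying its label.
      return (w_eng * in_batch_softmax_loss(query_emb, engaged_pins)
              + w_conv * in_batch_softmax_loss(query_emb, converted_pins)
              + w_adv * in_batch_softmax_loss(query_emb, converted_advertisers))

Tuning w_eng / w_conv / w_adv is the loss-weight design surface mentioned under "When to apply" and "Caveats".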

Pinterest's framing:

"By merging the conversion and engagement heads, it allows the final embeddings to directly benefit from the multi-task optimization during serving."

Canonical Pinterest migration — 2023 → 2025

2023 (multi-head, first launch):

  • Two heads: engagement + conversion.
  • Sampled softmax loss per head, weighted combination.
  • At serving: only conversion head's Pin and query embeddings used.

2025 (unified multi-task, refresh):

  • Single merged head.
  • Multi-task optimisation on the same embedding set.
  • Added advertiser-level loss as additional training objective.
  • At serving: single embedding set serves directly.

Combined with the parallel DCN + MLP cross architecture change, the 2025 refresh produced +42% recall@100 for conversion tasks (Pinterest internal data, US, 2023-2025).
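A serving-time contrast, continuing the hypothetical MultiHeadModel and UnifiedModel classes sketched in the Pattern section above:

  import torch

  features = torch.randn(8, 512)  # toy query features

  # 2023-style serving: both heads are computed, but only the conversion head's
  # embeddings are exported; the engagement head's work is discarded here.
  _, served_2023 = MultiHeadModel()(features)

  # 2025-style serving: a single embedding set serves directly, no head selection.
  served_2025 = UnifiedModel()(features)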

Comparison to sibling Pinterest work

Pinterest's ads engagement model (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) keeps a multi-head (per-task) + surface-specific tower tree + per-surface checkpoint export structure — because in the engagement-model case, per-surface specialisation is the primary value add (different surfaces have distinct traffic distributions and calibration needs). The conversion-CG case is different: surfaces share a single model with surface-specific features, and the primary specialisation axis is tasks (conversion vs engagement) rather than surfaces. Unifying tasks makes sense when the tasks are semantically close and the shared representation carries the signal both directions need.

Framing: unified multi-task > multi-head when the tasks' embeddings need to converge for serving; multi-head > unified when the heads' specialisations are the whole value.

When to apply

  • Multi-head architecture where only one head's output is served and the other head's training work is discarded at inference.
  • Primary task is sparse / noisy in a way that makes per-head embeddings unstable in low-coverage regions.
  • Tasks are semantically close enough that their embeddings can productively share a subspace.
  • Ability to tune loss weights as a first-class design surface.

When NOT to apply

  • Tasks are structurally different — different output domains, different calibration requirements, different serving surfaces. Multi-head with per-task specialisation is better.
  • Multiple heads all served simultaneously (e.g. multi-output ranking with different objective functions) — unified single-head doesn't apply.
  • Task weighting is intractable to tune; multi-head with isolated heads is simpler to operate.

Caveats

  • Loss-weight tuning complexity — balancing task weights in a unified head is non-trivial; done wrong, one task dominates and the other regresses silently.
  • Task interference — unified head architectures can suffer gradient conflicts across tasks. Pinterest doesn't describe task-interference mitigations explicitly.
  • Calibration shifts — merging heads can change calibration; may require re-calibration work downstream.
  • Rollout risk — migrating a live multi-head model to unified is a non-trivial architectural change; must be staged + A/B-tested carefully. Pinterest doesn't describe rollout details.
  • Not a universal improvement — the pattern works for Pinterest's sparse-primary-task setup; for denser data or more-specialised-task setups, multi-head may still win.

