
Surface-specific tower tree

Pattern

Within a unified multi-surface ML model, give each product surface / view type its own tower tree — a surface-routed subnetwork above the shared trunk, with late fusion of surface-specific modules into the tower. At serving time, route each request to its surface-specific tower tree so the request only pays for its surface's specialisation, not N−1 other surfaces' specialisations.

Problem

A unified multi-surface model with a fully shared architecture all the way to the prediction head suffers two costs:

  1. Over-generalisation cost. One set of weights tries to serve all surfaces → none gets optimal specialisation. Some surface-specific patterns (feature interactions specific to Search's query-token signal, feature interactions specific to Related Pins' context-Pin signal) are under-fit.
  2. Serving-cost unfairness. If every surface runs through every architectural module, cheap surfaces pay for expensive surfaces' specialisation even though they don't use it.

Solution

Architecturally: shared trunk → surface-specific tower trees → surface-specific calibration → surface-specific prediction.

         [ shared trunk — features, embeddings, encoder, MMoE ]
                  (shared representation)
              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
      surface-A tower  surface-B tower  surface-C tower
           +                +                +
      A-specific       B-specific       C-specific
      modules          modules          modules
       (late fusion)   (late fusion)   (late fusion)
              │             │             │
              ▼             ▼             ▼
      surface-A calib  surface-B calib  surface-C calib
              │             │             │
           A output      B output      C output
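The serving-time shape of the diagram can be sketched in a few lines. Everything here is hypothetical (toy transforms stand in for the undisclosed trunk and tower internals); the point is the dispatch structure:

```python
# Hypothetical sketch: shared trunk + per-surface tower-tree dispatch.
# The trunk and tower bodies are toy stand-ins, not Pinterest's modules.

def shared_trunk(features):
    # Shared representation: here just a toy transform of the features.
    return [x * 2 for x in features]

def make_tower(bias):
    # Each surface's tower tree is its own callable; surface-specific
    # modules would be late-fused inside this function.
    def tower(rep):
        return sum(rep) + bias
    return tower

TOWERS = {
    "homefeed": make_tower(0.1),
    "search": make_tower(0.2),
    "related_pins": make_tower(0.3),
}

def predict(surface, features):
    rep = shared_trunk(features)  # paid by every request
    return TOWERS[surface](rep)   # only this surface's tower runs
```

A request pays for the trunk plus exactly one tower; the other N−1 towers are never invoked.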

Pinterest's load-bearing framing (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"A single unified model that serves three surfaces, while still supporting the development of surface-specific modules (for example, surface-specific tower trees and late fusion with surface-specific modules within those tower trees). During serving, each surface-specific tower tree and its associated modules will handle only that surface's traffic, avoiding unnecessary compute cost from modules that don't benefit other surfaces."

Two routing mechanisms:

  • Training-time. Examples from each surface update only their surface's tower tree + the shared trunk. The trunk receives gradient signal from all surfaces; the tower trees receive only their surface's gradient.
  • Serving-time. A request's surface identity routes it to the correct tower tree. Only the trunk compute + its surface tower tree's compute is paid per request.
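The training-time rule reduces to selective parameter updates. A toy sketch (hand-written gradients and SGD in place of a real autograd framework; all numbers illustrative):

```python
# Hypothetical sketch of per-surface gradient routing: an example from
# surface s updates the shared trunk's params plus only tower[s]'s params.

params = {
    "trunk": [1.0, 1.0],
    "tower": {"homefeed": [0.5], "search": [0.5]},
}

def sgd_step(example_surface, trunk_grad, tower_grad, lr=0.1):
    # The trunk receives gradient signal from every surface's examples.
    params["trunk"] = [w - lr * g for w, g in zip(params["trunk"], trunk_grad)]
    # Only the matching surface's tower tree is updated.
    tower = params["tower"][example_surface]
    params["tower"][example_surface] = [
        w - lr * g for w, g in zip(tower, tower_grad)
    ]

# A Search example moves the trunk and the Search tower;
# the Homefeed tower is untouched.
sgd_step("search", trunk_grad=[0.2, 0.2], tower_grad=[0.4])
```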

Relationship to MMoE

MMoE routes experts per task via gates. Surface-specific tower trees route whole subnetworks per surface. The two can stack: an MMoE inside the shared trunk + surface-specific tower trees on top. MMoE handles fine-grained expert specialisation within the trunk; tower trees handle coarse-grained subnetwork specialisation over the downstream computation.

Structurally, surface-specific tower trees are MMoE at the tower granularity — one gate per surface selecting its own tower-tree subnetwork, with near-binary routing (each surface sees only its tower at serving time).
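Under toy shapes, the "MMoE at tower granularity" reading is the contrast between a soft gate that mixes all experts and a one-hot gate that selects a single tower (assumed scalar outputs for brevity):

```python
# Toy contrast (hypothetical shapes): a soft MMoE gate blends all experts,
# while surface routing is a one-hot gate selecting a single tower.

def soft_gate(weights, expert_outputs):
    # MMoE-style: weighted sum over all experts (weights sum to 1),
    # so every expert must be computed.
    return sum(w * e for w, e in zip(weights, expert_outputs))

def one_hot_gate(surface_idx, tower_outputs):
    # Tower-tree routing: pick exactly one output; at serving time the
    # unselected towers are simply never computed.
    return tower_outputs[surface_idx]

outputs = [1.0, 2.0, 4.0]
mix = soft_gate([0.5, 0.25, 0.25], outputs)  # blends every expert
pick = one_hot_gate(2, outputs)              # uses only tower 2
```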

Relationship to multi-task heads

Multi-task heads typically share the tower and only differ at the prediction head. Surface-specific tower trees go deeper — they diverge earlier (at the tower level) so surface-specific feature interactions can be modelled. A unified model can combine both: multi-task heads for per-task prediction variety, tower trees for per-surface representation variety.
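The combination can be sketched as surface routing followed by per-task heads. All names and transforms below are hypothetical; the task names are illustrative, not Pinterest's:

```python
# Hypothetical sketch: route by surface to a tower, then fan out to
# per-task prediction heads on that tower's output.

def tower(surface, rep):
    # Stand-in per-surface tower: a surface-dependent transform.
    scale = {"homefeed": 1.0, "search": 2.0}[surface]
    return [scale * x for x in rep]

def heads(rep):
    # Multi-task heads share the tower output, differing only at the top.
    return {"click": sum(rep), "save": max(rep)}

def predict(surface, rep):
    # Per-surface representation variety, then per-task prediction variety.
    return heads(tower(surface, rep))

out = predict("search", [1.0, 3.0])
```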

Canonical wiki reference

Pinterest's unified ads engagement model used Homefeed (HF) and Search (SR) tower trees at the time of the post; a Related Pins (RP) tower tree was future work (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces). Late fusion of surface-specific modules into each tower tree is named explicitly, allowing surface-specific modules to join the tower without being forced through the shared trunk.

When to apply

  • Unified multi-surface model where surfaces have structurally different feature sets — surface-specific modules need to attach somewhere, and the tower tree is the natural attachment point.
  • Serving-cost fairness matters — cheap surfaces shouldn't pay for expensive surfaces' specialisation.
  • Surfaces have different user-intent priors that warrant different downstream compute paths.

Caveats

  • Pinterest doesn't disclose tower tree depth, per-surface module count, or the late-fusion mechanism.
  • Training-data imbalance risk. If surface traffic is very unbalanced, the shared trunk is dominated by the high-traffic surface while low-traffic towers see few gradient updates and risk over-fitting their small sample; this calls for per-surface loss weighting or sampling adjustments.
  • Parameter count scales linearly with surfaces. Each new surface adds a whole tower tree; at large N, the total parameter count grows.
  • Distinct from sparse MoE LLMs. The per-surface routing is not "sparsely activate top-k of N experts"; it's "activate exactly your surface's tower, skip the others."
