PATTERN
Surface-specific tower tree¶
Pattern¶
Within a unified multi-surface ML model, give each product surface / view type its own tower tree — a surface-routed subnetwork above the shared trunk, with late fusion of surface-specific modules into the tower. At serving time, route each request to its surface-specific tower tree so the request only pays for its surface's specialisation, not N−1 other surfaces' specialisations.
Problem¶
A unified multi-surface model with a fully shared architecture all the way to the prediction head suffers two costs:
- Over-generalisation cost. One set of weights tries to serve all surfaces → no surface gets optimal specialisation. Surface-specific patterns (e.g. feature interactions around Search's query-token signal, or around Related Pins' context-Pin signal) are under-fit.
- Serving-cost unfairness. If every surface runs through every architectural module, cheap surfaces pay for expensive surfaces' specialisation even though they don't use it.
Solution¶
Architecturally: shared trunk → surface-specific tower trees → surface-specific calibration → surface-specific prediction.
[ shared trunk — features, embeddings, encoder, MMoE ]
                          │
                          ▼
               (shared representation)
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
 surface-A tower   surface-B tower   surface-C tower
        +                 +                 +
   A-specific        B-specific        C-specific
     modules           modules           modules
  (late fusion)     (late fusion)     (late fusion)
        │                 │                 │
        ▼                 ▼                 ▼
 surface-A calib   surface-B calib   surface-C calib
        │                 │                 │
    A output          B output          C output
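The shape above can be sketched in plain Python. Everything here is illustrative: the "trunk" and "towers" are toy callables standing in for real networks, and the surface names are assumptions, not Pinterest's actual modules.

```python
# Toy sketch of trunk -> surface-specific tower-tree routing.
# Modules are plain callables over feature vectors (lists of floats).

def shared_trunk(features):
    # Shared across all surfaces: one representation per request.
    return [x * 2.0 for x in features]  # stand-in for embeddings/encoder/MMoE

def make_tower(bias):
    # A surface-specific tower tree; only this surface's traffic flows here.
    def tower(rep, surface_modules=()):
        out = [x + bias for x in rep]
        # Late fusion: surface-specific modules join inside the tower,
        # without being forced through the shared trunk.
        for module in surface_modules:
            out = [a + b for a, b in zip(out, module(rep))]
        return sum(out)  # stand-in for the tower's scalar output
    return tower

TOWERS = {
    "search": make_tower(0.1),
    "home_feed": make_tower(0.2),
    "related_pins": make_tower(0.3),
}

def predict(surface, features, surface_modules=()):
    rep = shared_trunk(features)                 # paid by every request
    return TOWERS[surface](rep, surface_modules) # only this surface's tower runs
```

A hypothetical Search-only module can then be fused in via `predict("search", feats, (query_module,))` while other surfaces never execute it.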
Pinterest's load-bearing framing (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):
"A single unified model that serves three surfaces, while still supporting the development of surface-specific modules (for example, surface-specific tower trees and late fusion with surface-specific modules within those tower trees). During serving, each surface-specific tower tree and its associated modules will handle only that surface's traffic, avoiding unnecessary compute cost from modules that don't benefit other surfaces."
Two routing mechanisms:
- Training-time. Examples from each surface update only their surface's tower tree + the shared trunk. The trunk receives gradient signal from all surfaces; the tower trees receive only their surface's gradient.
- Serving-time. A request's surface identity routes it to the correct tower tree; each request pays only for the shared trunk's compute plus its own surface's tower-tree compute.
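The training-time split can be illustrated with a two-weight toy model and manual gradients (entirely assumed, not from the post): an example from one surface updates the shared trunk weight and that surface's tower weight, and nothing else.

```python
# Toy gradient routing: trunk weight sees every surface's gradient,
# each tower weight sees only its own surface's gradient.
params = {"trunk": 1.0, "tower": {"search": 1.0, "home_feed": 1.0}}

def train_step(surface, x, target, lr=0.1):
    w_t, w_s = params["trunk"], params["tower"][surface]
    pred = w_s * (w_t * x)        # trunk, then the surface's tower
    err = pred - target
    # Manual gradients of 0.5 * err**2 w.r.t. each weight:
    params["trunk"] -= lr * err * w_s * x            # updated by all surfaces
    params["tower"][surface] -= lr * err * w_t * x   # updated by this one only

train_step("search", x=1.0, target=2.0)
assert params["tower"]["home_feed"] == 1.0  # other tower untouched
```

After the step, `trunk` and `tower["search"]` have moved toward the target while `tower["home_feed"]` is bit-for-bit unchanged, which is exactly the isolation the pattern relies on.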
Relationship to MMoE¶
MMoE routes experts per task via gates. Surface-specific tower trees route whole subnetworks per surface. The two can stack: an MMoE inside the shared trunk + surface-specific tower trees on top. MMoE handles fine-grained expert specialisation over the trunk; tower trees handle coarse-grained subnetwork specialisation over the downstream work.
Structurally, surface-specific tower trees are MMoE at the tower granularity — one gate per surface selecting its own tower-tree subnetwork, with near-binary routing (each surface sees only its tower at serving time).
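The analogy can be made concrete with toy expert outputs (values and gate weights are made up): MMoE mixes experts with a soft gate, while surface routing is the degenerate case of a one-hot gate.

```python
def mmoe_combine(gate_weights, expert_outputs):
    # Standard MMoE mixing: gate-weighted sum of expert outputs.
    return sum(w * e for w, e in zip(gate_weights, expert_outputs))

experts = [10.0, 20.0, 30.0]  # stand-ins for tower-tree outputs A, B, C

# Fine-grained MMoE-style soft gate (as inside the shared trunk):
soft = mmoe_combine([0.2, 0.5, 0.3], experts)

# Surface routing = near-binary gate: surface B activates only "expert" B.
surface_gate = {"A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}
hard = mmoe_combine(surface_gate["B"], experts)
assert hard == 20.0  # only surface B's tower contributes
```

Because the gate is one-hot and known from the request, the zero-weight "experts" never need to execute at serving time, which is where the compute saving comes from.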
Relationship to multi-task heads¶
Multi-task heads typically share the tower and only differ at the prediction head. Surface-specific tower trees go deeper — they diverge earlier (at the tower level) so surface-specific feature interactions can be modelled. A unified model can combine both: multi-task heads for per-task prediction variety, tower trees for per-surface representation variety.
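A minimal sketch of combining both axes, with hypothetical surface and task names: divergence happens early at the tower, then each tower fans out into per-task heads.

```python
def trunk(x):
    return x + 1.0                      # shared representation (toy)

# Per-surface towers: representation variety, diverging below the heads.
TOWER = {"search": lambda r: r * 2.0, "home_feed": lambda r: r * 3.0}

# Per-task heads: prediction variety on top of whichever tower ran.
HEADS = {"click": lambda t: t + 1.0, "save": lambda t: t + 2.0}

def predict(surface, x):
    t = TOWER[surface](trunk(x))        # surface-specific feature interactions
    return {task: head(t) for task, head in HEADS.items()}

out = predict("search", 1.0)
assert out == {"click": 5.0, "save": 6.0}
```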
Canonical wiki reference¶
Pinterest's unified ads engagement model uses Home Feed (HF) and Search (SR) tower trees as of the post; the Related Pins (RP) tower tree is future work (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces). Late fusion of surface-specific modules into each tower tree is named explicitly, letting surface-specific modules join the tower without being forced through the shared trunk.
When to apply¶
- Unified multi-surface model where surfaces have structurally different feature sets — surface-specific modules need to attach somewhere, and the tower tree is the natural attachment point.
- Serving-cost fairness matters — cheap surfaces shouldn't pay for expensive surfaces' specialisation.
- Surfaces have different user-intent priors that warrant different downstream compute paths.
Caveats¶
- Pinterest doesn't disclose tower tree depth, per-surface module count, or the late-fusion mechanism.
- Training-data imbalance risk. If one surface has much more traffic than others, its tower tree overfits faster than others' — requires per-surface loss weighting or sampling adjustment.
- Parameter count scales linearly with surfaces. Each new surface adds a whole tower tree; at large N, training cost and model size keep growing even though per-request serving compute stays flat.
- Distinct from sparse MoE LLMs. The per-surface routing is not "sparsely activate top-k of N experts"; it's "activate exactly your surface's tower, skip the others."
Related patterns / concepts¶
- patterns/unified-multi-surface-model — the outer pattern this fits inside.
- patterns/surface-specific-checkpoint-export — the complementary deployment mechanism.
- concepts/multi-task-learning — the broader framing.
- concepts/surface-specific-calibration — per-surface calibration on top of the tower tree.
- concepts/mixture-of-experts — MMoE as the sibling mechanism in the shared trunk.
Seen in¶
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: surface-specific tower trees + late fusion of surface-specific modules within those trees, each handling only its surface's traffic.