PATTERN
Staged model unification¶
Pattern¶
When consolidating N surface-specific or workload-specific ML models into a unified model, sequence the unification by serving-cost profile (CUDA throughput on GPUs, or per-request compute cost more generally). Pair surfaces with similar cost profiles first, and defer expensive surfaces until efficiency work has stabilised the unified model's serving cost.
Problem¶
A naive "merge all surfaces at once" unification fails in production because:
- Unified models typically increase per-request compute. Merged feature maps, merged modules, and joint feature coverage yield a bigger model, so the unified baseline often regresses on serving latency versus any individual per-surface baseline.
- Efficiency work takes time. Request-level broadcasting, projection layers, fused kernels, quantisation — each is an independent multi-month engineering investment.
- Cost-heterogeneous surfaces can't share the same compute envelope. Unifying a cheap high-throughput surface with an expensive low-throughput surface forces the cheap surface to pay for the expensive surface's compute — the cheap surface blows its latency SLO.
- Atomic-rollout risk. Unifying all surfaces simultaneously means any regression affects all surfaces at once; the blast radius is the entire project.
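The compute-envelope mismatch can be made concrete with a toy feasibility check. The surface names, latency figures, and SLO numbers below are hypothetical, chosen only to illustrate how a partially-optimised unified model can fit one surface's budget while blowing another's:

```python
def unified_model_fits(unified_latency_ms, surface_slos_ms):
    """Report, per surface, whether the unified model's serving latency
    fits that surface's latency SLO. All numbers are illustrative."""
    return {surface: unified_latency_ms <= slo
            for surface, slo in surface_slos_ms.items()}

# A unified model at 40 ms fits the expensive surface's 60 ms budget
# but blows the cheap, latency-sensitive surface's 25 ms budget.
print(unified_model_fits(40, {"cheap_surface": 25, "expensive_surface": 60}))
```

This is the failure mode cost-matched pairing avoids: the cheap surface is forced to pay for the expensive surface's compute.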
Solution¶
Sequence the unification in cost-matched waves:
- Benchmark each surface's serving cost. Measure CUDA throughput (or equivalent) in isolation per surface.
- Pair surfaces by matched cost profile. Surfaces with similar throughput characteristics can share architecture without cost mismatches.
- First wave: unify the cheap-throughput pair. Roll out the baseline unified model + efficiency work on the matched pair. The cost envelope is realistic for both.
- Land efficiency work. Projection layers, broadcasting, fused kernels — ship alongside or immediately after the first unification wave.
- Later waves: unify expensive-throughput surfaces once efficiency work has brought the unified-model cost profile within reach of the expensive surfaces' SLOs.
Measure CUDA throughput → pair HF + SR (similar) + defer RP (expensive)
│
▼
First unification wave: HF + SR → unified model v1
│
▼
Efficiency work lands on v1:
DCNv2 projection layer
Fused kernel embedding
TF32 training
Request-level broadcast
│
▼
Second unification wave: add RP to unified model v2
│
▼
Target architecture achieved
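The sequencing above can be sketched as a greedy grouping of surfaces by measured serving cost. The `max_ratio` threshold and the throughput numbers here are illustrative assumptions, not values from the Pinterest post; in practice the pairing decision also weighs SLOs and traffic mix:

```python
def plan_unification_waves(throughputs, max_ratio=1.5):
    """Group surfaces into unification waves by serving-cost similarity.

    throughputs: dict of surface name -> measured CUDA throughput
    (requests/sec per GPU; higher = cheaper to serve). A surface joins
    the current wave if its throughput is within max_ratio of the
    wave's cheapest-to-serve member; otherwise it starts a later wave.
    """
    # Cheapest-to-serve (highest throughput) surfaces go first.
    ordered = sorted(throughputs.items(), key=lambda kv: -kv[1])
    waves = []
    for surface, tput in ordered:
        if waves and waves[-1][0][1] / tput <= max_ratio:
            waves[-1].append((surface, tput))
        else:
            waves.append([(surface, tput)])
    return [[surface for surface, _ in wave] for wave in waves]

# Illustrative numbers only -- not Pinterest's actual measurements.
print(plan_unification_waves({"HF": 900, "SR": 800, "RP": 300}))
# [['HF', 'SR'], ['RP']]: HF + SR unify first, RP is deferred.
```

With these made-up numbers the planner reproduces the source's ordering: Home Feed and Search (similar throughput) form wave one, and Related Pins lands in a later wave once efficiency work has narrowed the gap.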
Canonical wiki reference¶
Pinterest sequenced the unification of its three ads surfaces by CUDA throughput (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):
"Since the cost of Related Pins (RP), Home Feed (HF), and Search (SR) differ substantially, we first unified Home Feed and Search (similar CUDA throughput characteristics) and expanded to Related Pins only after throughput and efficiency work stabilized."
Three guiding principles paired with the sequencing:
- Start simple. Establish a pragmatic baseline by merging strongest existing components.
- Iterate incrementally. Introduce surface-aware modeling (multi-task heads, surface-specific exports) only after the baseline demonstrates clear value.
- Maintain operational safety. Design for safe rollout, monitoring, and fast rollback at every step.
Why cost-matched pairing¶
- Latency budget feasibility. Matched-cost surfaces share the same latency-budget regime. The unified model's cost falls in a range that both surfaces can absorb.
- Incremental blast radius. A failure affects only the currently-unified surfaces, not all surfaces.
- Efficiency work can be targeted. Lessons from the first wave (which efficiency optimisations actually moved the needle) inform the second wave.
- Organisational learning. Teams acquire operational fluency with the unified model on easier surfaces before taking on harder ones.
Why not alphabetical / priority / traffic-volume ordering¶
- Alphabetical is arbitrary and ignores cost.
- Priority (most important surface first) risks blowing the biggest-revenue surface's SLO with a partially-optimised unified model.
- Traffic volume (highest-QPS surface first) can mean unifying the most cost-sensitive surface first, which is exactly the one least tolerant of serving-cost regression.
Cost-matched pairing optimises for minimum operational risk during unification.
When to apply¶
- Multi-surface / multi-workload model unification projects.
- Serving-cost profiles vary substantially across surfaces.
- Organisation has multiple quarters to execute the unification.
- Efficiency work is a known lever (projection layers, broadcasting, fused kernels, quantisation) that will land during the project.
When NOT to apply¶
- Similar-cost surfaces across the board. If all surfaces have matched cost profiles, staging by cost is a no-op.
- Time pressure / urgent deprecation. If the per-surface models must be retired on a fixed timeline, all-at-once unification with aggressive efficiency work may be required.
- Small number of surfaces (N=2). Little to sequence.
Related patterns / concepts¶
- patterns/unified-multi-surface-model — the outer pattern.
- patterns/surface-specific-tower-tree — the architectural mechanism inside each unified-model wave.
- concepts/cuda-throughput-budget — the sequencing axis.
Generalisations¶
- Staged datastore migration by workload. Pinterest's HBase deprecation (sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest) followed a similar shape at the storage layer: OLAP → Druid / StarRocks first, time-series → Goku, KV → KVStore, NewSQL (remaining) → TiDB. Cost-matched workload-specific migrations first; global substrate replacement last.
- Staged framework migration. Frontend / mobile framework migrations (Jetpack Compose, React Server Components) often sequence by surface complexity — simple screens first, complex screens after tooling stabilises.
- Staged system consolidation generally. Any time you're consolidating N independent systems into one, cost-matched pairing is a reliable sequencing heuristic.
Seen in¶
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: HF + SR unified first (similar CUDA throughput), RP deferred until efficiency work stabilised.
- 2024-05-14 Pinterest — HBase Deprecation at Pinterest (sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest) — workload-specific migration sequencing at the storage layer; conceptually parallel.