
PATTERN

Staged model unification

Pattern

When consolidating N surface-specific or workload-specific ML models into a unified model, sequence the unification by serving-cost profile (CUDA throughput on GPUs; per-request compute cost more generally). Pair surfaces with similar cost profiles first, and defer expensive surfaces until efficiency work has stabilised the unified model's serving cost.

Problem

A naive "merge all surfaces at once" unification fails in production because:

  1. Unified models typically increase per-request compute. Merged feature maps, merged modules, and joint feature coverage mean a bigger model. The unified baseline often regresses on serving latency versus any individual per-surface baseline.
  2. Efficiency work takes time. Request-level broadcasting, projection layers, fused kernels, quantisation — each is an independent multi-month engineering investment.
  3. Cost-heterogeneous surfaces can't share the same compute envelope. Unifying a cheap high-throughput surface with an expensive low-throughput surface forces the cheap surface to pay for the expensive surface's compute — the cheap surface blows its latency SLO.
  4. Atomic-rollout risk. Unifying all surfaces simultaneously means any regression hits every surface at once: the blast radius is the entire project.
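Point 3 is easiest to see with numbers. The sketch below uses made-up per-request costs and SLO budgets (the surface names and figures are illustrative, not from the source) to show how a naive unified model, which must carry the union of features, inherits roughly the cost of its most expensive member and blows the cheap surface's budget.

```python
def unified_cost_ms(per_surface_cost_ms):
    """A naive unified model tends to cost at least as much as its most
    expensive member, since it carries the union of features and modules."""
    return max(per_surface_cost_ms.values())

# Hypothetical per-request GPU costs and latency budgets (milliseconds).
costs = {"cheap_surface": 4.0, "expensive_surface": 18.0}
budgets = {"cheap_surface": 6.0, "expensive_surface": 25.0}

unified = unified_cost_ms(costs)
for surface, budget in budgets.items():
    verdict = "OK" if unified <= budget else "SLO blown"
    print(f"{surface}: unified {unified}ms vs budget {budget}ms -> {verdict}")
```

Here the cheap surface, which comfortably served at 4 ms, is forced to pay 18 ms and misses its 6 ms budget, even though the expensive surface is fine.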

Solution

Sequence the unification in cost-matched waves:

  1. Benchmark each surface's serving cost. Measure CUDA throughput (or equivalent) in isolation per surface.
  2. Pair surfaces by matched cost profile. Surfaces with similar throughput characteristics can share architecture without cost mismatches.
  3. First wave: unify the cheap-throughput pair. Roll out the baseline unified model + efficiency work on the matched pair. The cost envelope is realistic for both.
  4. Land efficiency work. Projection layers, broadcasting, fused kernels — ship alongside or immediately after the first unification wave.
  5. Later waves: unify expensive-throughput surfaces once efficiency work has brought the unified-model cost profile within reach of the expensive surfaces' SLOs.
     Measure CUDA throughput → pair HF + SR (similar); defer RP (expensive)
       │
     First unification wave: HF + SR → unified model v1
       │
     Efficiency work lands on v1:
       - DCNv2 projection layer
       - Fused kernel embedding
       - TF32 training
       - Request-level broadcast
       │
     Second unification wave: add RP → unified model v2
       │
     Target architecture achieved
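The sequencing above can be sketched as a small planning routine. This is a minimal sketch, assuming a ratio-based similarity threshold for "matched cost profile"; the function name, threshold, and throughput numbers are illustrative, not from the source.

```python
def plan_waves(throughput_qps, similarity=0.7):
    """Group surfaces whose serving throughput is within `similarity`
    ratio of every wave member, ordering waves cheapest-to-serve first
    (higher throughput = cheaper per request)."""
    # Cheapest surfaces first, so the first wave gets a realistic envelope.
    ordered = sorted(throughput_qps, key=throughput_qps.get, reverse=True)
    waves = []
    for surface in ordered:
        qps = throughput_qps[surface]
        for wave in waves:
            # Join a wave only if throughput matches every existing member.
            if all(min(qps, throughput_qps[s]) / max(qps, throughput_qps[s])
                   >= similarity for s in wave):
                wave.append(surface)
                break
        else:
            # No matched-cost wave exists yet: defer to a later wave.
            waves.append([surface])
    return waves

# Illustrative benchmarks: HF and SR are similar; RP is expensive.
print(plan_waves({"HF": 1000, "SR": 900, "RP": 300}))
# → [['HF', 'SR'], ['RP']]
```

With these numbers HF and SR fall in the first wave and RP is deferred, mirroring the HF + SR → RP sequencing in the diagram.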

Canonical wiki reference

Pinterest sequenced the unification of its three ads surfaces by CUDA throughput (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"Since the cost of Related Pins (RP), Home Feed (HF), and Search (SR) differ substantially, we first unified Home Feed and Search (similar CUDA throughput characteristics) and expanded to Related Pins only after throughput and efficiency work stabilized."

Three guiding principles paired with the sequencing:

  1. Start simple. Establish a pragmatic baseline by merging the strongest existing components.
  2. Iterate incrementally. Introduce surface-aware modeling (multi-task heads, surface-specific exports) only after the baseline demonstrates clear value.
  3. Maintain operational safety. Design for safe rollout, monitoring, and fast rollback at every step.

Why cost-matched pairing

  • Latency budget feasibility. Matched-cost surfaces share the same latency-budget regime. The unified model's cost falls in a range that both surfaces can absorb.
  • Incremental blast radius. A failure affects only the currently-unified surfaces, not all surfaces.
  • Efficiency work can be targeted. Lessons from the first wave (which efficiency optimisations actually moved the needle) inform the second wave.
  • Organisational learning. Teams acquire operational fluency with the unified model on easier surfaces before taking on harder ones.

Why not alphabetical / priority / traffic-volume ordering

  • Alphabetical is arbitrary and ignores cost.
  • Priority (most important surface first) risks blowing the biggest-revenue surface's SLO with a partially-optimised unified model.
  • Traffic volume (highest-QPS surface first) can mean unifying the most cost-sensitive surface first, which is exactly the one least tolerant of serving-cost regression.

Cost-matched pairing optimises for minimum operational risk during unification.

When to apply

  • Multi-surface / multi-workload model unification projects.
  • Serving-cost profiles vary substantially across surfaces.
  • Organisation has multiple quarters to execute the unification.
  • Efficiency work is a known lever (projection layers, broadcasting, fused kernels, quantisation) that will land during the project.

When NOT to apply

  • Similar-cost surfaces across the board. If all surfaces have matched cost profiles, staging by cost is a no-op.
  • Time pressure / urgent deprecation. If the per-surface models must be retired on a fixed timeline, all-at-once unification with aggressive efficiency work may be required.
  • Small number of surfaces (N=2). With only two surfaces there is little to sequence.

Generalisations

  • Staged datastore migration by workload. Pinterest's HBase deprecation (sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest) followed a similar shape at the storage layer: OLAP → Druid / StarRocks first, time-series → Goku, KV → KVStore, NewSQL (remaining) → TiDB. Cost-matched workload-specific migrations first; global substrate replacement last.
  • Staged framework migration. Frontend / mobile framework migrations (Jetpack Compose, React Server Components) often sequence by surface complexity — simple screens first, complex screens after tooling stabilises.
  • Staged system consolidation generally. Any time you're consolidating N independent systems into one, cost-matched pairing is a reliable sequencing heuristic.
