CONCEPT Cited by 1 source

Hybrid tower inference split

Definition

Hybrid tower inference split is a serving-time architectural split applied to the user (or query) tower of a two-tower retrieval model: the expensive, history-dependent portion of the tower runs offline (with the result cached in a feature store), while the lightweight, context-dependent portion of the tower runs online at request time.

The split is structural, not deployment-mechanical: the model is architected so that offline and online computations meet at a defined intermediate representation, typically the last hidden state of an offline encoder concatenated with online-computed features before a final MLP head.

Pinterest's canonical formulation (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models):

"Offline Inference: The majority of the user tower (the Transformer encoder) is inferred offline, and the last hidden state of the transformer (the encoded representations of the event sequence) is stored in the feature store. This is refreshed on a daily basis for users with new offsite activity. Online Inference: The remaining part of the user tower — the context layer and the final MLP head — is computed online at serving time, taking the real-time context features and the pre-computed offline user signal as inputs."

Why the split is architectural, not optional

The split is forced by two hard constraints that pull in opposite directions; it is not an optional optimisation:

  • Offline-required computation: heavy sequence encoders (Transformers over long user histories) are too expensive to run online at retrieval-stage QPS within the latency budget, so their outputs must be precomputed.
  • Online-required computation: real-time context features (the subject Pin the user is viewing, the search query they just typed, the current session state) only exist at request time. The component that consumes them must run online.

A model that wants to use both kinds of signal at the same retrieval stage has no choice but to split somewhere. The hybrid split is the named structural answer.

Generalisation of the precomputed-tower pattern

Two-tower's classic deployment pattern already precomputes one tower:

  • Item tower: outputs cached in an ANN index, rebuilt periodically.
  • Query / user tower: runs once per request.

Hybrid tower inference split applies the same precompute-and-cache logic to part of the user tower as well. The user tower stops being a single online forward pass and becomes a two-stage computation:

Offline batch (per user, daily)
  user history → heavy encoder → cached vector → feature store

Online (per request)
  feature-store lookup (cached vector) ──┐
                                         ├── online tower head → user embedding
  real-time context features ────────────┘

The savings: the heavy encoder runs once per user per refresh interval, not once per request. The cost: a freshness ceiling on the cached portion.
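
The two-stage flow above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Pinterest's implementation: the mean-pooling `heavy_encoder` is a stand-in for the Transformer, a dict stands in for the feature store, the online head is a single tanh layer rather than a context layer plus MLP, and all names and dimensions are hypothetical.

```python
import math
import random

HIDDEN_DIM = 4   # size of the cached offline vector (illustrative)
CONTEXT_DIM = 3  # size of the real-time context vector (illustrative)
OUTPUT_DIM = 2   # size of the final user embedding (illustrative)

stats = {"encoder_calls": 0}  # counts runs of the expensive encoder


def heavy_encoder(history):
    """Stand-in for the offline Transformer: mean-pools the event vectors."""
    stats["encoder_calls"] += 1
    return [sum(event[i] for event in history) / len(history)
            for i in range(HIDDEN_DIM)]


def offline_batch(histories, feature_store):
    """Offline stage: encode each user's history once and cache the result."""
    for user_id, history in histories.items():
        feature_store[user_id] = heavy_encoder(history)


def online_head(cached, context, weights, bias):
    """Online stage: concatenate cached state with context, apply one layer."""
    x = list(cached) + list(context)
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def serve(user_id, context, feature_store, weights, bias):
    """Per-request path: feature-store lookup plus the lightweight head."""
    return online_head(feature_store[user_id], context, weights, bias)


# Randomly initialised head weights (a trained model would supply these).
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(HIDDEN_DIM + CONTEXT_DIM)]
           for _ in range(OUTPUT_DIM)]
bias = [0.0] * OUTPUT_DIM

feature_store = {}
histories = {"u1": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]}
offline_batch(histories, feature_store)

# Two requests from the same user with different real-time context.
emb_a = serve("u1", [1.0, 0.0, 0.0], feature_store, weights, bias)
emb_b = serve("u1", [0.0, 0.0, 1.0], feature_store, weights, bias)
```

Both requests reuse the same cached vector, so the encoder runs once for the user while the served embedding still changes with the context.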

Freshness vs cost trade-off

The cached portion of the user tower has a staleness window equal to the refresh interval. Pinterest refreshes daily, and only "for users with new offsite activity", so anything a user does after the last batch run (up to roughly 24 hours of activity) is invisible to the cached state until the next refresh.

The online context layer is what compensates for the staleness: real-time intent signals (subject Pin, search query) carry the "what the user wants right now" information that the cached state lacks. The split assumes a structural separation between slow-changing user state (history → cached) and fast-changing user intent (current page → online).
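
The separation between cached state and online intent can be made concrete with a small timestamp check. This is a sketch with made-up event names and times; only the daily refresh cadence comes from the quoted post.

```python
from datetime import datetime, timedelta

REFRESH_INTERVAL = timedelta(hours=24)  # daily refresh, per the quoted post


def is_in_cached_state(event_time, last_refresh):
    """An event reaches the cached vector only if it predates the last batch."""
    return event_time <= last_refresh


def staleness(now, last_refresh):
    """Age of the cached vector at request time."""
    return now - last_refresh


# Hypothetical timeline for one user.
last_refresh = datetime(2026, 5, 8, 2, 0)
now = datetime(2026, 5, 8, 20, 0)
events = {
    "viewed_shoes": datetime(2026, 5, 7, 18, 0),
    "searched_tents": datetime(2026, 5, 8, 19, 30),
}

cached = [name for name, t in events.items()
          if is_in_cached_state(t, last_refresh)]
online_only = [name for name, t in events.items()
               if not is_in_cached_state(t, last_refresh)]
age = staleness(now, last_refresh)
```

Here the recent search is absent from the cached state, so it can influence retrieval only if it arrives as a real-time context feature.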

Hazards specific to hybrid tower inference split

Most of the standard two-tower hazards apply, with hybrid-specific variants:

Embedding / feature version skew

Like classic two-tower embedding version skew, but with an extra surface: the cached user state is produced by one model checkpoint, while the online context layer may be served from another. If the two get out of sync (for example, online code rolls forward before the feature store is rebuilt), the concatenated vector is shape-correct but distribution-wrong, so nothing fails loudly. Mitigations parallel two-tower's: see batch embedding for index consistency.
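
One possible guard, sketched here as an assumption rather than anything the source describes, is to store the encoder version next to each cached vector and have the online head refuse mismatched inputs. `CachedUserState`, `VersionSkewError`, `load_cached_vector`, and the version strings are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedUserState:
    vector: tuple          # last hidden state written by the offline batch
    encoder_version: str   # checkpoint that produced the vector


class VersionSkewError(Exception):
    """Raised when cached state and online head come from different models."""


def load_cached_vector(state, expected_version):
    """Return the cached vector only if it matches the online head's version."""
    if state.encoder_version != expected_version:
        raise VersionSkewError(
            f"cached state from {state.encoder_version!r}, "
            f"online head expects {expected_version!r}"
        )
    return state.vector


fresh = CachedUserState(vector=(0.1, 0.2), encoder_version="enc-v2")
stale = CachedUserState(vector=(0.3, 0.4), encoder_version="enc-v1")

ok_vector = load_cached_vector(fresh, "enc-v2")

try:
    load_cached_vector(stale, "enc-v2")
    skew_detected = False
except VersionSkewError:
    skew_detected = True
```

Whether a mismatch should fail the request, fall back to a default vector, or route to an older head is a deployment decision the sketch leaves open.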

Cached-state size

The cached portion (last hidden state) is one vector per user. With hundreds of millions of monthly active users, the feature store holding these vectors can become large. The trade-off: the offline state vector must be large enough to carry meaningful history information but small enough to store and serve at scale.
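
A back-of-the-envelope calculation shows how quickly the store grows. The user count, vector widths, and numeric precisions below are illustrative assumptions, not figures disclosed by Pinterest, and the estimate ignores keys, replication, and storage overhead.

```python
def feature_store_bytes(num_users, dim, bytes_per_value):
    """Raw payload size: one dim-wide vector per user."""
    return num_users * dim * bytes_per_value


GB = 10 ** 9  # decimal gigabytes

NUM_USERS = 500_000_000  # assumed user count

# 256-dim float32 vector per user.
wide_fp32 = feature_store_bytes(NUM_USERS, dim=256, bytes_per_value=4)

# 64-dim float16 vector per user.
narrow_fp16 = feature_store_bytes(NUM_USERS, dim=64, bytes_per_value=2)

print(f"256-dim fp32: {wide_fp32 / GB:.0f} GB")
print(f" 64-dim fp16: {narrow_fp16 / GB:.0f} GB")
```

Under these assumptions the wide float32 layout needs 512 GB of raw payload and the narrow float16 layout 64 GB, an eightfold difference from vector width and precision alone.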

Training-serving parity for online context

Real-time context features exist only at serving time, yet the model has to learn how to use them during training. This forces training-side workarounds such as synthetic pseudo-context augmentation (Pinterest's choice) or large-scale logging of real online context for replay training.

Cold start

Users with no history have no cached offline state. The online portion alone must produce a usable embedding — the model needs to gracefully handle a zero / missing cached vector. Pinterest doesn't disclose its cold-start posture in this post.
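
A common way to handle a missing cached vector, shown here as an assumption since the post does not describe Pinterest's approach, is to substitute a zero vector and pass an explicit presence flag so the online head can tell "no history" apart from a genuinely all-zero state. The function names and dimensions are hypothetical.

```python
HIDDEN_DIM = 4  # size of the cached offline vector (illustrative)


def lookup_with_fallback(user_id, feature_store):
    """Return (vector, has_history); zero vector when nothing is cached."""
    vector = feature_store.get(user_id)
    if vector is None:
        return [0.0] * HIDDEN_DIM, 0.0
    return list(vector), 1.0


def build_head_input(user_id, context, feature_store):
    """Concatenate cached state, presence flag, and real-time context."""
    cached, has_history = lookup_with_fallback(user_id, feature_store)
    return cached + [has_history] + list(context)


feature_store = {"known_user": [0.2, -0.1, 0.4, 0.0]}
context = [1.0, 0.0]

known_input = build_head_input("known_user", context, feature_store)
cold_input = build_head_input("new_user", context, feature_store)
```

For this to help, the model would also need to see flagged zero-vector examples during training, for instance by randomly masking the cached state for some users.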

Caveats

  • Single instance on the wiki. Pinterest's contextual sequential CG is the only documented case. Other large-scale recsys teams likely use similar patterns under different names; the named primitive is not standard nomenclature.
  • Refresh-cadence freshness impact unquantified. Pinterest doesn't disclose the candidate-quality cost of stale-by-up-to-24-hour cached state.
  • Split granularity is a design decision. The intermediate representation (where the offline / online boundary lands) directly determines what features each side can use; finer-grained splits give more flexibility but more complexity. Pinterest splits at "last hidden state of the Transformer" — coarse, simple, all-history vs all-context.

Seen in
