PATTERN Cited by 1 source
Hybrid offline/online user tower inference¶
Pattern¶
Architect the user (or query) tower of a two-tower retrieval / ranking model so its forward pass splits into an offline-batched portion and an online portion:
- Offline portion: a heavy, history-dependent encoder (a Transformer over a long user-action sequence, or an analogous expensive model) runs in a daily or hourly batch. Its output (typically the last hidden state) is cached in a feature store, keyed by user.
- Online portion: a lightweight head runs at every retrieval request. It consumes (a) the cached offline output via feature-store lookup and (b) real-time context features that exist only at request time (current page, current query, current session state), and produces the final user embedding.
Then dot-product the user embedding against precomputed item embeddings (via the standard two-tower ANN-index path) to score candidates.
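The split described above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not Pinterest's implementation: `heavy_encoder` is a mean-pooling stand-in for the Transformer, the feature store is a plain dict, the head is a fixed linear layer rather than a learned context layer plus MLP, and scoring is brute force rather than an ANN lookup. All names and the width `DIM` are hypothetical.

```python
DIM = 4  # hypothetical embedding width, chosen only for this sketch

def heavy_encoder(history):
    """Stand-in for the offline Transformer: mean-pools the event vectors."""
    return [sum(event[i] for event in history) / len(history) for i in range(DIM)]

def run_offline_batch(histories, feature_store):
    """Batch job: encode each user's history and cache the result by user id."""
    for user_id, history in histories.items():
        feature_store[user_id] = heavy_encoder(history)

def online_user_embedding(user_id, context, feature_store, head_weights):
    """Request path: fuse the cached state with request-time context."""
    cached = feature_store[user_id]      # feature-store lookup
    fused = cached + context             # list concatenation, length 2 * DIM
    return [sum(w * x for w, x in zip(row, fused)) for row in head_weights]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data: one user with two history events, two candidate items.
feature_store = {}
run_offline_batch({"u1": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]}, feature_store)

# Fixed linear head that adds cached[i] and context[i]; a real head is learned.
head_weights = [[1.0 if j in (i, i + DIM) else 0.0 for j in range(2 * DIM)]
                for i in range(DIM)]

item_embeddings = {"item_a": [1.0, 0.0, 0.0, 0.0], "item_b": [0.0, 0.0, 1.0, 0.0]}
context = [0.0, 0.0, 1.0, 0.0]   # e.g. features of the page being viewed

user_emb = online_user_embedding("u1", context, feature_store, head_weights)
scores = {item: dot(user_emb, emb) for item, emb in item_embeddings.items()}
best_item = max(scores, key=scores.get)
```

In this toy run the request-time context pushes `item_b` ahead of `item_a`, even though the cached history alone has no weight on that dimension. The expensive function is only ever called from the batch job.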
Why use it¶
Two-tower retrieval has historically forced a binary choice for user-side features:
- Fully offline user tower: cheap to serve, but stale and session-blind. Real-time intent signals (the page the user is on, the query they typed) cannot be consumed.
- Fully online user tower: real-time aware, but unaffordable when the user tower contains a heavy historical encoder (Transformer over long sequences) at retrieval-stage QPS.
The hybrid pattern is the structural answer when you need both: a heavy historical encoder that can't run online, and real-time context features that can't be batched.
Pinterest's framing for its Contextual Sequential candidate generator (CG) (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models):
"Given that the context features (e.g., subject Pin features) are only known at the ad request time (online), we adopted a hybrid model inference approach. (1) Offline Inference: The majority of the user tower (the Transformer encoder) is inferred offline, and the last hidden state of the transformer (the encoded representations of the event sequence) is stored in the feature store. This is refreshed on a daily basis for users with new offsite activity. (2) Online Inference: The remaining part of the user tower — the context layer and the final MLP head — is computed online at serving time, taking the real-time context features and the pre-computed offline user signal as inputs."
Mechanism¶
Offline batch (daily, per user with new activity)
───────────────────────────────────────────────
user offsite history sequence
           │
           ▼
[ Transformer encoder ]   ← heavy, expensive
           │
           ▼
last hidden state ──────────► feature store
                              (keyed by user_id)

Online (per ad-request)
──────────────────────
feature-store lookup(user_id) ──► cached user state ──┐
                                                      │
subject Pin features ─────────────────────────────────┤
user demographics ────────────────────────────────────┤
                                                      ▼
                                              [ context layer ]
                                                (online-only)
                                                      │
                                                      ▼
                                             [ final MLP head ]
                                                      │
                                                      ▼
                                               user embedding
                                                      │
                                       (standard two-tower from here:
                                        dot product against item embeddings
                                        via ANN index)
The split point, the last hidden state of the Transformer, is the architectural hinge. Everything before it runs offline; everything after it runs online. The cached vector is the contract between the two halves.
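The offline half refreshes only users with new activity, which amounts to comparing each user's latest activity time against the time their cache entry was last written. A minimal sketch of that selection step follows, assuming per-user timestamps are available. The function names, the dict-based stores, and `toy_encode` are hypothetical placeholders.

```python
from datetime import datetime, timedelta

def users_needing_refresh(last_activity, last_refresh):
    """Return users whose latest activity is newer than their cached state."""
    stale = []
    for user_id, activity_time in last_activity.items():
        refreshed_at = last_refresh.get(user_id)   # None if never encoded
        if refreshed_at is None or activity_time > refreshed_at:
            stale.append(user_id)
    return sorted(stale)

def refresh_cache(feature_store, last_refresh, histories, encode, user_ids, now):
    """Re-encode the selected users and record when each entry was written."""
    for user_id in user_ids:
        feature_store[user_id] = encode(histories[user_id])
        last_refresh[user_id] = now

def toy_encode(history):
    """Placeholder for the heavy encoder; returns a 1-d 'embedding'."""
    return [float(len(history))]

now = datetime(2026, 5, 8)
last_activity = {
    "u1": now - timedelta(hours=2),   # active since the last refresh
    "u2": now - timedelta(days=3),    # no new activity
    "u3": now - timedelta(hours=1),   # never encoded before
}
last_refresh = {"u1": now - timedelta(days=1), "u2": now - timedelta(days=1)}
histories = {"u1": [0.2, 0.4], "u2": [0.9], "u3": [0.1, 0.1, 0.1]}
feature_store = {"u1": [0.0], "u2": [0.9]}

stale_users = users_needing_refresh(last_activity, last_refresh)
refresh_cache(feature_store, last_refresh, histories, toy_encode, stale_users, now)
```

Only `u1` and `u3` are re-encoded here; `u2` keeps its existing cache entry, which is what keeps the batch cost proportional to active users rather than to all users.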
When to use¶
- Heavy history-dependent encoder + real-time context features in the same user tower. The classic forcing function.
- Retrieval-stage QPS too high for full online user-tower forward. If you can afford to run the whole tower online (low QPS, small tower, generous latency budget), do that: hybrid adds complexity that is justified only by the serving cost it saves.
- User history is stable over your refresh interval. Pinterest refreshes daily because offsite-conversion sequences change slowly. If your history features change every minute, hybrid offers little: the cached vector goes stale immediately.
- Real-time context features are lightweight to encode online. The online portion has to fit in your retrieval-stage latency budget. If real-time context requires a heavy encoder at request time, hybrid still works but the savings shrink.
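The cost argument behind the QPS bullet can be made concrete by counting heavy-encoder forward passes: a fully online tower runs the encoder once per request, while the hybrid runs it once per refreshed user per batch. The sketch below does only that arithmetic. Both input numbers are hypothetical and are not Pinterest figures.

```python
def encoder_runs_per_day(requests_per_day, refreshed_users_per_day):
    """Count heavy-encoder forward passes under each serving strategy."""
    return {
        "fully_online": requests_per_day,     # one encoder pass per request
        "hybrid": refreshed_users_per_day,    # one encoder pass per refreshed user
    }

# Hypothetical traffic, for illustration only.
runs = encoder_runs_per_day(requests_per_day=2_000_000_000,
                            refreshed_users_per_day=50_000_000)
savings_factor = runs["fully_online"] / runs["hybrid"]
```

With these made-up inputs the hybrid runs the encoder 40 times less often. The ratio shrinks toward 1 as requests per refreshed user fall, which is the regime where running the whole tower online becomes the simpler choice.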
When not to use¶
- No real-time context features. Stay with the classic fully-offline-user-embedding-in-an-index pattern — simpler.
- Heavy historical encoder is already small enough to run online. Just run the whole tower online.
- Heavy historical encoder needs to react to the real-time context. Hybrid assumes the offline encoder produces a static vector that the online portion fuses with context. If the historical encoder itself must attend to context (e.g., a context-conditioned rewrite of the history), the offline-cache assumption breaks. Cross-attention fusion (Pinterest's proposed future work) remains compatible with hybrid, because the historical encoder stays offline and only the online layer attends over its outputs; full context-conditioning of the encoder does not.
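To illustrate why cross-attention fusion stays compatible with the split, the sketch below runs single-head dot-product attention online, with the request-time context as the query and offline-computed encoder outputs as keys and values. It is an assumption-laden illustration, not the design Pinterest proposes: there are no learned projections, and it assumes per-position encoder outputs are cached rather than only the last hidden state.

```python
import math

def cross_attention(query, hidden_states):
    """Single-head dot-product attention of one query over cached states."""
    scale = math.sqrt(len(query))
    logits = [sum(q * h for q, h in zip(query, state)) / scale
              for state in hidden_states]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * state[i] for w, state in zip(weights, hidden_states))
              for i in range(dim)]
    return pooled, weights

cached_states = [[1.0, 0.0], [0.0, 1.0]]   # per-position encoder outputs, computed offline
context_query = [10.0, 0.0]                # request-time context, same width as the states
pooled, weights = cross_attention(context_query, cached_states)
```

The encoder outputs never change at request time; only the weighting over them does. The trade-off is a larger cache entry per user, since a sequence of states replaces a single vector.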
Companion mechanisms¶
The hybrid pattern only works when paired with training-time mechanisms that handle the training-serving gap:
- patterns/synthetic-pseudo-context-from-label — real-time context features don't exist in training data; synthesise them from positive labels (or other training-time-available artefacts) so the model learns to consume them.
- patterns/high-dropout-on-augmented-feature-layer — high dropout on the context-consuming layer prevents the model from over-relying on the (label-leaked) synthetic context and abandoning the historical signal.
Without pseudo-context augmentation, the online portion has no learned consumer of context. Without high dropout, the model shortcuts to the synthetic signal and degrades at serving time, when only real (non-leaked) context is available. The triplet of hybrid inference, pseudo-context augmentation, and high dropout is what makes this pattern shippable.
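The two training-time companions can be sketched together: build a pseudo-context from the positive label's features, then regularise it heavily during training and leave it untouched at serving time. This is a simplified illustration under assumptions. Dropout is applied directly to the context vector for brevity, the rate of 0.5 is hypothetical, and the function names are invented for this example.

```python
import random

def pseudo_context_from_label(positive_item_features):
    """Training-time stand-in for request context: copy the positive item's features."""
    return list(positive_item_features)

def dropout(vector, rate, rng, training=True):
    """Inverted dropout: zero each element with probability `rate`, rescale survivors."""
    if not training or rate == 0.0:
        return list(vector)
    if rate >= 1.0:
        return [0.0] * len(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

rng = random.Random(0)   # seeded so the sketch is reproducible
context = pseudo_context_from_label([0.2, 0.4, 0.6, 0.8])

train_context = dropout(context, rate=0.5, rng=rng)                  # training path
serve_context = dropout(context, rate=0.5, rng=rng, training=False)  # serving path
```

Because the pseudo-context is derived from the label, a model that sees it unmasked on every example can learn to ignore history. Dropping a large share of it forces the head to keep using the cached historical signal.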
Hazards¶
Embedding / feature version skew¶
The cached user state may have been produced by one model checkpoint while the online context layer comes from another. This skew is functionally analogous to two-tower's classic item-tower vs query-tower checkpoint skew. Mitigations: rebuild the offline cache as part of every model release; canary the joint (cached + online-portion) deployment, not the online portion in isolation. See patterns/batch-embedding-for-index-consistency for the related pattern.
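One way to make skew detectable at request time is to store the producing checkpoint alongside each cached vector and refuse mismatches. The sketch below assumes such a version tag exists; the `CachedUserState` record, its field names, and the version strings are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class CachedUserState:
    vector: List[float]
    model_version: str   # checkpoint that produced the vector (assumed field)

def load_compatible_state(
    feature_store: Dict[str, CachedUserState],
    user_id: str,
    online_model_version: str,
) -> Optional[CachedUserState]:
    """Return the cached state only if it matches the online head's checkpoint."""
    state = feature_store.get(user_id)
    if state is None or state.model_version != online_model_version:
        return None      # caller decides on the fallback path
    return state

feature_store = {
    "u1": CachedUserState(vector=[0.1, 0.2], model_version="v7"),
    "u2": CachedUserState(vector=[0.3, 0.4], model_version="v6"),  # written by an older checkpoint
}

matched = load_compatible_state(feature_store, "u1", online_model_version="v7")
skewed = load_compatible_state(feature_store, "u2", online_model_version="v7")
```

A guard like this turns silent quality degradation into an explicit, countable fallback, which is useful signal during a canary.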
Cached state size¶
One cached vector per user, multiplied by hundreds of millions of monthly active users (MAU), adds up to a large feature store. Pinterest doesn't disclose the vector dimension or storage footprint.
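The raw footprint is the product of user count, vector dimension, and bytes per value. Since none of these figures are disclosed, the numbers in the sketch below are purely hypothetical; the result also ignores keys, replication, and storage overhead.

```python
def cache_size_bytes(num_users, vector_dim, bytes_per_value):
    """Raw payload size of one cached vector per user."""
    return num_users * vector_dim * bytes_per_value

# Hypothetical: 300M users, 256-d vectors, 2-byte (float16) values.
size = cache_size_bytes(num_users=300_000_000, vector_dim=256, bytes_per_value=2)
size_gb = size / 1e9
```

Under these assumptions the payload alone is about 153.6 GB. Doubling the dimension or moving to 4-byte floats doubles it, so vector width and precision are the main levers.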
Refresh-interval freshness ceiling¶
The cached portion has a staleness window equal to the refresh cadence. Pinterest accepts a daily cadence; tighter cadences cost more batch compute. The online context layer is what compensates for staleness in the historical signal.
Cold start¶
Users with no history have no cached vector. Either the online portion must produce a usable embedding alone, or a sentinel / fallback path is needed. Pinterest doesn't disclose cold-start posture.
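A sentinel path can be as simple as substituting a shared default vector when the lookup misses, and flagging the request so the cold-start rate can be monitored. The sketch below is one hypothetical way to do this. Whether the default is a zero vector, a learned embedding, or something else is a design choice the source does not cover.

```python
def cached_state_or_default(feature_store, user_id, default_state):
    """Return the user's cached state, or a shared default for users with no history."""
    state = feature_store.get(user_id)
    if state is None:
        return list(default_state), True    # True marks the cold-start path
    return state, False

feature_store = {"u1": [0.5, 0.5, 0.0, 0.0]}
default_state = [0.0, 0.0, 0.0, 0.0]        # hypothetical sentinel: all zeros

warm_state, warm_is_cold = cached_state_or_default(feature_store, "u1", default_state)
new_state, new_is_cold = cached_state_or_default(feature_store, "u_new", default_state)
```

With a default in place, the online head still receives real request-time context for new users, so the embedding is driven by context alone until the next batch picks up their history.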
Comparison to other inference patterns¶
- Classic two-tower with ANN index: items precomputed, user tower fully online. Hybrid extends this by also precomputing part of the user tower.
- Fully online both towers: feasible only if QPS and tower size permit. Pinterest's L1 ranker (lower QPS than retrieval) runs query tower fully online but caches Pin embeddings in an index.
- Offline scoring (no real-time anything): the historical regime before hybrid, in which CGs produced static embeddings in batch with no real-time signal. Hybrid is the upgrade path.
Caveats¶
- Single named instance on the wiki. Pinterest is the only documented case under this exact name. Other large recsys teams almost certainly use similar patterns.
- Split granularity is a design surface. Pinterest splits at "last hidden state of the Transformer" — coarse, all-history offline / all-context online. Finer splits (e.g., partial Transformer offline, attention layer online) are theoretically possible but add complexity.
- Latency / cost envelope undisclosed. No p50/p99 latency for the online portion, and no comparison of cached-state size to a fully-offline baseline embedding.
Seen in¶
- 2026-05-08 Pinterest — Enhancing Ad Relevance (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models) — canonical wiki instance. The Contextual Sequential Two-Tower CG splits the user tower at the boundary between the offline-cached Transformer encoder output and the online context layer + MLP head, with daily refresh of cached state and request-time fusion with subject-Pin context features.
Related¶
- concepts/two-tower-architecture
- concepts/hybrid-tower-inference-split
- concepts/context-layer-in-two-tower
- concepts/real-time-context-feature
- concepts/embedding-version-skew
- patterns/synthetic-pseudo-context-from-label
- patterns/high-dropout-on-augmented-feature-layer
- patterns/batch-embedding-for-index-consistency
- systems/pinterest-contextual-sequential-cg
- systems/transformer