CONCEPT Cited by 1 source

Context layer in two-tower

Definition

A context layer is a tower-internal architectural component inside a two-tower retrieval / ranking model that consumes real-time, request-time signals (the current page, the search query, the immediate session, demographic features) and produces an embedding concatenated (or otherwise fused) with the output of the historical-sequence encoder before the final tower head. The context layer is the structural mechanism that makes a two-tower model session-aware without giving up the precomputed historical state that makes two-tower affordable.

Pinterest's canonical instantiation (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models):

"The model now concatenates the output of the original Transformer encoder (which represents historical sequence information) with the output of the new context layer. This combined representation is then fed into the final Multi-Layer Perceptron (MLP) to derive the final user embedding."
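The quoted flow can be sketched in a few lines of plain Python. This is a minimal illustration, not Pinterest's implementation: the layer sizes, the single dense-plus-ReLU context layer, the two-layer MLP, and every name below (`user_tower_head`, `init_layer`, `HIST_DIM`, and so on) are assumptions chosen for readability, since the post does not disclose these details.

```python
import random

def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; purely illustrative initialisation."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def user_tower_head(history_vec, context_features, context_layer, mlp_hidden, mlp_out):
    """Concatenate the history-encoder output with the context-layer output,
    then apply the final MLP to obtain the user embedding."""
    context_vec = relu(linear(context_features, *context_layer))
    fused = history_vec + context_vec  # list concatenation acts as vector concat
    hidden = relu(linear(fused, *mlp_hidden))
    return linear(hidden, *mlp_out)

# Hypothetical sizes; the source does not publish the real ones.
HIST_DIM, CTX_IN, CTX_DIM, HIDDEN, OUT = 8, 5, 4, 6, 3
rng = random.Random(0)
context_layer = init_layer(CTX_IN, CTX_DIM, rng)
mlp_hidden = init_layer(HIST_DIM + CTX_DIM, HIDDEN, rng)
mlp_out = init_layer(HIDDEN, OUT, rng)

history_vec = [rng.uniform(-1, 1) for _ in range(HIST_DIM)]       # stand-in for Transformer output
context_features = [rng.uniform(-1, 1) for _ in range(CTX_IN)]    # stand-in for request-time signals
user_embedding = user_tower_head(history_vec, context_features,
                                 context_layer, mlp_hidden, mlp_out)
```

The only structural point the sketch tries to capture is that the MLP head's input width is the sum of the two upstream widths, so the history encoder itself is untouched by the addition of the context layer.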

Why it matters

Pure two-tower models historically faced a binary choice for user-side features: offline-precomputable (cheap to serve, but stale and session-blind) vs online-computed (expensive, but fresh and session-aware). The context layer is the architectural seam that lets you have both — the bulk of the user tower remains precomputable, while a thin context layer handles the request-time-only signals.
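A rough sketch of that seam, under stated assumptions: the heavy encoder runs in an offline batch job and its output is stored per user, while the request path only looks up that stored vector and runs the thin context layer and head. The functions below are toy stand-ins (a mean instead of a Transformer, a scaling instead of a learned layer, L2 normalisation instead of an MLP), and all names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]

def expensive_history_encoder(events: List[Vector]) -> Vector:
    """Stand-in for the Transformer encoder: here just a mean over event vectors."""
    n = len(events)
    return [sum(col) / n for col in zip(*events)]

def precompute_history(user_events: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Offline batch job: run the heavy encoder once per user and store the result."""
    return {uid: expensive_history_encoder(ev) for uid, ev in user_events.items()}

def context_layer(request_features: Vector) -> Vector:
    """Thin online layer: a toy scaling followed by ReLU."""
    return [max(0.0, 0.5 * v) for v in request_features]

def final_head(fused: Vector) -> Vector:
    """Stand-in for the final MLP: L2-normalise the fused vector."""
    norm = sum(v * v for v in fused) ** 0.5 or 1.0
    return [v / norm for v in fused]

def serve_user_embedding(user_id: str, request_features: Vector,
                         history_store: Dict[str, Vector]) -> Vector:
    history_vec = history_store[user_id]            # lookup only, no encoder call
    context_vec = context_layer(request_features)   # computed per request
    return final_head(history_vec + context_vec)

history_store = precompute_history({"u1": [[1.0, 0.0], [0.0, 1.0]]})
on_page_a = serve_user_embedding("u1", [2.0, 0.0], history_store)
on_page_b = serve_user_embedding("u1", [0.0, 2.0], history_store)
```

The same stored history vector yields different user embeddings on the two pages, which is the session awareness the paragraph above describes, at the cost of one lookup plus a small amount of online compute.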

Without a context layer, two-tower retrieval on contextual surfaces (the page the user is on, the query they just typed) suffers a collapse in candidate survival rate: candidates retrieved without session awareness lose to downstream rankers that do see context. Pinterest reports that, before the context layer was added, this amounted to "less than 1% of impressions on Related Pins".

Structural shape

The context layer's signature in Pinterest's design:

historical encoder output ──┐
                            ├── concatenate ──→ final MLP head → user embedding
context layer output ───────┘
        ▲
        │
real-time features (subject Pin embeddings, demographics, ...)

The context layer's input on Related Pins: "the aggregated embedding representations of the top interest categories of the subject Pin, weighted by their confidence scores." User demographics (age, country, gender) are also incorporated.
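One plausible reading of the quoted aggregation is a confidence-weighted average over the top-k interest categories. The sketch below assumes exactly that; the choice of a normalised average (rather than a plain weighted sum), the value `top_k=3`, and the function name are illustrative assumptions, not details from the post.

```python
from typing import List, Tuple

Vector = List[float]

def aggregate_interest_embeddings(categories: List[Tuple[Vector, float]],
                                  top_k: int = 3) -> Vector:
    """Confidence-weighted average of the top_k category embeddings.

    `categories` holds (embedding, confidence) pairs for the subject Pin.
    """
    if not categories:
        raise ValueError("at least one interest category is required")
    top = sorted(categories, key=lambda c: c[1], reverse=True)[:top_k]
    dim = len(top[0][0])
    total = sum(conf for _, conf in top)
    if total == 0:
        return [0.0] * dim  # no confident category: fall back to a zero vector
    return [sum(emb[i] * conf for emb, conf in top) / total for i in range(dim)]

# Hypothetical subject Pin with four candidate interest categories.
categories = [
    ([1.0, 0.0], 0.6),
    ([0.0, 1.0], 0.2),
    ([1.0, 1.0], 0.2),
    ([5.0, 5.0], 0.05),  # lowest confidence, dropped by top_k=3
]
pooled = aggregate_interest_embeddings(categories)
```

In a fuller model the pooled vector would be combined with encoded demographic features before entering the context layer; how Pinterest encodes those is not described in the source.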

Companion mechanisms

A context layer is structurally meaningful only when paired with:

  1. A training scheme that handles the training-serving gap. Real-time context exists only at serving time. Pinterest uses synthetic pseudo-context derived from the positive label.
  2. Regularisation against over-reliance. With label-derived pseudo-context, the model can shortcut to the leaked feature. High dropout on the context layer during training forces the model to keep using historical signal.
  3. A serving split. The context layer runs online; the heavy historical encoder runs offline. See concepts/hybrid-tower-inference-split / patterns/hybrid-offline-online-user-tower-inference.

These three are not optional add-ons. Drop any one and the context layer, respectively, cannot be trained, overfits to the synthetic context, or cannot be served at acceptable cost.
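The first two mechanisms can be illustrated together. In this sketch the pseudo-context is simply the positive item's interest embedding, and dropout is applied directly to the context vector for brevity. The field name `interest_embedding`, the rate of 0.7, and the helper names are hypothetical; the source says only that the dropout is "high" and that the pseudo-context is derived from the positive label.

```python
import random
from typing import Dict, List

Vector = List[float]

def pseudo_context_from_label(positive_item: Dict[str, Vector]) -> Vector:
    """Training-time stand-in for real-time context: reuse features of the positive item."""
    return list(positive_item["interest_embedding"])

def dropout(vec: Vector, rate: float, rng: random.Random, training: bool) -> Vector:
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def build_training_input(history_vec: Vector, positive_item: Dict[str, Vector],
                         rate: float, rng: random.Random) -> Vector:
    """Fuse history with heavily dropped-out pseudo-context, so the model
    cannot rely on the label-derived feature alone."""
    context_vec = dropout(pseudo_context_from_label(positive_item), rate, rng, training=True)
    return history_vec + context_vec

rng = random.Random(0)
item = {"interest_embedding": [1.0, 2.0, 3.0, 4.0]}   # hypothetical positive label features
train_input = build_training_input([0.1, 0.2], item, rate=0.7, rng=rng)

# At serving time the real request context is used and dropout is disabled.
serve_context = dropout([0.5, 0.5, 0.0, 1.0], rate=0.7, rng=rng, training=False)
```

Because the pseudo-context is derived from the label, an undamped model could learn to copy it; zeroing most of the context units on most training steps is one way to keep gradient flowing through the historical signal.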

Concatenation vs cross-attention fusion

Pinterest's shipped design uses concatenation of context-layer output with Transformer output. The post explicitly proposes cross-attention fusion as future work:

"We propose using cross-attention-based fusion, where the context layer embedding acts as the query and the sequence of encoded transformer outputs serves as the key/value. This approach will allow the final user-tower embedding to dynamically capture the importance of each history event based on the real-time context."

Cross-attention goes a step beyond concatenation: the context tells the model which historical events matter most for the current request. Concatenation hands the MLP head a history summary that was computed without reference to the context; cross-attention re-weights the history sequence per request, conditioned on the context.
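The proposed fusion can be sketched as single-head scaled dot-product attention with the context embedding as the query and the encoded history events as keys and values. This is an assumption-laden simplification: a real layer would normally add learned query, key and value projections and possibly several heads, and the post does not specify any of that since the design is future work. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_fuse(context_query: Vector,
                         history_seq: List[Vector]) -> Tuple[Vector, List[float]]:
    """Context is the query; each encoded history event is both key and value.

    Returns the context-weighted history summary and the attention weights.
    """
    d = len(context_query)
    scores = [sum(q * k for q, k in zip(context_query, event)) / math.sqrt(d)
              for event in history_seq]
    weights = softmax(scores)
    fused = [sum(w * event[i] for w, event in zip(weights, history_seq))
             for i in range(d)]
    return fused, weights

# Three hypothetical encoded history events in a 2-d space.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention_fuse([2.0, 0.0], history)
```

With the query pointing along the first axis, the first and third events receive more weight than the second, so the summary of the same history shifts with the request context instead of staying fixed.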

Comparison to other tower-internal architectural primitives

  • Parallel DCNv2 + MLP cross layers (Pinterest shopping conversion CG): also a tower-internal composition, but the purpose is feature crossing (DCNv2 explicit crosses + MLP implicit patterns), not real-time-context fusion. Different primitive, similar shape.
  • Surface-specific tower trees (Pinterest ads engagement model): the surface signal is structurally encoded as a tree branch rather than as a context-layer input. Different decomposition; trees split the model, context layers fuse signals within a single user tower.

Caveats

  • Single instance on the wiki. Pinterest is the only documented instance of a named "context layer" in a two-tower model. Other companies likely use similar mechanisms (real-time feature vectors fused into the user tower), but "context layer" is not standard nomenclature.
  • Context-layer dimensions undisclosed. Pinterest doesn't publish the layer width / depth / activation.
  • Concatenation is the simplest fusion choice. Cross-attention, gated fusion, and FiLM-style modulation are all plausible alternatives the post acknowledges but doesn't compare.

Seen in
