CONCEPT Cited by 5 sources

Two-tower architecture

Definition

A two-tower (or dual-encoder) model is a retrieval / ranking architecture with two independent neural encoders:

  • A query tower that encodes the user, context, or request into a fixed-dimension vector.
  • An item tower (Pin tower, document tower, candidate tower) that encodes each candidate item into a fixed-dimension vector.

A dot product (or cosine similarity) between the two vectors produces the score used to rank or narrow candidates. The architecture is the de facto standard for large-scale retrieval and early-ranking stages at companies like Pinterest, Meta, Google, YouTube, and TikTok.
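As a minimal sketch of the scoring step, the plain-Python example below replaces each tower with a single hypothetical linear layer and ranks two items by dot product. The names `linear_tower`, `QUERY_W`, and `ITEM_W` and all weights are illustrative assumptions, not taken from any of the cited systems.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product after L2-normalising both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def linear_tower(weights, features):
    """Hypothetical stand-in for a tower: one linear layer, no bias."""
    return [dot(row, features) for row in weights]


# Toy weights (assumed): each tower maps its own feature space into a
# shared 2-dimensional embedding space.
QUERY_W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # 3 query features -> 2 dims
ITEM_W = [[0.5, 1.0], [1.0, -0.5]]            # 2 item features  -> 2 dims

query_vec = linear_tower(QUERY_W, [1.0, 2.0, 0.0])
item_vecs = {
    "item_a": linear_tower(ITEM_W, [2.0, 0.0]),
    "item_b": linear_tower(ITEM_W, [0.0, 2.0]),
}

# The towers never see each other's inputs; they meet only in this score.
scores = {name: dot(query_vec, vec) for name, vec in item_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The two towers take differently shaped inputs (three query features, two item features) and only need to agree on the output dimension. Swapping `dot` for `cosine` changes the score scale but not the structure.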

SilverTorch face — two towers inside one model graph (2026-05-26)

Meta's SilverTorch post (Source: sources/2026-05-26-meta-silvertorch-index-as-model-a-new-retrieval-paradigm-for-recommendation-systems) is not a rejection of the two-tower asymmetry; it is a substrate change. The two-tower economics that make the architecture affordable (item embeddings pre-computed once, query embedding computed once per request, similarity as cheap as a dot product) are preserved. What changes is where those tensors live:

  • Pre-SilverTorch (microservice mesh): item embeddings live in a separately-deployed ANN-index service, queried by RPC from the user-tower service.
  • SilverTorch (Index as Model): item embeddings live as a tensor inside the retrieval model itself, alongside the user tower, eligibility filter, and scoring layer. The index lookup runs as one region of the forward pass.

The two-tower asymmetric pre-compute property is preserved at the model-graph level — item embeddings are still produced offline (or via streaming in-place updates) and the user query embedding is still computed once per request — but the cross-service hop disappears. See the patterns/unified-pytorch-model-as-retrieval-system pattern.

This face also resolves the embedding version skew failure mode catalogued in the structural-hazards section below: when both towers and the index live in one model graph, there is no v1-vs-v2-across-services question — there is one model, one cadence, one source of truth (with streaming weight updates handling freshness at sub-snapshot granularity).
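The "index lives inside the model" idea can be sketched as a single object whose forward pass contains the user tower, the eligibility filter, and the scoring step. This is a toy in plain Python under stated assumptions: the class name `IndexAsModel`, its methods, and the linear user tower are hypothetical and do not reflect SilverTorch's actual API or kernels.

```python
class IndexAsModel:
    """Toy retrieval 'model' that owns the item embeddings as a parameter."""

    def __init__(self, item_ids, item_matrix, user_weights):
        self.item_ids = list(item_ids)
        # The "index": one embedding row per item, held by the model itself.
        self.item_matrix = [list(row) for row in item_matrix]
        self.user_weights = user_weights

    def update_item(self, item_id, new_vec):
        """In-place row update, standing in for streaming weight updates."""
        self.item_matrix[self.item_ids.index(item_id)] = list(new_vec)

    def forward(self, user_features, eligible, k):
        # Region 1: user tower (a single linear layer in this sketch).
        q = [sum(w * x for w, x in zip(row, user_features))
             for row in self.user_weights]
        scored = []
        for item_id, vec in zip(self.item_ids, self.item_matrix):
            # Region 2: eligibility filter, applied inside the same pass.
            if item_id not in eligible:
                continue
            # Region 3: scoring against the in-model item tensor.
            scored.append((sum(a * b for a, b in zip(q, vec)), item_id))
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:k]]


model = IndexAsModel(
    item_ids=["a", "b", "c"],
    item_matrix=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    user_weights=[[1.0, 0.0], [0.0, 1.0]],
)
everything = model.forward([2.0, 1.0], eligible={"a", "b", "c"}, k=2)
filtered = model.forward([2.0, 1.0], eligible={"a", "b"}, k=2)
model.update_item("b", [3.0, 3.0])  # freshness without a separate service
refreshed = model.forward([2.0, 1.0], eligible={"a", "b", "c"}, k=2)
```

There is no RPC boundary anywhere in `forward`, and `update_item` changes what the next request sees without redeploying a separate index.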

Why it's used

The two-tower shape matches the cost structure of large-scale recommendation:

  • Item embeddings are pre-computed. The item tower runs offline (or in a background indexing pipeline) and its output is stored in an ANN index for fast retrieval. Once indexed, scoring N candidates at request time reduces to N dot products against a single query embedding — not N full model forward passes.
  • Query embedding runs once per request. The query tower runs on the live request to produce a single vector, then that vector is used to score / retrieve against the pre-computed index.
  • Asymmetric compute. Item-side computation is amortized across the catalog; query-side scales with request volume but not candidate volume.

This asymmetric decomposition is what makes two-tower affordable at Pinterest / YouTube / Meta scale where candidate pools are in the hundreds of millions to billions.
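The asymmetry can be made concrete by counting tower invocations. In the sketch below, `item_tower` and `query_tower` are hypothetical stand-ins that return pseudo-random vectors, and a brute-force scan stands in for the ANN index (an assumption made to keep the example self-contained).

```python
import heapq
import random

DIM = 8
calls = {"item_tower": 0, "query_tower": 0}


def item_tower(item_id):
    """Hypothetical item encoder: a deterministic pseudo-embedding."""
    calls["item_tower"] += 1
    rng = random.Random(item_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def query_tower(user_id):
    """Hypothetical query encoder: a deterministic pseudo-embedding."""
    calls["query_tower"] += 1
    rng = random.Random(f"user-{user_id}")
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def build_index(catalog):
    """Offline job: run the item tower exactly once per catalog item."""
    return {item_id: item_tower(item_id) for item_id in catalog}


def retrieve(index, user_id, k):
    """Online path: one query-tower pass, then one dot product per item."""
    q = query_tower(user_id)

    def score(item_id):
        return sum(a * b for a, b in zip(q, index[item_id]))

    # Brute-force scan; a real deployment would use an ANN index here.
    return heapq.nlargest(k, index, key=score)


index = build_index(range(1000))          # 1000 item-tower passes, once
for user_id in range(50):                 # 50 requests
    top = retrieve(index, user_id, k=10)  # 1 query-tower pass each
```

After the loop, `calls` records 1000 item-tower passes and 50 query-tower passes. Scoring each (request, item) pair with a full model would instead have needed 50,000 forward passes.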

Typical deployment shape

Training time
  ┌──────────────┐     ┌──────────────┐
  │ Query tower  │     │ Item tower   │
  │ (checkpoint) │     │ (checkpoint) │
  └──────┬───────┘     └──────┬───────┘
         │                    │
         └─────────┬──────────┘
     joint training on (query, item, label) triples

Serving time
  request → Query tower → query_vec ─┐
                                     ├─→ dot product → score → top-K
  indexed items → ANN index ─────────┘
                (item_vec populated by background job running Item tower
                 on periodic indexing snapshots)

Typical use cases (asymmetric in exactly the way two-tower is built for):

  • Retrieval — narrow millions → thousands of candidates.
  • Early-stage ranking (L1 at Pinterest) — narrow thousands → low-hundreds before the heavy L2 ranker.

Structural hazards

Embedding version skew

Because item embeddings are pre-computed into an index that updates on a slower cadence than the query tower, the two towers can end up running different model checkpoints at serving time. The item index might have embeddings from checkpoint X, X−1, X−2 scattered throughout (because large-tier index rebuilds can take days), while the query tower has already rolled forward to X+1. Dot products between misaligned checkpoints are not what training optimized.

Pinterest names this failure mode embedding version skew and found it to be a material cause of online-offline discrepancy, more severe for complex model families like DHEN than for simpler variants (sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr).

Mitigations:

  • Batch embedding for index consistency — build ANN indices from batch inference on a single checkpoint rather than streaming realtime-enriched embeddings that mix versions.
  • Version-skew sensitivity check — test each new model family against deliberate skew sweeps as part of readiness.
  • Limit skew-sensitive architectures. The more the model relies on subtle cross-tower interactions, the more skew hurts.
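A version-skew sensitivity check can be sketched as a sweep: hold the query tower at the current checkpoint, build the item side from progressively older checkpoints, and measure how much of the aligned top-K survives. The toy below is an assumption-heavy illustration, not Pinterest's procedure: each "checkpoint" simply rotates a 2-d embedding space by a further `DRIFT_DEG` degrees, so aligned towers always agree while skewed towers drift apart.

```python
import math

N_ITEMS, K, DRIFT_DEG = 360, 11, 3.0


def rotate(vec, degrees):
    """Rotate a 2-d vector counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = vec
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))


def tower(raw_vec, checkpoint):
    """Hypothetical tower: each checkpoint rotates the space a bit more."""
    return rotate(raw_vec, checkpoint * DRIFT_DEG)


# Toy catalog: one item per degree on the unit circle; the query sits at 0.
RAW_ITEMS = [(math.cos(math.radians(i)), math.sin(math.radians(i)))
             for i in range(N_ITEMS)]
RAW_QUERY = (1.0, 0.0)


def top_k(query_ckpt, item_ckpt):
    """Top-K item ids when the two towers run the given checkpoints."""
    q = tower(RAW_QUERY, query_ckpt)
    scored = []
    for item_id, raw in enumerate(RAW_ITEMS):
        v = tower(raw, item_ckpt)
        scored.append((q[0] * v[0] + q[1] * v[1], item_id))
    scored.sort(reverse=True)
    return {item_id for _, item_id in scored[:K]}


def skew_sweep(current_ckpt, max_skew):
    """Fraction of the aligned top-K kept when the index lags by s checkpoints."""
    baseline = top_k(current_ckpt, current_ckpt)
    return {s: len(top_k(current_ckpt, current_ckpt - s) & baseline) / K
            for s in range(max_skew + 1)}


sweep = skew_sweep(current_ckpt=10, max_skew=4)
```

In this toy the overlap falls steadily from 1.0 at zero skew to 0.0 at a lag of four checkpoints. Offline evaluation, which always pairs matching checkpoints, would never see the drop, because a shared rotation leaves every dot product unchanged.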

Feature pipeline divergence between towers

The item tower's feature pipeline (building embeddings into snapshots) is typically different from the query tower's feature pipeline (running live). Features available to training may not be available to both pipelines — a classic online-offline discrepancy root cause. Pinterest's L1 embedding path missed entire feature families (targeting-spec flags, conversion visit counts, image embeddings) that training logs included.

Expressive-power ceiling

Two-tower imposes a rigid late-interaction structure: towers don't see each other's features until the final dot product. Effective, but a weaker representation than full cross-attention models. The standard production compromise: use two-tower for retrieval + early ranking (where speed dominates), and a richer cross-encoder / deep network for final ranking (where quality dominates). This is exactly the retrieval → ranking funnel shape.

Training

Two-tower models are typically trained with contrastive losses over (query, positive item, negative items) triples: the towers learn to produce vectors whose dot product is high for true positives and low for negatives. Negative sampling strategy (in-batch negatives, hard negatives, mixed negatives) is a central design axis.
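One common way to realise in-batch negatives is a softmax cross-entropy over the batch's score matrix, where row i's positive is item i and every other item in the batch acts as a negative. The sketch below assumes that formulation (the cited sources do not specify the exact loss); the `temperature` parameter is likewise an illustrative assumption.

```python
import math


def in_batch_softmax_loss(query_vecs, item_vecs, temperature=1.0):
    """Mean cross-entropy with item i as the positive for query i and
    all other items in the batch as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, v)) / temperature
                  for v in item_vecs]
        # Numerically stable log-sum-exp over the row of logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[1.0, 0.0], [0.0, 1.0]]   # positives match their queries
swapped_items = [[0.0, 1.0], [1.0, 0.0]]   # positives point the wrong way

aligned_loss = in_batch_softmax_loss(queries, aligned_items)
swapped_loss = in_batch_softmax_loss(queries, swapped_items)
```

`aligned_loss` is lower than `swapped_loss`: the loss rewards embeddings whose dot product with the positive exceeds the dot products with the other items in the batch. Hard or mixed negatives would add extra columns to each row of logits.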

Alternatives

  • Cross-encoders. Single model that takes (query, item) jointly and produces a score. More expressive; not amenable to pre-computed indices; too expensive for retrieval / early ranking.
  • Multi-tower / N-tower. Extension with more than two encoders (e.g., separate towers for user, context, item, ad-advertiser); same dot-product structure, more factored signals.
  • ColBERT-style late interaction. Intermediate between two-tower and cross-encoder — multiple vectors per side, late interaction at token level. More expressive, higher cost.
  • Generative retrieval. A paradigm alternative — abandon scoring entirely and replace it with autoregressive token generation over a Semantic ID codebook. See Generative-retrieval divergence below.
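To illustrate the difference between single-vector scoring and ColBERT-style late interaction, the sketch below compares a dot product over mean-pooled vectors with the MaxSim score (for each query token vector, take the best-matching item token vector, then sum). The token vectors and the use of mean pooling as the single-vector baseline are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(token_vecs):
    """Collapse a list of token vectors into one vector (assumed pooling)."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]


def single_vector_score(query_tokens, item_tokens):
    """Two-tower style: one vector per side, one dot product."""
    return dot(mean_pool(query_tokens), mean_pool(item_tokens))


def maxsim_score(query_tokens, item_tokens):
    """Late interaction: sum over query tokens of the best item-token match."""
    return sum(max(dot(q, d) for d in item_tokens) for q in query_tokens)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
item_x = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
item_y = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

pooled = (single_vector_score(query_tokens, item_x),
          single_vector_score(query_tokens, item_y))
late = (maxsim_score(query_tokens, item_x),
        maxsim_score(query_tokens, item_y))
```

Here the pooled scores tie while MaxSim separates the two items. The price is storing and comparing several vectors per item instead of one.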

Generative-retrieval divergence — Instacart 2026-06 (sibling-paradigm-divergence axis)

Instacart's 2026-06-02 source (sources/2026-06-02-instacart-from-scoring-to-spelling-rebuilding-ads-retrieval-at-instacart) is the wiki's first canonical disclosure of a production system that abandons two-tower retrieval entirely, not as an optimisation but as a paradigm shift:

"We rebuilt the system, by moving from an encoder that scores products to a generative model that spells them out, token by token."

The prior CR system was BERT-family: sequence-model scoring with a single probability-distribution output, structurally close to (though not exactly in) the two-tower family in that both score a fixed vocabulary at request time. The successor, generative ads retrieval, decodes Semantic IDs token by token via beam search; there is no item tower, no ANN index, and no dot-product scoring. The asymmetric pre-compute property that justified two-tower's serving economics is not preserved; instead, the autoregressive decoder runs at request time over a compressed codebook, with the GPU substrate absorbing the cost.
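Token-by-token decoding over Semantic IDs can be sketched with a toy beam search. Everything here is hypothetical: the four-item catalog, the two-token IDs, and `DECODER_TABLE`, a fixed probability table that stands in for a trained autoregressive decoder (and ignores the request context). It is not Instacart's model.

```python
import math

# Hypothetical catalog: item -> Semantic ID (a short codeword sequence).
SEMANTIC_IDS = {
    "oat_milk": (0, 1),
    "almond_milk": (0, 2),
    "espresso_beans": (1, 0),
    "green_tea": (1, 2),
}
ID_TO_ITEM = {sid: item for item, sid in SEMANTIC_IDS.items()}

# Stand-in for the decoder: P(next codeword | prefix), assumed values.
DECODER_TABLE = {
    (): {0: 0.6, 1: 0.4},
    (0,): {1: 0.55, 2: 0.45},
    (1,): {0: 0.9, 2: 0.1},
}


def beam_search(beam_width, length=2):
    """Keep the beam_width most likely prefixes at every decode step."""
    beams = [(0.0, ())]  # (cumulative log-probability, prefix)
    for _ in range(length):
        candidates = []
        for logp, prefix in beams:
            for token, prob in DECODER_TABLE[prefix].items():
                candidates.append((logp + math.log(prob), prefix + (token,)))
        beams = sorted(candidates, reverse=True)[:beam_width]
    # A completed codeword sequence names an item directly; no index lookup.
    return [ID_TO_ITEM[sid] for _, sid in beams]


greedy = beam_search(beam_width=1)
wider = beam_search(beam_width=2)
```

With a beam width of 1 the search commits to the likeliest first codeword and returns `oat_milk`; widening the beam to 2 recovers `espresso_beans`, whose full sequence is more probable overall. Beam width acts as a knob on how many distinct branches of the codebook are explored.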

Where two-tower ends and generative begins

Axis                   | Two-tower (incl. SilverTorch IaM)       | Generative retrieval (TIGER, Instacart 2026-06)
-----------------------|-----------------------------------------|------------------------------------------------
Item representation    | Embedding vector                        | Semantic ID (codeword sequence)
Item-side compute      | Pre-computed offline                    | None per-request (codebook is static)
Query-side compute     | Encoder forward pass                    | Autoregressive decoder + beam search
Score / select         | Dot-product top-K (or in-graph variant) | Beam search over decode steps
Tunable diversity      | Top-K                                   | Beam width + temperature
Cold-start (new items) | Hard (need index re-pop)                | Easy (codebook covers from day 1)
Vocabulary scaling     | Catalog-bounded                         | Codebook-bounded

Sibling, not replacement

Generative retrieval is a sibling paradigm to two-tower; both remain valid retrieval shapes for different contexts. Generative wins where:

  • The catalog is non-stationary (new items arrive faster than index re-pop).
  • Brand / category diversity matters more than surgical precision.
  • A GPU serving substrate is available.

Two-tower wins where:

  • Catalog is small / stationary.
  • Latency budgets are too tight for autoregressive decoding.
  • Item features don't admit a useful RQ-VAE codebook.
  • No GPU serving available.

The 2026-06 Instacart deployment is on browse surfaces specifically — "contexts where users are browsing rather than searching, and candidate diversity & contextual relevance matter more than surgical precision". Two-tower remains the right shape for narrow-intent search.

The Instacart pivot, viewed against the Meta SilverTorch 2026-05-26 source, gives the wiki two architecturally orthogonal alternatives to "score every item against the request":

  • SilverTorch (concepts/index-as-model) — keep two-tower asymmetric pre-compute; absorb the ANN index into the model graph as a tensor.
  • Instacart generative ads retrieval — abandon two-tower / ANN entirely; replace with autoregressive generation over a Semantic ID codebook.

Both are responses to the same structural failures of microservice scoring retrieval, but they take opposite paths: one keeps the scoring paradigm and relocates the index, the other drops scoring altogether.

Tower-internal architecture — parallel cross-and-deep networks

Within each tower, feature-crossing + deep learning can be composed either sequentially (input → DCNv2 → MLP → head) or in parallel (DCNv2 and MLP both consume the raw input, outputs concatenated). Pinterest validated the parallel arrangement on the shopping conversion candidate generation two-tower model (sources/2026-04-27-pinterest-from-clicks-to-conversions-architecting-shopping-conversion-candidate-generation):

  • +11% offline recall@1000 vs the sequential DCNv2 → MLP baseline.
  • Adopted by all Pinterest production engagement retrieval models after the conversion-CG validation — a retrieval-stage architectural primitive.

Pinterest's reasoning: the sequential shape creates an information bottleneck ("the MLP could only learn from features already processed by DCN v2, potentially losing valuable signals from the original input"); the parallel shape lets the cross network construct higher-order feature crosses "without any information being lost or distorted by a preceding MLP transformation" while the MLP "learns implicit abstract patterns in parallel". See patterns/parallel-dcn-mlp-cross-layers for the full pattern, concepts/parallel-cross-and-deep-network for the concept framing.

This is a per-tower architectural primitive. It does not violate two-tower's no-cross-tower-interaction discipline at retrieval time; it's about how each tower individually computes its embedding.
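The two arrangements can be contrasted in a few lines. The sketch below uses the DCN-v2 style cross layer, x_next = x0 * (W·x + b) + x with an elementwise product, and a one-layer ReLU MLP; the parameter values in `PARAMS` are toy assumptions, and the final head that would project each output to the embedding dimension is omitted.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_layer(x0, xl, W, b):
    """DCN-v2 style cross layer: x0 * (W @ xl + b) + xl, elementwise."""
    wx = matvec(W, xl)
    return [x0_i * (wx_i + b_i) + xl_i
            for x0_i, wx_i, b_i, xl_i in zip(x0, wx, b, xl)]


def mlp_layer(x, W, b):
    """One dense layer with ReLU."""
    return [max(0.0, h + b_i) for h, b_i in zip(matvec(W, x), b)]


def sequential_tower(x, p):
    """input -> cross -> MLP: the MLP only sees cross-processed features."""
    crossed = cross_layer(x, x, p["cross_w"], p["cross_b"])
    return mlp_layer(crossed, p["mlp_w"], p["mlp_b"])


def parallel_tower(x, p):
    """Cross and MLP both read the raw input; outputs are concatenated."""
    crossed = cross_layer(x, x, p["cross_w"], p["cross_b"])
    deep = mlp_layer(x, p["mlp_w"], p["mlp_b"])
    return crossed + deep  # list concatenation


PARAMS = {  # toy values, assumed for illustration
    "cross_w": [[1.0, 0.0], [0.0, 1.0]],
    "cross_b": [0.0, 0.0],
    "mlp_w": [[1.0, 1.0]],
    "mlp_b": [0.0],
}
x = [1.0, 2.0]
seq_out = sequential_tower(x, PARAMS)
par_out = parallel_tower(x, PARAMS)
```

In `parallel_tower` the last element of the output comes from the MLP reading the raw input directly, which is the signal path the sequential arrangement lacks.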

Tower-internal architecture — context layers and hybrid offline/online inference

A second tower-internal extension is the context layer: a component inside the user (query) tower that consumes request-time-only features (the page the user is on, the search query, the immediate session state) and fuses them with a separately-encoded historical-state signal before the final tower head. Pinterest's Contextual Sequential CG (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models) is the canonical wiki instance — a Transformer encoder over offsite-conversion history concatenated with a context layer reading subject-Pin interest-category embeddings + user demographics.

Adding a context layer creates a serving-time problem: the heavy historical encoder is too expensive to run online, but the context layer's input only exists online. The structural answer is a hybrid offline/online user tower inference (patterns/hybrid-offline-online-user-tower-inference):

  • Offline batch: run the heavy historical encoder, cache the last hidden state per user in the feature store (Pinterest refreshes daily).
  • Online: feature-store lookup of the cached state, fuse with real-time context features in the context layer, run the final MLP head.

This generalises the precomputed-tower pattern of two-tower: instead of just the item tower being precomputed, part of the user tower is also precomputed. The dot product against the ANN-indexed item embeddings is unchanged. Companion training mechanisms make the context layer trainable when real-time context isn't in logged training data: synthetic pseudo-context derived from positive labels, plus high dropout on the context layer to mitigate label leakage. Survival-rate result on Related Pins: 2x more retrieved candidates survive to a delivered impression than under the offline-only baseline.
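The offline/online split can be sketched as two functions around a shared cache. All names below (`FEATURE_STORE`, `heavy_history_encoder`, `final_head`, `HEAD_W`) are hypothetical; a mean over history vectors stands in for the Transformer encoder, and concatenation stands in for the context layer's fusion.

```python
FEATURE_STORE = {}                 # user_id -> cached hidden state
calls = {"history_encoder": 0}

# Toy head weights (assumed): maps the 4-d fused vector to a 2-d embedding.
HEAD_W = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]


def heavy_history_encoder(history):
    """Stand-in for the expensive sequence encoder; runs offline only."""
    calls["history_encoder"] += 1
    n = len(history)
    return [sum(col) / n for col in zip(*history)]


def offline_batch_job(histories):
    """Periodic batch: encode each user's history and cache the result."""
    for user_id, history in histories.items():
        FEATURE_STORE[user_id] = heavy_history_encoder(history)


def final_head(fused):
    """Cheap online layer applied after fusion."""
    return [sum(w * x for w, x in zip(row, fused)) for row in HEAD_W]


def online_user_embedding(user_id, context_features):
    """Request path: cache lookup, fuse with live context, run the head."""
    cached = FEATURE_STORE[user_id]
    fused = cached + context_features  # concatenation as the fusion step
    return final_head(fused)


offline_batch_job({"u1": [[1.0, 0.0], [3.0, 0.0]],
                   "u2": [[0.0, 2.0]]})
emb_page_a = online_user_embedding("u1", [0.0, 1.0])
emb_page_b = online_user_embedding("u1", [1.0, 0.0])
emb_other = online_user_embedding("u2", [0.0, 1.0])
```

Three requests are served with only the two encoder passes made offline, and the same user gets different embeddings on different pages because the context features change per request.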

Context layer (purpose: real-time intent fusion) and parallel-DCN+MLP (purpose: feature crossing) are structurally distinct primitives that can in principle compose within the same tower.
