

Two-tower architecture

Definition

A two-tower (or dual-encoder) model is a retrieval / ranking architecture with two independent neural encoders:

  • A query tower that encodes the user, context, or request into a fixed-dimension vector.
  • An item tower (Pin tower, document tower, candidate tower) that encodes each candidate item into a fixed-dimension vector.

A dot product (or cosine similarity) between the two vectors produces the score used to rank / narrow candidates. The architecture is the de facto standard for large-scale retrieval + early-ranking stages at companies like Pinterest, Meta, Google, YouTube, and TikTok.
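A minimal sketch of the scoring structure, with small linear maps standing in for the real (deep) towers; all shapes, weights, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "towers": each maps its own feature space into a shared d-dim space.
# Real towers are deep networks; linear maps keep the sketch self-contained.
d, q_feat, i_feat = 8, 5, 7
W_query = rng.normal(size=(q_feat, d))   # query tower parameters
W_item = rng.normal(size=(i_feat, d))    # item tower parameters

def query_tower(x):
    return x @ W_query

def item_tower(x):
    return x @ W_item

query = rng.normal(size=(q_feat,))       # one user/context/request
items = rng.normal(size=(100, i_feat))   # 100 candidate items

q_vec = query_tower(query)               # fixed-dimension query vector
item_vecs = item_tower(items)            # fixed-dimension item vectors

scores = item_vecs @ q_vec               # dot product = ranking score
top5 = np.argsort(-scores)[:5]           # highest-scoring candidates
```

The key structural point: the two towers never see each other's inputs; all interaction happens in the final dot product.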

Why it's used

The two-tower shape matches the cost structure of large-scale recommendation:

  • Item embeddings are pre-computed. The item tower runs offline (or in a background indexing pipeline) and its output is stored in an ANN index for fast retrieval. Once indexed, scoring N candidates at request time reduces to N dot products against a single query embedding — not N full model forward passes.
  • Query embedding runs once per request. The query tower runs on the live request to produce a single vector, then that vector is used to score / retrieve against the pre-computed index.
  • Asymmetric compute. Item-side computation is amortized across the catalog; query-side scales with request volume but not candidate volume.

This asymmetric decomposition is what makes two-tower affordable at Pinterest / YouTube / Meta scale where candidate pools are in the hundreds of millions to billions.
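The cost asymmetry can be made concrete by counting tower invocations in a toy serving loop (the towers here just emit random vectors; only the call counts matter):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
n_items, n_requests = 1_000, 50

# Count tower invocations to show the cost asymmetry (illustrative stand-ins).
calls = {"item_tower": 0, "query_tower": 0}

def item_tower(batch):
    calls["item_tower"] += len(batch)
    return rng.normal(size=(len(batch), d))

def query_tower(req):
    calls["query_tower"] += 1
    return rng.normal(size=(d,))

# Offline: item tower runs once per catalog item; output goes into the index.
index = item_tower(range(n_items))

# Online: each request costs one query-tower pass plus n_items dot products,
# never n_items full forward passes.
for r in range(n_requests):
    q = query_tower(r)
    scores = index @ q
    top_k = np.argsort(-scores)[:10]
```

Item-side work stays at one pass per catalog item regardless of traffic; query-side work stays at one pass per request regardless of catalog size.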

Typical deployment shape

Training time
  ┌──────────────┐     ┌──────────────┐
  │ Query tower  │     │ Item tower   │
  │ (checkpoint) │     │ (checkpoint) │
  └──────┬───────┘     └──────┬───────┘
         │                    │
         └──────────┬─────────┘
                    │
   joint training on (query, item, label) triples
Serving time
  request → Query tower → query_vec ─┐
                                     ├─→ dot product → score → top-K
  indexed items → ANN index ─────────┘
                (item_vec populated by background job running Item tower
                 on periodic indexing snapshots)

Typical use cases (asymmetric in exactly the way two-tower is built for):

  • Retrieval — narrow millions → thousands of candidates.
  • Early-stage ranking (L1 at Pinterest) — narrow thousands → low-hundreds before the heavy L2 ranker.
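The funnel above can be sketched as two successive top-k cuts over pre-computed item embeddings. Exact brute-force search stands in for a real ANN index here, and the L1 stage reuses the same dot-product scores where a real L1 model would re-score with richer features:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
catalog = rng.normal(size=(100_000, d))   # pre-computed item embeddings
q = rng.normal(size=(d,))                 # query tower output for one request

# Stage 1, retrieval: narrow the catalog to a few thousand candidates.
# Exact top-k stands in for an ANN index in this sketch.
s1 = catalog @ q
retrieved = np.argsort(-s1)[:2_000]

# Stage 2, early ranking (L1): narrow the retrieved set to low-hundreds
# before handing off to the heavy L2 ranker.
s2 = catalog[retrieved] @ q
ranked = retrieved[np.argsort(-s2)[:200]]
```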

Structural hazards

Embedding version skew

Because item embeddings are pre-computed into an index that updates on a slower cadence than the query tower, the two towers can end up running different model checkpoints at serving time. The item index might have embeddings from checkpoint X, X−1, X−2 scattered throughout (because large-tier index rebuilds can take days), while the query tower has already rolled forward to X+1. Dot products between misaligned checkpoints are not what training optimized.

Pinterest found this — named as embedding version skew — to be a material cause of online-offline discrepancy, and more severe for complex model families like DHEN than simpler variants (sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr).

Mitigations:

  • Batch embedding for index consistency — build ANN indices from batch inference on a single checkpoint rather than streaming realtime-enriched embeddings that mix versions.
  • Version-skew sensitivity check — test each new model family against deliberate skew sweeps as part of readiness.
  • Limit skew-sensitive architectures. The more the model relies on subtle cross-tower interactions, the more skew hurts.
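A version-skew sensitivity check could be sketched as follows. This is a toy simulation of the idea, not Pinterest's actual readiness test: two "checkpoints" are modeled as an item-embedding matrix plus a small drift, and sensitivity is measured as rank-correlation loss between aligned and skewed scoring:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 32, 500

# Toy "checkpoints": checkpoint X+1 is X plus a small parameter drift.
items_X = rng.normal(size=(n, d))            # index built from checkpoint X
items_X1 = items_X + 0.3 * rng.normal(size=(n, d))  # embeddings under X+1
q_X1 = rng.normal(size=(d,))                 # query embedding from X+1

aligned = items_X1 @ q_X1                    # both sides on X+1
skewed = items_X @ q_X1                      # index lags at X, query at X+1

def rank_corr(a, b):
    # Spearman-style correlation over score ranks.
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# High sensitivity flags a skew-fragile model family before rollout.
sensitivity = 1.0 - rank_corr(aligned, skewed)
```

Sweeping the drift magnitude (here fixed at 0.3) over the checkpoint lags actually seen in production would approximate the "deliberate skew sweeps" described above.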

Feature pipeline divergence between towers

The item tower's feature pipeline (building embeddings into snapshots) is typically different from the query tower's feature pipeline (running live). Features available to training may not be available to both pipelines — a classic online-offline discrepancy root cause. Pinterest's L1 embedding path missed entire feature families (targeting-spec flags, conversion visit counts, image embeddings) that training logs included.
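A feature-parity audit for this failure mode reduces to set arithmetic: compare what training logs contain against what each serving pipeline actually materializes. The feature names below are illustrative, loosely based on the families mentioned above:

```python
# Feature families the training logs include (illustrative names).
training_features = {
    "targeting_spec_flags", "conversion_visit_counts",
    "image_embedding", "user_activity", "item_category",
}

# What each serving-time pipeline actually materializes.
item_pipeline = {"image_embedding", "item_category"}   # batch snapshot path
query_pipeline = {"user_activity"}                     # live request path

served = item_pipeline | query_pipeline
missing = training_features - served   # trained-on but never served: skew risk
```

Any non-empty `missing` set is a direct online-offline discrepancy root cause of the kind Pinterest's L1 embedding path hit.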

Expressive-power ceiling

Two-tower imposes a rigid late-interaction structure: towers don't see each other's features until the final dot product. Effective, but a weaker representation than full cross-attention models. The standard production compromise: use two-tower for retrieval + early ranking (where speed dominates), and a richer cross-encoder / deep network for final ranking (where quality dominates). This is exactly the retrieval → ranking funnel shape.

Training

Two-tower models are typically trained with contrastive losses over (query, positive item, negative items) triples — the towers learn to produce vectors that dot-product high for true positives and low for negatives. Negative sampling strategy (in-batch negatives, hard negatives, mixed-negatives) is a central design axis.
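The in-batch-negatives variant can be sketched as a softmax cross-entropy over a batch score matrix (InfoNCE-style), with tower outputs replaced by random vectors for self-containment:

```python
import numpy as np

rng = np.random.default_rng(4)
B, d = 4, 8                               # batch of (query, positive) pairs

q = rng.normal(size=(B, d))               # query tower outputs
p = rng.normal(size=(B, d))               # item tower outputs for positives

# In-batch negatives: every other row's positive serves as a negative,
# so the B x B score matrix has the true pairs on the diagonal.
logits = q @ p.T

# Softmax cross-entropy with the diagonal as labels: push diagonal dot
# products high, off-diagonal dot products low.
logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
```

Hard-negative and mixed-negative schemes change which rows populate the off-diagonal, not the loss structure itself.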

Alternatives

  • Cross-encoders. Single model that takes (query, item) jointly and produces a score. More expressive; not amenable to pre-computed indices; too expensive for retrieval / early ranking.
  • Multi-tower / N-tower. Extension with more than two encoders (e.g., separate towers for user, context, item, ad-advertiser); same dot-product structure, more factored signals.
  • ColBERT-style late interaction. Intermediate between two-tower and cross-encoder — multiple vectors per side, late interaction at token level. More expressive, higher cost.
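The ColBERT-style MaxSim operator, sketched with random stand-ins for token embeddings, shows where the extra expressiveness and extra cost come from:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
# Multiple vectors per side (one per token) instead of one per tower.
query_toks = rng.normal(size=(6, d))      # 6 query-token vectors
item_toks = rng.normal(size=(40, d))      # 40 item-token vectors

# MaxSim: each query token takes its best-matching item token, then sum.
sim = query_toks @ item_toks.T            # (6, 40) token-pair similarities
score = sim.max(axis=1).sum()

# Two-tower is the degenerate case: one vector per side, so MaxSim
# collapses to a single dot product.
```

Cost scales with (query tokens × item tokens) per candidate rather than a single dot product, which is why this sits between two-tower and cross-encoders in the funnel.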
