Transformer (architecture)
Definition
The Transformer is the neural network architecture introduced by Vaswani et al. ("Attention Is All You Need", 2017, arXiv:1706.03762), built from stacked self-attention and feed-forward layers. It is the load-bearing architectural primitive under LLMs, modern video/audio encoders (MediaFM, wav2vec2), and long-user-sequence modeling in recsys/ads ranking.
This page is a minimal wiki stub; the canonical architecture is described extensively elsewhere. Pages on the wiki use "Transformer" in several distinct contexts — LLM serving, multimodal encoders, sequence encoders in ranking — each with its own operational profile.
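As a rough illustration of what "stacked self-attention + feed-forward layers" means, here is a minimal pure-Python sketch of a single-head, post-layer-norm encoder stack. It is a teaching sketch under simplifying assumptions, not the reference implementation: it omits multi-head splitting, positional encodings, bias terms, learned layer-norm parameters, masking, and dropout. All function names and the parameter dictionary layout (`wq`, `wk`, `wv`, `w1`, `w2`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def feed_forward(x, w1, w2):
    """Position-wise two-layer MLP with ReLU, applied to every token independently."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer norm (post-LN ordering)."""
    return [layer_norm([a + b for a, b in zip(r, s)]) for r, s in zip(x, sublayer_out)]


def encoder_layer(x, p):
    """One layer: self-attention sublayer, then feed-forward sublayer."""
    attn, _ = self_attention(x, p["wq"], p["wk"], p["wv"])
    x = add_and_norm(x, attn)
    return add_and_norm(x, feed_forward(x, p["w1"], p["w2"]))


def encode(x, layers):
    """Stack layers: each layer consumes the previous layer's output."""
    for p in layers:
        x = encoder_layer(x, p)
    return x


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_layer(d_model, d_ff, rng):
    """Randomly initialized weights for one layer (illustrative only)."""
    return {
        "wq": random_matrix(d_model, d_model, rng),
        "wk": random_matrix(d_model, d_model, rng),
        "wv": random_matrix(d_model, d_model, rng),
        "w1": random_matrix(d_model, d_ff, rng),
        "w2": random_matrix(d_ff, d_model, rng),
    }
```

Calling `encode(x, [init_layer(8, 16, rng) for _ in range(2)])` on a 5 x 8 input returns a 5 x 8 output: one contextualized vector per input position, which is the property the encoder-style uses on this page rely on.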
Use at Pinterest: long user sequence modeling
Pinterest's unified ads engagement model uses a Transformer over long user sequences as one component of the shared trunk (long-user-sequence modeling). The Transformer's outputs feed into a DCNv2 projection layer, then into downstream feature crossing + surface-specific tower trees.
The Pinterest post treats the long-sequence Transformer as a cost-heavy encoder whose outputs must be projected (via DCNv2) before downstream work, confirming the common ads-ranking pattern: Transformer-as-feature-encoder, not Transformer-as-ranker (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces).
The Pinterest post notes that long-sequence Transformers did not produce consistent gains when applied in isolation on one surface — they paid off only when integrated into a unified model trained on multi-surface combined features, where the broader feature distribution gave the Transformer enough signal diversity to clear its cost bar.
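The encoder-then-projection flow described in this section can be sketched as follows. This is an assumption-heavy illustration, not Pinterest's implementation: the post does not disclose how the sequence outputs are pooled or how the DCNv2 projection is configured. The sketch assumes mean pooling over the sequence, uses the standard DCN-V2 cross layer `x_{l+1} = x0 * (W x_l + b) + x_l` (elementwise product), and finishes with a hypothetical linear output projection `w_out`. All names are illustrative.

```python
def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mean_pool(seq_outputs):
    """Collapse per-position encoder outputs into one vector (assumed pooling choice)."""
    n = len(seq_outputs)
    return [sum(col) / n for col in zip(*seq_outputs)]


def cross_layer(x0, xl, w, b):
    """One DCN-V2 cross layer: x0 * (W xl + b) + xl, with elementwise multiplication."""
    wx = matvec(w, xl)
    return [x0_i * (wx_i + b_i) + xl_i for x0_i, wx_i, b_i, xl_i in zip(x0, wx, b, xl)]


def project_sequence(seq_outputs, cross_params, w_out):
    """Pool encoder outputs, apply stacked cross layers, then a linear projection.

    cross_params is a list of (W, b) pairs, one per cross layer.
    """
    x0 = mean_pool(seq_outputs)
    x = x0
    for w, b in cross_params:
        x = cross_layer(x0, x, w, b)
    return matvec(w_out, x)
```

The point of the sketch is the division of labor: the Transformer only produces a representation, and explicit feature crossing happens in the layers that consume it.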
Caveats
- Stub — the canonical Transformer architecture description is not fully documented here; the wiki references the architecture across many pages (MediaFM, Airbnb Destination Recommendation, various LLM-serving pages) each with context-specific detail.
- Topology is context-specific. Layer count, head count, hidden dim, sequence length depend on use case — Pinterest doesn't disclose the long-user-sequence Transformer's topology in the 2026-03-03 post.
- Pinterest's long-sequence variant is the subject of an earlier Pinterest blog post (User Action Sequence Modeling for Pinterest Ads Engagement Modeling) that has not been ingested into the wiki.
Seen in
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — long-sequence Transformer as ads-ranking feature encoder, output projected via DCNv2 before downstream crossing.
- 2026-04-27 Pinterest — From Clicks to Conversions (sources/2026-04-27-pinterest-from-clicks-to-conversions-architecting-shopping-conversion-candidate-generation) — Transformer used to encode sequential user-action data into a user-history embedding as one of the preference + historical user-side features in the shopping conversion candidate generation two-tower retrieval model. Confirms the shape: Transformer-as-user-history-encoder is Pinterest's canonical long-user-sequence mechanism across multiple ads-ML models.
- 2026-05-08 Pinterest — Enhancing Ad Relevance (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models) — third Pinterest role: offline-cached encoder in a hybrid offline/online user tower for the Contextual Sequential CG. The Transformer encodes the user's offsite-conversion history, and its last hidden state is cached daily in the feature store, then fused at request time with online-computed real-time context features (subject-Pin interest categories) before the final MLP head. This differs from the engagement-model use, where Transformer outputs are projected and fed online to surface-specific tower trees: here the Transformer runs offline only, and the online portion of the tower lives downstream of its cached output.
- 2026-02-23 Netflix — MediaFM (sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding) — BERT-style Transformer encoder over sequences of shots for multimodal media understanding.
- 2026-01-13 Redpanda — The convergence of AI and data streaming, Part 1 (sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls) — canonical for the frontier-LLM scaling curve: GPT-1 (117M, 2018) → GPT-5/5.1 (est. ~50T, 2025) = five orders of magnitude in eight years; 400K-token context window on GPT-5. Frames the three scaling brick walls (public-data exhaustion, training-cost growth, batch-training boundary). Names Transformer variants in production: MoE (GPT-4 = 8 × 220B, Gemini 1.5+, Grok-1+) vs Dense (Claude).
- Many other wiki pages reference Transformer as a building block (LLM serving, multimodal fusion, sequence modeling).
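The hybrid offline/online split described in the 2026-05-08 entry above can be sketched as below. This is a minimal illustration under stated assumptions, not Pinterest's code: `FeatureStore` is a hypothetical in-memory stand-in, `encode_sequence` is any callable that returns per-position hidden states, fusion is assumed to be plain concatenation, and the zero-vector fallback for users without a cached embedding is an assumption.

```python
class FeatureStore:
    """Hypothetical in-memory stand-in for a real feature store."""

    def __init__(self):
        self._rows = {}

    def put(self, user_id, embedding):
        self._rows[user_id] = list(embedding)

    def get(self, user_id, default):
        return self._rows.get(user_id, default)


def offline_daily_job(histories, encode_sequence, store):
    """Batch step: run the expensive encoder once per user and cache its last hidden state."""
    for user_id, history in histories.items():
        hidden_states = encode_sequence(history)
        store.put(user_id, hidden_states[-1])


def dense(x, w, b):
    """Fully connected layer: W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def online_user_tower(user_id, context_features, store, head, emb_dim):
    """Request-time step: no encoder call, only a cache lookup plus a small MLP head."""
    cached = store.get(user_id, [0.0] * emb_dim)  # assumed fallback for uncached users
    fused = list(cached) + list(context_features)  # assumed fusion: concatenation
    hidden = relu(dense(fused, head["w1"], head["b1"]))
    return dense(hidden, head["w2"], head["b2"])
```

Under this split, request-time cost is independent of sequence length, at the price of the cached history embedding being up to a day stale; the real-time context features are what carry the fresh signal.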
Related
- concepts/long-user-sequence-modeling
- concepts/mixture-of-experts — the sparse-routing Transformer variant, with per-token expert routing in LLMs (GPT-4, Gemini, Grok).
- concepts/dense-transformer — the single-stack variant (Claude).
- concepts/frontier-model-batch-training-boundary — the shared batch-training limitation across all frontier-LLM transformer variants.
- systems/pinterest-ads-engagement-model
- systems/pinterest-shopping-conversion-cg
- systems/pinterest-contextual-sequential-cg
- concepts/hybrid-tower-inference-split
- companies/redpanda