CONCEPT Cited by 1 source
Long user sequence modeling¶
Definition¶
Long user sequence modeling is the use of a Transformer encoder over a long history of user actions (viewed/liked/engaged items, queries, dwells, clicks) as a feature encoder in ads ranking / recommendation models. The Transformer summarises user behaviour into a fixed-dimensional representation that feeds into downstream CTR / engagement prediction.
The "long" in long-sequence distinguishes this from short-history approaches (last N actions, hand-engineered counts): modern recsys Transformers ingest hundreds to thousands of past actions per user, relying on attention to weight which historical actions matter for the current candidate.
Architectural role¶
user action history (long — hundreds to thousands of tokens)
│
▼
[ Transformer encoder — self-attention over sequence ]
│
(user sequence embedding)
│
▼
[ fused with candidate + context features → ranker ]
The Transformer output is typically a pooled or attended user embedding that downstream ranking layers concatenate with candidate-ad and context features.
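The flow above can be sketched in a few lines of NumPy. This is a minimal single-head, single-layer stand-in for the full Transformer encoder; every dimension, the random weights, and the mean-pooling choice are illustrative assumptions, not a disclosed design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_user_sequence(actions, Wq, Wk, Wv):
    """Self-attention over the action sequence, then mean-pool to a
    fixed-dimensional user embedding (single-head sketch)."""
    Q, K, V = actions @ Wq, actions @ Wk, actions @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    attended = softmax(scores) @ V            # (seq_len, d)
    return attended.mean(axis=0)              # pooled embedding, shape (d,)

rng = np.random.default_rng(0)
d = 16
actions = rng.normal(size=(500, d))           # 500 past actions, pre-embedded
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
user_emb = encode_user_sequence(actions, Wq, Wk, Wv)

candidate = rng.normal(size=(d,))             # candidate-ad embedding
ranker_input = np.concatenate([user_emb, candidate])  # fused for the ranker
```

The key structural point is that the sequence collapses to a fixed-width vector before fusion, so the ranker's input shape is independent of history length.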
Canonical wiki instance — Pinterest ads engagement model¶
Pinterest's unified ads engagement model uses a long-sequence Transformer as one trunk component (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces). The Transformer's outputs pass through a DCNv2 projection layer before downstream crossing + tower-tree layers — the projection is explicitly to "reduce serving latency while preserving signal", because the long-sequence Transformer produces wide output that would be expensive for downstream layers to consume directly.
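The latency argument can be made concrete with a toy sketch: project the wide Transformer output down before any crossing work happens. The dimensions below are hypothetical (Pinterest discloses neither), and the single cross step is only a schematic of a DCNv2-style layer, x1 = x0 * (W x0 + b) + x0.

```python
import numpy as np

rng = np.random.default_rng(1)
d_wide, d_small = 2048, 128                # hypothetical dims; not disclosed
wide = rng.normal(size=(d_wide,))          # wide long-sequence Transformer output
W_proj = rng.normal(size=(d_wide, d_small)) / np.sqrt(d_wide)
x0 = wide @ W_proj                         # projection: cheap for downstream layers

# One DCNv2-style cross step over the *projected* vector:
W = rng.normal(size=(d_small, d_small)) / np.sqrt(d_small)
b = np.zeros(d_small)
x1 = x0 * (W @ x0 + b) + x0                # elementwise cross + residual
```

Crossing at d_small instead of d_wide cuts the per-request matrix work by roughly (d_wide / d_small)^2, which is the "reduce serving latency while preserving signal" trade the projection layer is making.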
Why long sequences in ads ranking¶
- More signal. User behaviour over hundreds of past actions carries more predictive information than last-N-action summaries.
- Attention picks relevance. The Transformer's self-attention learns which past actions matter for each candidate, rather than relying on hand-engineered recency decay.
- Scale-invariant shape. One architecture handles new users (short sequences), returning users (longer sequences), and power users (thousands of actions) without hand-tuning feature buckets.
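The second and third points can be illustrated together with candidate-aware attention pooling: each past action is weighted by its affinity with the current candidate, and the same code path handles any history length. This is a hypothetical dot-product sketch, not the actual attention mechanism used in production.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def candidate_attended_history(history, candidate):
    """Weight each past action by its dot-product affinity with the
    candidate, so relevant history dominates the pooled embedding."""
    weights = softmax(history @ candidate / np.sqrt(history.shape[-1]))
    return weights @ history               # fixed shape (d,) for any seq_len

rng = np.random.default_rng(2)
d = 8
short_history = rng.normal(size=(5, d))    # new user: 5 actions
long_history = rng.normal(size=(2000, d))  # power user: 2000 actions
candidate = rng.normal(size=(d,))

u_short = candidate_attended_history(short_history, candidate)
u_long = candidate_attended_history(long_history, candidate)
```

No recency-decay heuristic appears anywhere: the learned (here, random) affinities decide which actions matter, and both users produce an embedding of identical shape.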
The composition claim¶
Pinterest's load-bearing empirical finding is that long user sequences did not produce consistent gains when applied in isolation on a single surface:
"When applied in isolation (e.g., MMoE on HF alone, or long sequence Transformers on SR alone), these changes did not produce consistent gains, or the gain and cost trade-off was not favorable. However, when we integrated these components into a single unified model and expanded training to leverage combined HF+SR features and multi-surface training data, we observed stronger improvements with a more reasonable cost profile."
The interpretation: long-sequence Transformers need diverse feature coverage (combined HF+SR features) and a broader training distribution (multi-surface training data) to clear their cost bar. Surface-specific training data is too narrow to give the Transformer enough signal diversity to justify its serving cost.
Caveats¶
- Topology not disclosed. Pinterest doesn't specify sequence length, attention head count, layer count, hidden dim, feature tokenisation, or the user-action vocabulary size.
- A prior Pinterest post on user-action sequence modeling (User Action Sequence Modeling for Pinterest Ads Engagement Modeling) is referenced via footnote [2] but not ingested on the wiki.
- Cost profile is the headline concern. The projection layer (DCNv2) exists specifically because the long-sequence Transformer's output is expensive to consume — long sequences push against serving-latency budgets.
- "Long" is not numerically defined in the Pinterest post.
Seen in¶
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: long-sequence Transformer as ads-ranking feature encoder; required DCNv2 projection for latency; required multi-surface training data to clear cost bar.