User Action as Token¶
User-action-as-token is a recommendation-system framing that treats a user's chronological sequence of actions (bookings, views, searches, clicks, listens, watches, etc.) the same way a language model treats a sentence: as an ordered sequence of tokens consumable by a transformer. Each action is encoded as a dense vector — typically by summing embeddings of its constituent features — and the full sequence is fed through attention layers whose output is used to predict the next item (or an item's probability under the user's context).
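The encoding step can be sketched in a few lines. This is a toy illustration with random vectors standing in for learned embedding tables; the attribute names, vocabularies, and dimension are hypothetical, not from any cited system:

```python
import random

random.seed(0)

DIM = 8  # embedding dimension (illustrative)

def make_table(vocab, dim=DIM):
    """One random vector per attribute value (stand-in for a learned embedding table)."""
    return {v: [random.uniform(-1, 1) for _ in range(dim)] for v in vocab}

# Hypothetical attribute vocabularies
city_emb = make_table(["paris", "tokyo", "lima"])
type_emb = make_table(["booking", "view", "search"])

def action_token(action):
    """Encode one user action as the element-wise sum of its attribute embeddings."""
    vecs = [city_emb[action["city"]], type_emb[action["type"]]]
    return [sum(components) for components in zip(*vecs)]

history = [
    {"type": "search", "city": "tokyo"},
    {"type": "view", "city": "tokyo"},
    {"type": "booking", "city": "paris"},
]
# Ordered token sequence: the input a transformer would attend over
sequence = [action_token(a) for a in history]
```

In a real system the tables are trained parameters and the sequence feeds attention layers; here only the token-construction step is shown.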
Key properties¶
- Uniform representation across action types. Bookings, views, and searches can all be serialized into the same embedding space by summing per-attribute embeddings. The transformer doesn't need separate heads per source — attention handles the weighting.
- Per-action attribute summation. Canonical recipe: each token's embedding = sum of embeddings of the action's attributes. Airbnb sums `city + region + days-to-today`; Pinterest, YouTube, and Netflix-style recommenders use analogous recipes with item-id / category / recency embeddings (Source: sources/2026-03-12-airbnb-destination-recommendation-transformer).
- Short-term and long-term mixing happens inside attention, not outside it. A naive design uses separate models for recent vs historical behavior; user-action-as-token lets one transformer weight the full sequence, so the architecture doesn't pre-commit to a cutoff.
- Contextual features are not tokens. Current time / seasonality / locale typically enter as separate contextual inputs, not as extra tokens in the sequence.
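The "context is not a token" point can be made concrete: contextual features typically join the pooled sequence representation at the prediction head, never entering attention. A minimal sketch, with all vectors and weights invented for illustration:

```python
def score_items(seq_repr, context, weights):
    """Concatenate the pooled sequence representation with contextual features
    (current time / seasonality / locale) and apply a linear head.
    The context never appears as a token in the attention input."""
    features = seq_repr + context  # list concatenation = feature concatenation
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

seq_repr = [0.2, -0.1]         # pooled transformer output (assumed)
context  = [1.0, 0.0, 0.5]     # e.g. one-hot season + normalized hour (assumed)
weights  = [[1, 0, 0, 0, 0],   # one row per candidate item (toy values)
            [0, 1, 0, 0, 1]]
scores = score_items(seq_repr, context, weights)
```

The design choice matters because context applies to the prediction as a whole, not to any single past action, so there is no natural position for it in the sequence.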
Why the analogy holds¶
- Language tokens and user actions are both discrete, ordered, high-cardinality, and sparse in any given user/document.
- Transformer attention is well-matched to "which past actions matter for this prediction?" — the recommendation analogue of "which prior words matter for the next word?" in NLP.
- Pre-training / fine-tuning discipline transfers: both masked/next-item prediction and contrastive (same-user-different-session) objectives generalize from language to user sequences.
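The next-item objective above maps directly onto causal next-token prediction: every position in the history yields one training pair. A minimal sketch (action strings are placeholders for encoded tokens):

```python
def next_item_examples(action_sequence):
    """Causal next-item training pairs: predict action t from actions < t,
    the same shape as next-token prediction in language modeling."""
    return [(action_sequence[:t], action_sequence[t])
            for t in range(1, len(action_sequence))]

pairs = next_item_examples(["search:tokyo", "view:tokyo", "book:paris"])
# each pair is (context prefix, target action)
```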
Why it can mislead¶
- User sessions have much wider gaps than words in a sentence; the `days-to-today` embedding is load-bearing, not a gimmick.
- Action types are heterogeneous — a booking is qualitatively different from a view. Summing attribute embeddings flattens this; some designs add a learned action-type embedding to restore the distinction.
- User vocabulary is enormous (cities × regions × days × action types) compared to a natural language's ~50K tokens; embedding table sizing and sharing become the dominant engineering problem.
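The vocabulary-size point is back-of-envelope arithmetic. With invented but plausible cardinalities, a table with one row per distinct full action is infeasible, while per-attribute tables shared across actions stay small:

```python
# Hedged estimate: all cardinalities below are illustrative, not from the source.
cities, regions, day_buckets, action_types = 30_000, 200, 400, 3

# One embedding row per distinct (city, region, day-bucket, type) combination:
naive_vocab = cities * regions * day_buckets * action_types

# Shared per-attribute tables, with the action embedding formed by summation:
factored_rows = cities + regions + day_buckets + action_types
```

Factoring the vocabulary into summed attribute tables is what keeps the parameter count in the same regime as a language model's ~50K-token table.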
Seen in¶
- systems/airbnb-destination-recommendation — per-action embedding = sum of `city + region + days-to-today`, three parallel sequences (booking / view / search) fed through a transformer, final heads predict region + city destination (Source: sources/2026-03-12-airbnb-destination-recommendation-transformer).
Related¶
- concepts/vector-embedding — generic dense numerical representation; user-action-as-token produces one embedding per action via attribute summation.
- patterns/active-dormant-user-training-split — training-data companion: what the "token sequence" is changes between active and dormant users.
- patterns/hierarchical-multitask-geo-prediction — output-side companion: multiple prediction heads over the sequence encoder.