User Action as Token¶
User-action-as-token is a recommendation-system framing that treats a user's chronological sequence of actions (bookings, views, searches, clicks, listens, watches, etc.) the same way a language model treats a sentence: as an ordered sequence of tokens consumable by a transformer. Each action is encoded as a dense vector — typically by summing embeddings of its constituent features — and the full sequence is fed through attention layers whose output is used to predict the next item (or an item's probability under the user's context).
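The encoding step can be sketched in a few lines. This is a toy illustration with random vectors standing in for learned embedding tables; the attribute names, vocabularies, and dimension are hypothetical, not from any cited system:

```python
import random

random.seed(0)

DIM = 8  # embedding dimension (illustrative)

def make_table(vocab, dim=DIM):
    """One random vector per attribute value (stand-in for a learned embedding table)."""
    return {v: [random.uniform(-1, 1) for _ in range(dim)] for v in vocab}

# Hypothetical attribute vocabularies
city_emb = make_table(["paris", "tokyo", "lima"])
type_emb = make_table(["booking", "view", "search"])

def action_token(action):
    """Encode one user action as the element-wise sum of its attribute embeddings."""
    vecs = [city_emb[action["city"]], type_emb[action["type"]]]
    return [sum(components) for components in zip(*vecs)]

history = [
    {"type": "search", "city": "tokyo"},
    {"type": "view", "city": "tokyo"},
    {"type": "booking", "city": "paris"},
]
# Ordered token sequence: the input a transformer would attend over
sequence = [action_token(a) for a in history]
```

In a real system the tables are trained parameters and the sequence feeds attention layers; here only the token-construction step is shown.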
Key properties¶
- Uniform representation across action types. Bookings, views, and searches can all be serialized into the same embedding space by summing per-attribute embeddings. The transformer doesn't need separate heads per source — attention handles the weighting.
- Per-action attribute summation. Canonical recipe: each token's embedding = sum of embeddings of the action's attributes. Airbnb sums `city + region + days-to-today`; Pinterest, YouTube, and Netflix-style recommenders use analogous recipes with item-id / category / recency embeddings (Source: sources/2026-03-12-airbnb-destination-recommendation-transformer).
- Short-term and long-term mixing happens inside attention, not outside it. A naive design uses separate models for recent vs historical behavior; user-action-as-token lets one transformer weight the full sequence, so the architecture doesn't pre-commit to a cutoff.
- Contextual features are not tokens. Current time / seasonality / locale typically enter as separate contextual inputs, not as extra tokens in the sequence.
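The "context is not a token" point can be made concrete: contextual features typically join the pooled sequence representation at the prediction head, never entering attention. A minimal sketch, with all vectors and weights invented for illustration:

```python
def score_items(seq_repr, context, weights):
    """Concatenate the pooled sequence representation with contextual features
    (current time / seasonality / locale) and apply a linear head.
    The context never appears as a token in the attention input."""
    features = seq_repr + context  # list concatenation = feature concatenation
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

seq_repr = [0.2, -0.1]         # pooled transformer output (assumed)
context  = [1.0, 0.0, 0.5]     # e.g. one-hot season + normalized hour (assumed)
weights  = [[1, 0, 0, 0, 0],   # one row per candidate item (toy values)
            [0, 1, 0, 0, 1]]
scores = score_items(seq_repr, context, weights)
```

The design choice matters because context applies to the prediction as a whole, not to any single past action, so there is no natural position for it in the sequence.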
Why the analogy holds¶
- Language tokens and user actions are both discrete, ordered, high-cardinality, and sparse in any given user/document.
- Transformer attention is well-matched to "which past actions matter for this prediction?" — the recommendation analogue of "which prior words matter for the next word?" in NLP.
- Pre-training / fine-tuning discipline transfers: both masked/next-item prediction and contrastive (same-user-different-session) objectives generalize from language to user sequences.
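The next-item objective above maps directly onto causal next-token prediction: every position in the history yields one training pair. A minimal sketch (action strings are placeholders for encoded tokens):

```python
def next_item_examples(action_sequence):
    """Causal next-item training pairs: predict action t from actions < t,
    the same shape as next-token prediction in language modeling."""
    return [(action_sequence[:t], action_sequence[t])
            for t in range(1, len(action_sequence))]

pairs = next_item_examples(["search:tokyo", "view:tokyo", "book:paris"])
# each pair is (context prefix, target action)
```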
Why it can mislead¶
- User sessions have much wider gaps than words in a sentence; the `days-to-today` embedding is load-bearing, not a gimmick.
- Action types are heterogeneous — a booking is qualitatively different from a view. Summing attribute embeddings flattens this; some designs add a learned action-type embedding to restore the distinction.
- User vocabulary is enormous (cities × regions × days × action types) compared to a natural language's ~50K tokens; embedding table sizing and sharing become the dominant engineering problem.
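The vocabulary-size point is back-of-envelope arithmetic. With invented but plausible cardinalities, a table with one row per distinct full action is infeasible, while per-attribute tables shared across actions stay small:

```python
# Hedged estimate: all cardinalities below are illustrative, not from the source.
cities, regions, day_buckets, action_types = 30_000, 200, 400, 3

# One embedding row per distinct (city, region, day-bucket, type) combination:
naive_vocab = cities * regions * day_buckets * action_types

# Shared per-attribute tables, with the action embedding formed by summation:
factored_rows = cities + regions + day_buckets + action_types
```

Factoring the vocabulary into summed attribute tables is what keeps the parameter count in the same regime as a language model's ~50K-token table.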
Seen in¶
- systems/airbnb-destination-recommendation — per-action embedding = sum of `city + region + days-to-today`, three parallel sequences (booking / view / search) fed through a transformer, final heads predict region + city destination (Source: sources/2026-03-12-airbnb-destination-recommendation-transformer).
Related¶
- concepts/vector-embedding — generic dense numerical representation; user-action-as-token produces one embedding per action via attribute summation.
- patterns/active-dormant-user-training-split — training-data companion: what the "token sequence" is changes between active and dormant users.
- patterns/hierarchical-multitask-geo-prediction — output-side companion: multiple prediction heads over the sequence encoder.