CONCEPT Cited by 1 source
Domain-specific tokenization¶
Definition¶
Domain-specific tokenization replaces generic text tokenizers (BPE, SentencePiece) with purpose-built vocabularies that map directly to domain entities. Each token corresponds to a meaningful product concept (an item, a category, an action type) rather than subword text fragments.
Advantages¶
- Computational efficiency — Dramatically shorter sequences. Netflix's GenPage compresses a watch event from 16 GPT-5 tokens to 4 domain tokens.
- Product control — Direct mapping between tokens and business concepts enables token-level constrained decoding for business rules.
- Interpretability — Each generated token has clear semantic meaning (a specific movie, a specific row type).
Trade-offs¶
- Requires daily vocabulary updates as the catalog evolves (new entities/rows)
- New vocabulary items need fallback token strategies until embeddings are learned
- Cannot leverage pre-trained language model weights (must train from scratch)
Seen in¶
- sources/2026-06-29-netflix-genpage-generative-homepage-construction — Netflix GenPage uses one token per entity/row, achieving ~4× compression vs. text tokenizers