Skip to content

CONCEPT Cited by 1 source

Domain-specific tokenization

Definition

Domain-specific tokenization replaces generic text tokenizers (BPE, SentencePiece) with purpose-built vocabularies that map directly to domain entities. Each token corresponds to a meaningful product concept (an item, a category, an action type) rather than subword text fragments.

Advantages

  1. Computational efficiency — Dramatically shorter sequences. Netflix's GenPage compresses a watch event from 16 GPT-5 tokens to 4 domain tokens.
  2. Product control — Direct mapping between tokens and business concepts enables token-level constrained decoding for business rules.
  3. Interpretability — Each generated token has clear semantic meaning (a specific movie, a specific row type).

Trade-offs

  • Requires daily vocabulary updates as the catalog evolves (new entities/rows)
  • New vocabulary items need fallback token strategies until embeddings are learned
  • Cannot leverage pre-trained language model weights (must train from scratch)

Seen in

Last updated · 560 distilled / 1,653 read