Trust & Safety classifier¶
Definition¶
A Trust & Safety (T&S) classifier is a small fine-tuned model that sits in front of a user-facing LLM application, classifies incoming user questions (prompts) into safety labels, and cancels downstream work on unsafe inputs, returning a templated safe response instead. The T&S classifier is a pre-retrieval safety gate: it runs before any content retrieval, prompt composition, or generation, so unsafe questions consume only the classifier's tokens rather than the full pipeline's cost and latency.
The concept generalises across any user-facing LLM product but is canonicalised on the wiki by Yelp's Biz Ask Anything system (2026-03-27).
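A minimal sketch of the gate shape (all names are hypothetical; Yelp's implementation is not public, and the keyword check stands in for the fine-tuned classifier call):

```python
from dataclasses import dataclass
from typing import Optional

SAFE_TEMPLATE = (
    "Sorry, I can't help with that. "
    "Try asking about this business's services or offerings."
)

@dataclass
class Verdict:
    unsafe: bool
    label: Optional[str] = None  # e.g. "system_attack", "illegal_activity"

def ts_classify(question: str) -> Verdict:
    """Stand-in for the fine-tuned T&S classifier call."""
    blocked = ("ignore previous instructions", "plan a heist")
    hit = any(b in question.lower() for b in blocked)
    return Verdict(unsafe=hit, label="system_attack" if hit else None)

def answer(question: str) -> str:
    verdict = ts_classify(question)
    if verdict.unsafe:
        # Pre-retrieval short-circuit: no retrieval, no prompt, no generation.
        return SAFE_TEMPLATE
    return run_full_pipeline(question)  # retrieval + prompt composition + generation

def run_full_pipeline(question: str) -> str:
    return f"[grounded answer for: {question}]"
```

The load-bearing property is where the gate sits: the unsafe branch returns before any retrieval or generation is scheduled.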
Why pre-retrieval (not post-generation)¶
Three structural reasons:
- Cost. Unsafe questions are the expensive ones to generate for (e.g. prompt-injection attempts often try to extract long responses). Blocking pre-retrieval avoids the generation cost entirely.
- Latency. T&S classification runs in parallel with content fetching; the rejection path short-circuits the request to a templated response.
- Defence-in-depth. Post-generation filters can miss creative outputs; pre-classification sees the intent before any evidence is assembled.
What unsafe labels cover¶
Yelp's canonical examples:
- System attacks (prompt injection, instruction override).
- Illegal-activity prompts (requests to plan illegal acts).
- Off-topic / harmful questions that should not surface an LLM answer at all.
"When testing our chat bot internally and on questions from real consumers posted to the Ask the Community feature on Yelp business pages, we found various questions that we should avoid answering." (Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product)
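One way to represent the label space is a multi-label set, since a single question can trip more than one category (the category names and keyword rules below are illustrative, not Yelp's actual taxonomy):

```python
from enum import Enum

class UnsafeLabel(Enum):
    SYSTEM_ATTACK = "system_attack"        # prompt injection, instruction override
    ILLEGAL_ACTIVITY = "illegal_activity"  # requests to plan illegal acts
    OFF_TOPIC_HARMFUL = "off_topic_harmful"

def labels_for(question: str) -> set:
    """Toy multi-label rule stand-in for the fine-tuned classifier."""
    q = question.lower()
    found = set()
    if "ignore previous instructions" in q:
        found.add(UnsafeLabel.SYSTEM_ATTACK)
    if "counterfeit" in q:
        found.add(UnsafeLabel.ILLEGAL_ACTIVITY)
    return found  # empty set means the question passes the gate
```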
On reject: templated safe answer¶
Yelp's response discipline: return a templated, safe answer and encourage users to ask questions about the business's services or offerings. The classifier is not empowered to rephrase or partially answer; it is a binary gate on the safety dimension.
Training data discipline¶
Yelp's T&S training set:
- A few thousand question-label pairs.
- ~50% legitimate questions mixed in as negatives (prevents over-rejection).
- Seeded by manually crafted question-label pairs, then LLM-generated variations for paraphrase diversity.
- "The majority of our time was spent on crafting examples that could be considered ambiguous or borderline until we reached a precision-recall sweetspot."
Classic precision-recall tuning: set the rejection threshold too high and unsafe questions leak through; set it too low and legitimate questions are incorrectly rejected.
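The dataset recipe above can be sketched as follows (helper names, the paraphrase count, and the exact mixing are assumptions, not Yelp's published procedure):

```python
import random

def build_training_set(seed_unsafe, llm_paraphrase, legit_questions, n_variants=3):
    """Mix LLM-paraphrased unsafe seeds with ~50% legitimate negatives."""
    rows = []
    for question, label in seed_unsafe:  # manually crafted seed pairs
        rows.append({"question": question, "label": label})
        # Paraphrase diversity via an LLM (llm_paraphrase is a stand-in callable).
        for variant in llm_paraphrase(question, n=n_variants):
            rows.append({"question": variant, "label": label})
    # Match the unsafe count with legitimate negatives to prevent over-rejection.
    negatives = random.sample(legit_questions, min(len(rows), len(legit_questions)))
    rows += [{"question": q, "label": "safe"} for q in negatives]
    random.shuffle(rows)
    return rows
```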
Base model: a small fine-tuned LLM¶
Yelp fine-tuned GPT-4.1-nano as the T&S base. Load-bearing properties of the small-model choice:
- Low latency — T&S runs on the critical path of every request; a frontier-model call would blow the latency budget.
- Fine-tuning is cheap on a few-thousand-label dataset.
- Inference cost — every request pays for T&S, so per-request cost dominates aggregate economics.
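Fine-tuning data for a small chat model is typically serialised as chat-format JSONL, one example per line; a hedged sketch of what a single row might look like (the exact schema depends on the fine-tuning provider, and the system prompt here is invented):

```python
import json

def to_chat_jsonl_row(question: str, label: str) -> str:
    """One training example: user question in, safety label out."""
    example = {
        "messages": [
            {"role": "system", "content": "Classify the question for Trust & Safety."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": label},
        ]
    }
    return json.dumps(example)
```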
Architectural position¶
T&S runs in parallel with the other three question-analysis components (Inquiry Type, Content Source Selection, Keyword Generation) — see patterns/parallel-pre-retrieval-classifier-pipeline. On T&S reject, downstream content-fetch + answer-generation work is cancelled to save latency and cost.
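The parallel fan-out with cancel-on-reject can be sketched with asyncio (component bodies and timings are stand-ins; only one sibling analysis component is shown):

```python
import asyncio

async def ts_classifier(q: str) -> str:
    await asyncio.sleep(0.01)  # small model: fastest component
    return "unsafe" if "ignore previous instructions" in q.lower() else "safe"

async def inquiry_type(q: str) -> str:
    await asyncio.sleep(0.05)
    return "services"

async def fetch_content(q: str) -> list:
    await asyncio.sleep(0.05)
    return ["review snippets"]

async def handle(question: str) -> str:
    fetch = asyncio.ensure_future(fetch_content(question))
    analysis = asyncio.ensure_future(inquiry_type(question))
    verdict = await ts_classifier(question)  # runs concurrently with the tasks above
    if verdict == "unsafe":
        fetch.cancel()      # cancel downstream work on reject:
        analysis.cancel()   # no retrieval or generation cost is paid
        return "templated safe answer"
    return f"answer from {await fetch} ({await analysis})"
```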
Evolution: "Labels keep changing"¶
Yelp's stance: T&S is a live label taxonomy, not a fixed-at-launch classifier. "As usage shifts, new behaviors appear. We treat T&S and Inquiry Type as live label taxonomies and periodically add/update labels."
Tradeoffs / gotchas¶
- False positives are user-hostile — legitimate questions wrongly flagged degrade the product experience; the 50%-legitimate-in-training balance is essential.
- False negatives are brand-hostile — unsafe answers erode user trust and can be screenshot-amplified.
- Adversarial adaptation — prompt-injection techniques evolve; the classifier needs continuous training-data augmentation from live traffic.
- Multi-label overlap — a question can be flagged under multiple categories (injection + illegal); Yelp's labeling permits multi-label.
- Language coverage — the post doesn't disclose T&S training-data language mix or behaviour on non-English inputs.
- Cost of the classifier itself — running T&S on every request adds a fixed per-request cost; Yelp mitigates via the small-model (nano) choice.
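The two failure modes above can be tracked with a simple eval over a labeled holdout set (a sketch; `classify_fn` stands in for the real classifier, and the rate definitions are the usual ones, not Yelp's published metrics):

```python
def fp_fn_rates(classify_fn, holdout):
    """holdout: list of (question, truly_unsafe) pairs; returns (FP rate, FN rate)."""
    fp = fn = legit = unsafe = 0
    for question, truly_unsafe in holdout:
        predicted_unsafe = classify_fn(question)
        if truly_unsafe:
            unsafe += 1
            fn += (not predicted_unsafe)  # unsafe leaked through: brand-hostile
        else:
            legit += 1
            fp += predicted_unsafe        # legitimate question rejected: user-hostile
    return fp / max(legit, 1), fn / max(unsafe, 1)
```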
Seen in¶
- sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product — canonical wiki instance. Yelp's BAA runs a fine-tuned GPT-4.1-nano T&S classifier as one of four parallel question-analysis components.
Related¶
- concepts/inquiry-type-classifier — sibling scope gate; runs in parallel with T&S.
- concepts/content-grounded-answer — the product discipline the T&S gate supports.
- patterns/parallel-pre-retrieval-classifier-pipeline — the pattern T&S lives inside.
- systems/yelp-biz-ask-anything — canonical production instance.
- systems/gpt-4-1-nano — the fine-tuning base.
- companies/yelp