PATTERN
# Human-calibrated LLM labeling
Human-calibrated LLM labeling is the pattern of training a high-volume ML model (a ranker, classifier, preference model, …) on labels generated by an LLM judge that is itself calibrated against a small seed set of human labels. Humans don't label the training set; humans label the calibration set that teaches the LLM how to label. The LLM then scales the labeling effort by ~100×.
## Shape
- Seed. A small team of human evaluators labels a high-quality dataset. Scale: "orders of magnitude smaller than what would be required for full training" (Source: sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search).
- Calibrate. Use the seed set to tune the LLM judge — prompt engineering, model selection, reasoning depth, tool access, and optionally an automated optimizer such as DSPy. Objective: minimize judge-vs-human disagreement (e.g. MSE on a graded scale).
- Gate. Only once the judge meets an agreement threshold against the seed set is it allowed to produce production labels.
- Amplify. The calibrated LLM labels hundreds of thousands to millions of (input, output) pairs used to train the production model.
- Anchor. The human seed set is retained as a permanent reference against which judge drift (new model version, prompt change, product-requirement shift) is continuously monitored.
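The calibrate-and-gate steps above can be sketched as a small loop. This is a minimal illustration, not Dropbox's code: the function names, the toy labels, and the 0.5 threshold are all assumptions.

```python
# Sketch of the calibrate-and-gate step: compare judge labels with the
# human seed set and admit the judge to production labeling only once
# its disagreement (MSE on the 1-5 scale) falls below a threshold.
# All names, labels, and the threshold value are illustrative.

def mse(judge_labels, human_labels):
    """Mean squared error between judge grades and human grades."""
    assert len(judge_labels) == len(human_labels)
    return sum((j - h) ** 2 for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

def gate(judge_labels, human_labels, threshold=0.5):
    """True only if the judge agrees closely enough with the seed set."""
    return mse(judge_labels, human_labels) <= threshold

human_seed = [5, 4, 2, 1, 3]   # small human-labeled calibration set
judge_out  = [5, 4, 3, 1, 3]   # same items, graded by the LLM judge

if gate(judge_out, human_seed):
    print("judge passes: allowed to label the production set")
else:
    print("judge fails: re-tune prompt / model and retry")
```

In the failing branch, the re-tune step is where prompt engineering, model selection, or a DSPy-style optimizer would run before the gate is tried again.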
## Why it works
- Economics. Human labeling is expensive and slow. An LLM judge is "significantly cheaper, more consistent, and capable of evaluating much larger candidate sets across languages."
- Privacy. LLMs can "analyze customer content within defined compliance boundaries" — humans typically cannot review sensitive/proprietary customer data. The calibration set uses only non-sensitive internal content, bypassing the privacy ceiling on pure human labeling.
- Consistency. Humans drift across labelers + across time; a calibrated LLM applies the same rubric uniformly.
- Bounded-badness. The human seed set provides a floor: if the LLM starts disagreeing, MSE on the seed set rises and the drift is visible before it pollutes the production training set.
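The bounded-badness property reduces to a drift check against the permanent anchor set. A minimal sketch, assuming a 0.1 tolerance and toy labels (none of this is from the source):

```python
# Minimal drift monitor: the human seed set is a fixed anchor; any judge
# change (new model version, prompt edit) is re-scored against it, and a
# rise in MSE beyond a tolerance flags drift before production labels
# are polluted. Names, labels, and the tolerance are illustrative.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def drift_check(anchor_human, old_judge, new_judge, tolerance=0.1):
    """Flag the new judge if its seed-set MSE worsens by more than `tolerance`."""
    old_mse = mse(old_judge, anchor_human)
    new_mse = mse(new_judge, anchor_human)
    return {"old_mse": old_mse, "new_mse": new_mse,
            "drifted": new_mse > old_mse + tolerance}

anchor = [5, 4, 2, 1, 3]
old    = [5, 4, 3, 1, 3]   # previously gated judge
new    = [4, 3, 3, 2, 3]   # candidate judge after a prompt change
print(drift_check(anchor, old, new))
```

Because the anchor set never changes, the same check covers all three drift sources named above: new model version, prompt change, product-requirement shift.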
## Force-multiplier framing
Named in the Dropbox post: "humans teach the LLM, and the LLM generates large-scale training data in return." Dash diagram shows ~100× multiplier from human effort → training-set scale. The pattern transforms labeling from a headcount-bound bottleneck into a compute-bound problem.
## Contrast with alternatives
| Approach | Cost | Scale | Coverage | Privacy |
|---|---|---|---|---|
| Pure human labeling | High | Low | Comprehensive | Can't cover customer data |
| Behaviour-inferred (clicks/skips) | Free | Large | Sparse + biased | OK but incomplete |
| LLM-only (uncalibrated) | Low | Large | Comprehensive | Unreliable |
| Human-calibrated LLM | Medium | Large | Comprehensive | OK (LLM reads customer data) |
Behaviour signals are kept as a supplement (and as an input to patterns/behavior-discrepancy-sampling) — not a primary label source.
## Why LLMs aren't used at query time
A consistent framing in the Dash post: LLMs are too slow and too context-hungry to replace the ranker at serving time, so they are used offline to teach a smaller, more efficient production model. This mirrors the general training-vs-serving split in ML systems: move intelligence off the serving path.
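The offline amplify step can be sketched as a batch job: the judge labels (input, output) pairs into a training file, and only the small ranker trained on that file ever runs at query time. The `llm_judge` stub and all names here are hypothetical stand-ins; in practice the judge would be an API call with the calibrated prompt.

```python
# Sketch of the offline "amplify" step: the (slow, expensive) LLM judge
# runs in batch, off the serving path, producing training labels for a
# small ranker. `llm_judge` is a stand-in stub, not a real judge.
import json

def llm_judge(query, document):
    """Stub for the calibrated LLM judge; returns a 1-5 relevance grade."""
    return 4 if document.startswith(query.split()[0]) else 2

pairs = [("tax forms 2025", "tax filing checklist"),
         ("onboarding doc", "quarterly sales deck")]

with open("training_labels.jsonl", "w") as f:
    for query, doc in pairs:
        row = {"query": query, "doc": doc, "label": llm_judge(query, doc)}
        f.write(json.dumps(row) + "\n")

# training_labels.jsonl now feeds the small production ranker's trainer;
# the LLM never appears on the query-time serving path.
```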
## When to reach for it
- Labeling cost dominates model-quality budget.
- Labels are graded or structured enough to state a rubric.
- You can produce an LLM judge whose agreement with humans is measurable (MSE, disagreement rate, rank correlation).
- Production serving latency forbids LLM-at-query-time — forcing you to train a smaller model whose quality is label-limited.
- You expect continued product, rubric, or model evolution, and therefore need a reusable calibration loop rather than a one-off labeling push.
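The three agreement measures named in the list above (MSE, disagreement rate, rank correlation) are all cheap to compute on a seed set. A dependency-free sketch on toy grades; the Spearman implementation assumes no tied grades for simplicity:

```python
# Three judge-vs-human agreement metrics on a toy seed set.
# Hand-rolled to stay dependency-free; illustrative only.

def mse(judge, human):
    return sum((j - h) ** 2 for j, h in zip(judge, human)) / len(judge)

def disagreement_rate(judge, human):
    """Fraction of items where the judge's grade differs from the human's."""
    return sum(j != h for j, h in zip(judge, human)) / len(judge)

def spearman(judge, human):
    """Spearman rank correlation; assumes no tied grades."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rj, rh = ranks(judge), ranks(human)
    n = len(judge)
    d2 = sum((a - b) ** 2 for a, b in zip(rj, rh))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [5, 3, 1, 4, 2]
judge = [5, 2, 1, 4, 3]
print(mse(judge, human), disagreement_rate(judge, human), spearman(judge, human))
```

MSE suits graded scales, disagreement rate suits categorical rubrics, and rank correlation suits cases where only the ordering of candidates matters to the downstream ranker.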
## Tradeoffs
- Judge bias is training bias. If the LLM systematically under-rates or mis-disambiguates a class of queries, the trained ranker inherits it. Mitigations: patterns/behavior-discrepancy-sampling to surface mismatch cases, patterns/judge-query-context-tooling to close the vocabulary gap.
- Calibration is continuous, not one-shot. Model updates, prompt changes, content evolution → re-check MSE on the human seed set, re-tune.
- Requires a high-quality seed set. If the human labels are inconsistent or biased, the entire amplifier is biased. "Even humans—multiple humans—will disagree" — label against a rubric, compute inter-annotator agreement.
- Not a substitute for rubric design. Garbage rubric → LLM optimises to garbage faster than humans did.
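The inter-annotator-agreement check mentioned above can be sketched with Cohen's kappa for two labelers: raw agreement corrected for chance agreement. Hand-rolled and illustrative; the rater data is invented.

```python
# Cohen's kappa for two human labelers on the seed set: observed
# agreement corrected for the agreement expected by chance. A low
# kappa means the rubric needs tightening before it can calibrate a judge.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

rater1 = [5, 4, 4, 2, 1, 3, 5, 2]
rater2 = [5, 4, 3, 2, 1, 3, 4, 2]
print(round(cohens_kappa(rater1, rater2), 3))
```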
## Seen in
- sources/2026-02-26-dropbox-using-llms-to-amplify-human-labeling-dash-search — canonical instantiation. Dash's relevance ranker (systems/dash-relevance-ranker, XGBoost-class) is trained on hundreds of thousands to millions of LLM-generated 1–5 relevance labels; the LLM judge is calibrated against a small human-labeled internal dataset via MSE on the 1–5 scale; DSPy automates prompt tuning; the ~100× diagram is the source of the force-multiplier framing.
## Related
- concepts/llm-as-judge — the calibrated evaluator itself.
- concepts/rag-as-a-judge — judge augmented with retrieval.
- patterns/judge-query-context-tooling — judge given tools to research the labeling context before scoring.
- patterns/behavior-discrepancy-sampling — how to choose which cases to route to human review to keep the calibration loop efficient.
- patterns/prompt-optimizer-flywheel — the DSPy-driven tightening loop that sits inside the calibration stage.
- systems/dash-relevance-ranker — the production model the pattern trains at Dash.
- systems/dropbox-dash — the product the ranker serves.