
PATTERN

Product feedback to eval labels

Embed evaluation-label creation in the product surface itself. Every user interaction with a deployed agent — thumbs-up, thumbs-down, free-text correction, explicit rating — is a candidate eval label. Combine the feedback signal with the investigation's own telemetry trace to generate a ground-truth RCA + world-snapshot query set without a separate labelling campaign.

Intent

Hand-crafted evaluation labels don't scale. Engineering hours spent writing labels can't keep pace with the rate of new scenarios the agent is asked to handle, so the label set stays narrow and non-representative of real traffic. The narrower the label set, the more eval passes overstate production quality. See concepts/evaluation-label.

Mechanism

  1. Instrument product feedback with enough context to regenerate a label. A standalone thumbs-down isn't enough; attach the investigation trace (tool calls, intermediate outputs, final answer). The feedback form should also accept a free-text correction.
  2. Generate a candidate label from feedback + trace. The correction becomes the ground-truth RCA; the trace identifies the world-snapshot queries the agent already used (plus any that reviewers add for noise). Ambiguous feedback ("it was slow") is resolved into precise statements ("elevated latency in service X") — see patterns/agent-assisted-label-validation for how.
  3. Score the candidate label. Assign a confidence score across thoroughness, specificity, and accuracy; sub-threshold labels go to human review, while above-threshold labels are admitted to the set with lighter review.
  4. Feedback loop compounds with adoption. As more customers use the agent, more feedback arrives, and label volume + diversity scale with product usage — not with a standalone labelling team.
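Steps 2–3 can be sketched as a small pipeline. This is an illustrative sketch, not Datadog's implementation; the field names (`correction`, `tool_calls`, `query`), the three-axis scorer stub, and the 0.8 threshold are all assumptions for the example.

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.8  # assumed value; tune against human-review agreement


@dataclass
class CandidateLabel:
    rca: str                    # ground-truth root cause, derived from the user's correction
    snapshot_queries: list      # queries the agent already ran (reviewers may add noise queries)
    scores: dict = field(default_factory=dict)


def score_axis(label: CandidateLabel, axis: str) -> float:
    # Placeholder: in practice an LLM judge or heuristic scores each axis.
    return 1.0 if label.rca and label.snapshot_queries else 0.0


def build_candidate(feedback: dict, trace: dict) -> CandidateLabel:
    """Turn a feedback event plus its investigation trace into a candidate label."""
    return CandidateLabel(
        rca=feedback["correction"],  # e.g. "elevated latency in service X"
        snapshot_queries=[call["query"] for call in trace["tool_calls"]],
    )


def route(label: CandidateLabel) -> str:
    """Score across the three axes and route on the weakest one."""
    label.scores = {axis: score_axis(label, axis)
                    for axis in ("thoroughness", "specificity", "accuracy")}
    confidence = min(label.scores.values())  # weakest axis gates admission
    return "admit_with_light_review" if confidence >= REVIEW_THRESHOLD else "human_review"
```

Gating on the weakest axis (rather than the mean) is one defensible choice: a label that is thorough and specific but inaccurate should still land in human review.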

Why it works

  • Representativity. Every added label comes from real traffic. The label distribution tracks the production distribution by construction.
  • Scaling with adoption. Label supply grows when it matters most (more users → more diverse scenarios → more labels).
  • Edge-case capture. Negative feedback concentrates on failures, which are exactly the labels that matter for regression detection ("the labels that matter most aren't the ones Bits passes; they're the ones it fails").
  • Differentiation tied to adoption. A competitor without a deployed agent has no feedback stream and cannot match the label set's real-world coverage.

Pre-requisites

  • Product instrumentation — feedback surface must exist and capture enough structured + free-form content to regenerate labels.
  • Trace storage — the agent's investigation must be durable enough to re-derive the world snapshot after the fact.
  • TTL awareness — source telemetry expires. If label generation runs too long after the feedback arrives, the signals the label would need are already gone. See concepts/telemetry-ttl-one-way-door.
  • Consent / privacy — user interactions may contain customer data; the pipeline must respect the same data-handling constraints as the agent itself.
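The TTL pre-requisite implies a freshness gate in the pipeline: before attempting label generation, check that every telemetry type the label would need is still within retention. A minimal sketch, with retention windows that are purely illustrative (real values depend on the telemetry backend):

```python
from datetime import datetime, timedelta, timezone

# Assumed retention windows per telemetry type; real values depend on the backend.
TTL = {
    "logs": timedelta(days=15),
    "metrics": timedelta(days=30),
    "traces": timedelta(days=7),
}


def snapshot_still_buildable(feedback_time: datetime,
                             now: datetime,
                             needed: set) -> bool:
    """True if every telemetry type the label needs is still within retention."""
    age = now - feedback_time
    return all(age < TTL[telemetry_type] for telemetry_type in needed)
```

Feedback that fails this check is a one-way door: the label can no longer be generated, which is an argument for running generation promptly rather than in slow batches.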

Reported impact

Datadog reports that embedding label creation into the product surface increased label creation rate by an order of magnitude vs. an internal manual labelling campaign over Datadog's own alerts (Source: sources/2026-04-07-datadog-bits-ai-sre-eval-platform).

Tradeoffs

  • Feedback bias. Users who leave feedback are not a uniform sample of users; silent correct paths are under-represented. Label segmentation ameliorates this; don't let pass rate on the feedback-derived set be the only quality signal.
  • Correction quality. A user's RCA may be wrong or shallow; the validation pipeline must catch this. See patterns/agent-assisted-label-validation for the agent-assisted + human-alignment mechanism.
  • Behavioural drift. The agent's behaviour when it generated the trace may differ from the candidate config now being evaluated. Mitigate by replaying the world snapshot against the candidate rather than relying on the recorded trace.
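The drift mitigation amounts to evaluating against the frozen snapshot, not the recorded trace. A hedged sketch of what that replay looks like (the snapshot-as-dict shape and the exact-match grader are simplifying assumptions; a real grader would be fuzzier):

```python
def evaluate(candidate_investigate, snapshot: dict, ground_truth_rca: str) -> bool:
    """Replay the frozen world snapshot against the candidate agent.

    The candidate is given only the snapshot's query function, so it
    investigates the same world the original agent saw, even if its
    behaviour has since drifted from the recorded trace.
    """
    answer = candidate_investigate(snapshot.get)  # candidate can only query the snapshot
    return answer == ground_truth_rca
```

The key property is that the candidate never sees the original agent's tool-call sequence, only the world those calls observed.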

Seen in

  • sources/2026-04-07-datadog-bits-ai-sre-eval-platform — the canonical case study. Bits AI SRE's feedback form drives a label-generation pipeline that grew label volume by an order of magnitude over manual labelling. Now generalised across other Datadog agent products.