

Visual eval-grading canvas

Visual eval-grading canvas is the pattern of building the human-labeling UI inside the product itself — reusing the product's own visualization primitive and extension API — rather than building a separate bespoke labeling tool. The product's interaction surface becomes the evaluation interaction surface, minimising internal-tooling cost while giving the labelers (typically internal employees) a familiar interface.

Intent

Relevance labeling for search / ranking needs three properties:

  1. Low-friction input. Labelers should be able to mark results correct/incorrect quickly — thousands of labels are common.
  2. Right visualization primitive. For visual content (designs, images, videos), a flat row-by-row labeling UI is inferior to one that lets the labeler see many candidates at once in spatial context.
  3. Version-to-version comparison. Labelers should see whether the model has improved since the last iteration, not just label each iteration in isolation.

If the product already has a UI primitive that solves (2) and (3) — e.g. an infinite canvas, a whiteboard, a node-graph editor, a timeline view — and exposes a plugin / extension API, building the labeling tool on top of those primitives is cheaper than building a separate tool from scratch.

Mechanism (Figma realization)

Figma AI Search's eval-grading tool (Source: sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma):

  • Built on Figma's public plugin API — same API third-party plugins use.
  • Displayed on Figma's infinite canvas — results laid out spatially, so the labeler sees many candidates at once with zoom / pan for detail.
  • Keyboard shortcuts for labels — "mark correct" / "mark incorrect" bound to keys for rapid labeling; avoids the mouse-click round-trip that dominates labeling time in click-driven UIs.
  • Historical comparison view — labelers can see whether the search model has improved between runs over the same query set.
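A minimal sketch of how such a plugin could be structured — all names (`SearchResult`, `LabelRecord`, `gridPosition`) are hypothetical, not Figma's actual tool; the `figma.*` calls shown in comments are from the public plugin API:

```typescript
// Sketch of an eval-grading plugin. The grid math is pure; the
// figma.* integration is shown as comments because it needs the
// plugin runtime's `figma` global.

interface SearchResult {
  query: string;
  nodeId: string; // id of the candidate design to display
  rank: number;
}

interface LabelRecord {
  query: string;
  nodeId: string;
  verdict: "correct" | "incorrect";
  labeler: string;
  labeledAt: number; // epoch ms
}

// Pure layout: place the i-th result in a fixed-width grid so many
// candidates are visible at once on the infinite canvas.
function gridPosition(
  index: number,
  cols: number,
  cellW: number,
  cellH: number,
  gap: number
): { x: number; y: number } {
  const col = index % cols;
  const row = Math.floor(index / cols);
  return { x: col * (cellW + gap), y: row * (cellH + gap) };
}

// In the plugin's main thread (runs inside Figma):
//
// async function renderRun(results: SearchResult[]) {
//   for (const [i, r] of results.entries()) {
//     const frame = figma.createFrame();
//     const { x, y } = gridPosition(i, 6, 320, 240, 40);
//     frame.x = x;
//     frame.y = y;
//     frame.name = `#${r.rank} · ${r.query}`;
//     figma.currentPage.appendChild(frame);
//   }
// }
//
// Keyboard shortcuts live in the plugin's UI iframe (the main thread
// has no DOM): a keydown listener posts { verdict } to the main
// thread, and figma.ui.onmessage persists a LabelRecord, e.g. via
// figma.clientStorage.setAsync(...) or a network call to an eval store.
```

The split between a pure layout function and a thin `figma.*` wrapper is deliberate: the layout logic can be unit-tested outside the plugin runtime.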

The eval set itself was seeded from internal-designer interviews plus analysis of how people used Figma's file browser, spanning several query shapes:

  • Simple ("checkout screen")
  • Descriptive ("red website with green squiggly lines")
  • Specific ("project [codename] theme picker")

Why the in-product canvas wins

  • Zero bespoke tooling. The plugin API is already there. The infinite canvas is already there. The eval tool is code on top, not a separate internal service.
  • Correct primitive for visual content. An infinite canvas is designed for spatially-laid-out 2D artefacts. Reinventing this in a labeling tool is expensive.
  • Labelers already know the UI. Internal designers labeling search results know Figma already. No onboarding; no separate login; no context-switch-to-labeling-tool.
  • Historical overlay is natural. The canvas already supports multi-state overlays; past-run results vs current-run results is a natural extension.

When this applies

Good fit when all of the following hold:

  1. The product has a content-type-appropriate visualization primitive (canvas / graph / timeline / board).
  2. The product exposes a plugin or extension API with enough surface to render results, accept input, and persist labels.
  3. The labelers are internal employees already familiar with the product (or will be; light training OK).
  4. The evaluation task is about the content the product shows — search results, recommendations, clustering, layout quality.

Poor fit when:

  • The product doesn't have a visualization primitive matching the eval task (e.g. evaluating text-chat ranking in a canvas-only product adds friction instead of removing it).
  • Labelers are external contractors whose product familiarity is zero — onboarding to a separate, minimal labeling UI may actually be faster.
  • The eval dataset is huge (millions) and needs distributed labelers — the in-product labeling path usually scales worse than dedicated labeler platforms.

Relationship to LLM-judge labeling

Visual eval-grading canvas is a human-first labeling pattern. Dropbox Dash's patterns/human-calibrated-llm-labeling is the LLM-judge-scaled variant: humans label a small seed set, then an LLM judge (calibrated to agree with the humans) labels the production training set at ~100× scale.

The patterns are complementary, not competing:

  • Visual eval-grading canvas is a good way to produce the seed set a calibrated LLM judge would be tested against. The rapid-labeling UI + historical comparison view both reduce the cost of building and maintaining high-quality human judgement data.
  • Human-calibrated LLM labeling is a good way to scale past the human seed set when the labeling task is amenable to LLM judgement (text-heavy, rubric-articulable, within privacy compliance).
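One concrete form of the "calibrated to agree with humans" check is to measure chance-corrected agreement between the LLM judge and the human seed set before trusting the judge at scale. A sketch using Cohen's kappa — neither Figma's nor Dropbox's post describes this exact code, so treat it as one illustrative calibration metric:

```typescript
type Verdict = "correct" | "incorrect";

// Cohen's kappa: chance-corrected agreement between two labelers over
// the same items. kappa = (observed - expected) / (1 - expected).
function cohensKappa(human: Verdict[], judge: Verdict[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("label arrays must be the same non-zero length");
  }
  const n = human.length;
  let agree = 0;
  let humanCorrect = 0;
  let judgeCorrect = 0;
  for (let i = 0; i < n; i++) {
    if (human[i] === judge[i]) agree++;
    if (human[i] === "correct") humanCorrect++;
    if (judge[i] === "correct") judgeCorrect++;
  }
  const po = agree / n; // observed agreement
  // expected agreement if both labeled independently at their base rates
  const pe =
    (humanCorrect / n) * (judgeCorrect / n) +
    ((n - humanCorrect) / n) * ((n - judgeCorrect) / n);
  if (pe === 1) return 1; // both labelers constant and identical
  return (po - pe) / (1 - pe);
}
```

A judge would typically only be promoted past the seed set once kappa (or a similar agreement statistic) clears a threshold the eval owners have agreed on.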

Figma's post doesn't describe an LLM-judge extension to their canvas labeling; don't cite aspirationally.

Design questions

  1. Persisted labels go where? Directly in the product's own storage (Figma file metadata?) or a separate eval datastore?
  2. Who's allowed to label? Role-gated inside the plugin; labeler identity captured for inter-annotator-agreement later.
  3. Versioning of the eval set. Query set evolves; old labels should remain comparable. Probably version the query set, not individual labels.
  4. Blind-label mode. Historical comparison + side-by-side A/B introduces anchoring bias. Blind modes ("which of A or B is better?" without showing which is current prod) can help.
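The versioning question (3) can be made concrete by keying every label on a frozen query-set version, so runs of different models over the same query set stay comparable while labels from a changed query set are excluded. A sketch with hypothetical field and function names:

```typescript
// Hypothetical eval-store record: each label is keyed on the frozen
// query-set version it was produced against.
interface EvalLabel {
  querySetVersion: string; // e.g. "qs-v3": the frozen query list
  query: string;
  resultId: string;
  modelRun: string; // which model/index version produced the result
  verdict: "correct" | "incorrect";
  labeler: string; // captured for inter-annotator agreement later
}

// Precision of one model run over one query-set version. Labels from
// other query-set versions are excluded so comparisons between runs
// stay apples-to-apples.
function runPrecision(
  labels: EvalLabel[],
  querySetVersion: string,
  modelRun: string
): number {
  const relevant = labels.filter(
    (l) => l.querySetVersion === querySetVersion && l.modelRun === modelRun
  );
  if (relevant.length === 0) return NaN;
  const correct = relevant.filter((l) => l.verdict === "correct").length;
  return correct / relevant.length;
}
```

With this shape, the historical comparison view reduces to calling `runPrecision` for two `modelRun` values over the same `querySetVersion`.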

Caveats

  • Plugin API surface limits. Figma's plugin API is more restricted than the full product UI. Some UX moves possible in product code aren't possible in a plugin — design around what the API provides.
  • Doesn't generalise beyond visual content. This pattern works because Figma's content is designs on a canvas. A text-document search product would get less mileage from the "infinite canvas" part — though the "build on top of the product's own primitive" idea would still apply (e.g. label ranking on top of the product's own document-reader view).
  • Fails if labelers disengage. The "familiar UI" advantage disappears if labelers don't internalise the labeling rubric. The rubric itself still needs training — the canvas only removes the tool-friction part of the job.

Seen in
