PATTERN
Visual eval-grading canvas¶
Visual eval-grading canvas is the pattern of building the human-labeling UI inside the product itself — reusing the product's own visualization primitive and extension API — rather than building a separate bespoke labeling tool. The product's interaction surface becomes the evaluation interaction surface, minimising internal-tooling cost while giving the labelers (typically internal employees) a familiar interface.
Intent¶
Relevance labeling for search / ranking needs three properties:
- Low-friction input. Labelers should be able to mark correct/incorrect quickly — thousands of labels are common.
- Right visualization primitive. For visual content (designs, images, videos), a flat row-by-row labeling UI is inferior to one that lets the labeler see many candidates at once in spatial context.
- Version-to-version comparison. Labelers should see whether the model has improved since the last iteration, not just label each iteration in isolation.
If the product already has a UI primitive that solves (2) and (3) — e.g. an infinite canvas, a whiteboard, a node-graph editor, a timeline view — and exposes a plugin / extension API, building the labeling tool on top of those primitives is cheaper than building a separate tool from scratch.
Mechanism (Figma realization)¶
Figma AI Search's eval-grading tool (Source: sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma):
- Built on Figma's public plugin API — same API third-party plugins use.
- Displayed on Figma's infinite canvas — results laid out spatially, so the labeler sees many candidates at once with zoom / pan for detail.
- Keyboard shortcuts for labels — "mark correct" / "mark incorrect" bound to keys for rapid labeling; avoids the mouse-click round-trip that dominates labeling time in click-driven UIs.
- Historical comparison view — labelers can see whether the search model has improved between runs over the same query set.
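The rapid-labeling mechanics — one keystroke per label, with labels persisted per (query, candidate) — can be sketched in plain TypeScript. This is a hypothetical sketch, not Figma's actual plugin code: the key bindings, the `LabelStore` shape, and the auto-advance behaviour are all assumptions; in a real Figma plugin the key events would arrive from the plugin's UI iframe and the store would persist via `setPluginData` or a backend.

```typescript
// Hypothetical sketch of the rapid-labeling core: a label store keyed by
// (query, candidate), plus a keyboard dispatch table.

type Label = "correct" | "incorrect";

class LabelStore {
  private labels = new Map<string, Label>();

  private key(queryId: string, candidateId: string): string {
    return `${queryId}|${candidateId}`;
  }

  set(queryId: string, candidateId: string, label: Label): void {
    this.labels.set(this.key(queryId, candidateId), label);
  }

  get(queryId: string, candidateId: string): Label | undefined {
    return this.labels.get(this.key(queryId, candidateId));
  }

  count(): number {
    return this.labels.size;
  }
}

// Keyboard dispatch: one keystroke = one label, then advance the cursor
// so the labeler never reaches for the mouse.
const KEY_BINDINGS: Record<string, Label> = { c: "correct", x: "incorrect" };

interface Session {
  queryId: string;
  candidates: string[];
  cursor: number;
}

function handleKey(store: LabelStore, session: Session, key: string): boolean {
  const label = KEY_BINDINGS[key];
  if (label === undefined || session.cursor >= session.candidates.length) {
    return false; // unbound key, or nothing left to label
  }
  store.set(session.queryId, session.candidates[session.cursor], label);
  session.cursor += 1; // auto-advance to the next candidate
  return true;
}

// Example: three candidates labeled with three keystrokes.
const store = new LabelStore();
const session: Session = { queryId: "q1", candidates: ["a", "b", "c"], cursor: 0 };
handleKey(store, session, "c");
handleKey(store, session, "x");
handleKey(store, session, "c");
```

The auto-advancing cursor is the point: the per-label cost collapses to a single keypress, which is what makes thousands of labels per labeler tractable.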
The eval set itself was seeded from internal-designer interviews plus analysis of how people used Figma's file browser, spanning several query shapes:
- Simple ("checkout screen")
- Descriptive ("red website with green squiggly lines")
- Specific ("project [codename] theme picker")
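The three query shapes lend themselves to a tagged eval-set schema, which makes it easy to check the set stays balanced across shapes as it grows. A hypothetical sketch — the field names and the balance check are assumptions, not from the source; the example queries are the ones quoted above:

```typescript
// Hypothetical schema for the seeded eval set. The shape tags mirror the
// three query categories from the source.

type QueryShape = "simple" | "descriptive" | "specific";

interface EvalQuery {
  id: string;
  text: string;
  shape: QueryShape;
}

const evalSet: EvalQuery[] = [
  { id: "q1", text: "checkout screen", shape: "simple" },
  { id: "q2", text: "red website with green squiggly lines", shape: "descriptive" },
  { id: "q3", text: "project [codename] theme picker", shape: "specific" },
];

// Per-shape counts: a cheap guard against the eval set drifting toward
// only one query shape as new queries are added.
const byShape = evalSet.reduce<Record<QueryShape, number>>(
  (acc, q) => ({ ...acc, [q.shape]: acc[q.shape] + 1 }),
  { simple: 0, descriptive: 0, specific: 0 }
);
```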
Why the in-product canvas wins¶
- Zero bespoke tooling. The plugin API is already there. The infinite canvas is already there. The eval tool is code on top, not a separate internal service.
- Correct primitive for visual content. An infinite canvas is designed for spatially-laid-out 2D artefacts. Reinventing this in a labeling tool is expensive.
- Labelers already know the UI. Internal designers labeling search results know Figma already. No onboarding; no separate login; no context-switch-to-labeling-tool.
- Historical overlay is natural. The canvas already supports multi-state overlays; past-run results vs current-run results is a natural extension.
When this applies¶
Good fit when all of the following hold:
- The product has a content-type-appropriate visualization primitive (canvas / graph / timeline / board).
- The product exposes a plugin or extension API with enough surface to render results, accept input, and persist labels.
- The labelers are internal employees already familiar with the product (or will be; light training OK).
- The evaluation task is about the content the product shows — search results, recommendations, clustering, layout quality.
Poor fit when:
- The product doesn't have a visualization primitive matching the eval task (e.g. evaluating text-chat ranking in a canvas-only product adds friction instead of removing it).
- Labelers are external contractors whose product familiarity is zero — onboarding to a separate, minimal labeling UI may actually be faster.
- The eval dataset is huge (millions) and needs distributed labelers — the in-product labeling path usually scales worse than dedicated labeler platforms.
Relationship to LLM-judge labeling¶
Visual eval-grading canvas is a human-first labeling pattern. Dropbox Dash's patterns/human-calibrated-llm-labeling is the LLM-as-judge-scaled variant: humans label a small seed set, an LLM judge (calibrated to agree with humans) labels the production training set at ~100× scale.
The patterns are complementary, not competing:
- Visual eval-grading canvas is a good way to produce the seed set a calibrated LLM judge would be tested against. The rapid-labeling UI + historical comparison view both reduce the cost of building and maintaining high-quality human judgement data.
- Human-calibrated LLM labeling is a good way to scale past the human seed set when the labeling task is amenable to LLM judgement (text-heavy, rubric-articulable, within privacy compliance).
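The handoff between the two patterns reduces to one number: the LLM judge's agreement rate against the human seed set. A hypothetical sketch of that gate — the 0.9 threshold, the function names, and the toy judge are all assumptions, not from either source:

```typescript
// Hypothetical sketch: human labels from the canvas tool form a seed set;
// an LLM judge is only trusted to scale labeling once its agreement with
// that seed set clears a threshold.

type Label = "correct" | "incorrect";

interface SeedItem {
  queryId: string;
  candidateId: string;
  humanLabel: Label;
}

function agreementRate(
  seed: SeedItem[],
  judge: (queryId: string, candidateId: string) => Label
): number {
  if (seed.length === 0) return 0;
  const agreed = seed.filter(
    (item) => judge(item.queryId, item.candidateId) === item.humanLabel
  ).length;
  return agreed / seed.length;
}

// Assumed threshold — in practice this is a product decision.
function judgeIsCalibrated(rate: number, threshold = 0.9): boolean {
  return rate >= threshold;
}

// Example with a toy judge that disagrees on one item out of four.
const seed: SeedItem[] = [
  { queryId: "q1", candidateId: "a", humanLabel: "correct" },
  { queryId: "q1", candidateId: "b", humanLabel: "incorrect" },
  { queryId: "q2", candidateId: "c", humanLabel: "correct" },
  { queryId: "q2", candidateId: "d", humanLabel: "correct" },
];
const toyJudge = (_q: string, c: string): Label =>
  c === "d" ? "incorrect" : seed.find((s) => s.candidateId === c)!.humanLabel;
const rate = agreementRate(seed, toyJudge);
```

At 3/4 agreement the toy judge fails the gate, so the humans keep labeling — which is exactly the failure mode the seed set exists to catch.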
Figma's post doesn't describe an LLM-judge extension to their canvas labeling; don't cite aspirationally.
Design questions¶
- Persisted labels go where? Directly in the product's own storage (Figma file metadata?) or a separate eval datastore?
- Who's allowed to label? Role-gated inside the plugin; labeler identity captured for inter-annotator-agreement later.
- Versioning of the eval set. Query set evolves; old labels should remain comparable. Probably version the query set, not individual labels.
- Blind-label mode. Historical comparison + side-by-side A/B introduces anchoring bias. Blind modes ("which of A or B is better?" without showing which is current prod) can help.
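The blind-label mode from the last design question can be sketched: randomise which side is prod before presenting, and only unblind after the vote. A hypothetical sketch — names, shapes, and the coin-flip mechanism are assumptions:

```typescript
// Hypothetical sketch of blind A/B labeling: the labeler sees "A" and "B"
// with a randomised assignment, so the judgment can't anchor on knowing
// which side is current prod.

interface BlindPair {
  queryId: string;
  sideA: string[]; // result list shown as "A"
  sideB: string[]; // result list shown as "B"
  prodIsA: boolean; // hidden from the labeler until after the vote
}

function makeBlindPair(
  queryId: string,
  prodResults: string[],
  candidateResults: string[],
  rng: () => number = Math.random
): BlindPair {
  const prodIsA = rng() < 0.5; // coin flip hides which side is prod
  return prodIsA
    ? { queryId, sideA: prodResults, sideB: candidateResults, prodIsA }
    : { queryId, sideA: candidateResults, sideB: prodResults, prodIsA };
}

// After the labeler votes "A" or "B", unblind to a prod-vs-candidate outcome.
function resolveVote(pair: BlindPair, vote: "A" | "B"): "prod" | "candidate" {
  return (vote === "A") === pair.prodIsA ? "prod" : "candidate";
}

// Example with a deterministic "rng" so the pairing is reproducible here.
const pair = makeBlindPair("q1", ["p1", "p2"], ["c1", "c2"], () => 0.9);
const winner = resolveVote(pair, "A");
```

Storing `prodIsA` alongside the vote (rather than resolving eagerly) also lets you audit later whether labelers drifted toward always picking one side.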
Caveats¶
- Plugin API surface limits. Figma's plugin API is more restricted than the full product UI. Some UX moves possible in product code aren't possible in a plugin — design around what the API provides.
- Doesn't generalise beyond visual content. This pattern works because Figma's content is designs on a canvas. A text-document search product would get less mileage from the "infinite canvas" part — though the "build on top of the product's own primitive" idea would still apply (e.g. label ranking on top of the product's own document-reader view).
- Fails if labelers disengage. The "familiar UI" advantage evaporates if labelers don't internalise the labeling rubric. The rubric itself still needs training — the canvas just removes the tool-friction part of the job.
Seen in¶
- sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma — Figma AI Search eval tool: infinite canvas + plugin API + keyboard shortcuts for correct/incorrect + historical-run comparison. Eval set seeded from internal-designer interviews + file-browser usage analysis.
Related¶
- systems/figma-ai-search — canonical instance.
- concepts/relevance-labeling — the label-generation activity this tool operationalises.
- patterns/human-calibrated-llm-labeling — the LLM-scaled labeling complement.