

Visual eval-grading canvas

Visual eval-grading canvas is the pattern of building the human-labeling UI inside the product itself — reusing the product's own visualization primitive and extension API — rather than building a separate bespoke labeling tool. The product's interaction surface becomes the evaluation interaction surface, minimising internal-tooling cost while giving the labelers (typically internal employees) a familiar interface.

Intent

Relevance labeling for search / ranking needs three properties:

  1. Low-friction input. Labelers should be able to mark results correct/incorrect quickly — thousands of labels are common.
  2. Right visualization primitive. For visual content (designs, images, videos), a flat row-by-row labeling UI is inferior to one that lets the labeler see many candidates at once in spatial context.
  3. Version-to-version comparison. Labelers should see whether the model has improved since the last iteration, not just label each iteration in isolation.

If the product already has a UI primitive that solves (2) and (3) — e.g. an infinite canvas, a whiteboard, a node-graph editor, a timeline view — and exposes a plugin / extension API, building the labeling tool on top of those primitives is cheaper than building a separate tool from scratch.

Mechanism (Figma realization)

Figma AI Search's eval-grading tool (Source: sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma):

  • Built on Figma's public plugin API — same API third-party plugins use.
  • Displayed on Figma's infinite canvas — results laid out spatially, so the labeler sees many candidates at once with zoom / pan for detail.
  • Keyboard shortcuts for labels — "mark correct" / "mark incorrect" bound to keys for rapid labeling; avoids the mouse-click round-trip that dominates labeling time in click-driven UIs.
  • Historical comparison view — labelers can see whether the search model has improved between runs over the same query set.
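A minimal sketch of how such a plugin could be structured — all names (`SearchResult`, `LabelRecord`, `gridPosition`) are hypothetical, not Figma's actual tool; the `figma.*` calls shown in comments are from the public plugin API:

```typescript
// Sketch of an eval-grading plugin. The grid math is pure; the
// figma.* integration is shown as comments because it needs the
// plugin runtime's `figma` global.

interface SearchResult {
  query: string;
  nodeId: string; // id of the candidate design to display
  rank: number;
}

interface LabelRecord {
  query: string;
  nodeId: string;
  verdict: "correct" | "incorrect";
  labeler: string;
  labeledAt: number; // epoch ms
}

// Pure layout: place the i-th result in a fixed-width grid so many
// candidates are visible at once on the infinite canvas.
function gridPosition(
  index: number,
  cols: number,
  cellW: number,
  cellH: number,
  gap: number
): { x: number; y: number } {
  const col = index % cols;
  const row = Math.floor(index / cols);
  return { x: col * (cellW + gap), y: row * (cellH + gap) };
}

// In the plugin's main thread (runs inside Figma):
//
// async function renderRun(results: SearchResult[]) {
//   for (const [i, r] of results.entries()) {
//     const frame = figma.createFrame();
//     const { x, y } = gridPosition(i, 6, 320, 240, 40);
//     frame.x = x;
//     frame.y = y;
//     frame.name = `#${r.rank} · ${r.query}`;
//     figma.currentPage.appendChild(frame);
//   }
// }
//
// Keyboard shortcuts live in the plugin's UI iframe (the main thread
// has no DOM): a keydown listener posts { verdict } to the main
// thread, and figma.ui.onmessage persists a LabelRecord, e.g. via
// figma.clientStorage.setAsync(...) or a network call to an eval store.
```

The split between a pure layout function and a thin `figma.*` wrapper is deliberate: the layout logic can be unit-tested outside the plugin runtime.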

The eval set itself was seeded from internal-designer interviews plus analysis of how people used Figma's file browser, spanning several query shapes:

  • Simple ("checkout screen")
  • Descriptive ("red website with green squiggly lines")
  • Specific ("project [codename] theme picker")

Why the in-product canvas wins

  • Zero bespoke tooling. The plugin API is already there. The infinite canvas is already there. The eval tool is code on top, not a separate internal service.
  • Correct primitive for visual content. An infinite canvas is designed for spatially-laid-out 2D artefacts. Reinventing this in a labeling tool is expensive.
  • Labelers already know the UI. Internal designers labeling search results know Figma already. No onboarding; no separate login; no context-switch-to-labeling-tool.
  • Historical overlay is natural. The canvas already supports multi-state overlays; past-run results vs current-run results is a natural extension.

When this applies

Good fit when all of the following hold:

  1. The product has a content-type-appropriate visualization primitive (canvas / graph / timeline / board).
  2. The product exposes a plugin or extension API with enough surface to render results, accept input, and persist labels.
  3. The labelers are internal employees already familiar with the product (or will be; light training OK).
  4. The evaluation task is about the content the product shows — search results, recommendations, clustering, layout quality.

Poor fit when:

  • The product doesn't have a visualization primitive matching the eval task (e.g. evaluating text-chat ranking in a canvas-only product adds friction instead of removing it).
  • Labelers are external contractors whose product familiarity is zero — onboarding to a separate, minimal labeling UI may actually be faster.
  • The eval dataset is huge (millions) and needs distributed labelers — the in-product labeling path usually scales worse than dedicated labeler platforms.

Relationship to LLM-judge labeling

Visual eval-grading canvas is a human-first labeling pattern. Dropbox Dash's patterns/human-calibrated-llm-labeling is the LLM-judge-scaled variant: humans label a small seed set, then an LLM judge (calibrated to agree with the humans) labels the production training set at ~100× scale.

The patterns are complementary, not competing:

  • Visual eval-grading canvas is a good way to produce the seed set a calibrated LLM judge would be tested against. The rapid-labeling UI + historical comparison view both reduce the cost of building and maintaining high-quality human judgement data.
  • Human-calibrated LLM labeling is a good way to scale past the human seed set when the labeling task is amenable to LLM judgement (text-heavy, rubric-articulable, within privacy compliance).
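One concrete form of the "calibrated to agree with humans" check is to measure chance-corrected agreement between the LLM judge and the human seed set before trusting the judge at scale. A sketch using Cohen's kappa — neither Figma's nor Dropbox's post describes this exact code, so treat it as one illustrative calibration metric:

```typescript
type Verdict = "correct" | "incorrect";

// Cohen's kappa: chance-corrected agreement between two labelers over
// the same items. kappa = (observed - expected) / (1 - expected).
function cohensKappa(human: Verdict[], judge: Verdict[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("label arrays must be the same non-zero length");
  }
  const n = human.length;
  let agree = 0;
  let humanCorrect = 0;
  let judgeCorrect = 0;
  for (let i = 0; i < n; i++) {
    if (human[i] === judge[i]) agree++;
    if (human[i] === "correct") humanCorrect++;
    if (judge[i] === "correct") judgeCorrect++;
  }
  const po = agree / n; // observed agreement
  // expected agreement if both labeled independently at their base rates
  const pe =
    (humanCorrect / n) * (judgeCorrect / n) +
    ((n - humanCorrect) / n) * ((n - judgeCorrect) / n);
  if (pe === 1) return 1; // both labelers constant and identical
  return (po - pe) / (1 - pe);
}
```

A judge would typically only be promoted past the seed set once kappa (or a similar agreement statistic) clears a threshold the eval owners have agreed on.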

Figma's post doesn't describe an LLM-judge extension to their canvas labeling; don't cite aspirationally.

Design questions

  1. Persisted labels go where? Directly in the product's own storage (Figma file metadata?) or a separate eval datastore?
  2. Who's allowed to label? Role-gated inside the plugin; labeler identity captured for inter-annotator-agreement later.
  3. Versioning of the eval set. Query set evolves; old labels should remain comparable. Probably version the query set, not individual labels.
  4. Blind-label mode. Historical comparison + side-by-side A/B introduces anchoring bias. Blind modes ("which of A or B is better?" without showing which is current prod) can help.
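The versioning question (3) can be made concrete by keying every label on a frozen query-set version, so runs of different models over the same query set stay comparable while labels from a changed query set are excluded. A sketch with hypothetical field and function names:

```typescript
// Hypothetical eval-store record: each label is keyed on the frozen
// query-set version it was produced against.
interface EvalLabel {
  querySetVersion: string; // e.g. "qs-v3": the frozen query list
  query: string;
  resultId: string;
  modelRun: string; // which model/index version produced the result
  verdict: "correct" | "incorrect";
  labeler: string; // captured for inter-annotator agreement later
}

// Precision of one model run over one query-set version. Labels from
// other query-set versions are excluded so comparisons between runs
// stay apples-to-apples.
function runPrecision(
  labels: EvalLabel[],
  querySetVersion: string,
  modelRun: string
): number {
  const relevant = labels.filter(
    (l) => l.querySetVersion === querySetVersion && l.modelRun === modelRun
  );
  if (relevant.length === 0) return NaN;
  const correct = relevant.filter((l) => l.verdict === "correct").length;
  return correct / relevant.length;
}
```

With this shape, the historical comparison view reduces to calling `runPrecision` for two `modelRun` values over the same `querySetVersion`.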

Caveats

  • Plugin API surface limits. Figma's plugin API is more restricted than the full product UI. Some UX moves possible in product code aren't possible in a plugin — design around what the API provides.
  • Doesn't generalise beyond visual content. This pattern works because Figma's content is designs on a canvas. A text-document search product would get less mileage from the "infinite canvas" part — though the "build on top of the product's own primitive" idea would still apply (e.g. label ranking on top of the product's own document-reader view).
  • Fails if labelers disengage. The "familiar UI" advantage disappears if labelers don't internalise the labeling rubric. The rubric itself still needs training — the canvas only removes the tool-friction part of the job.

Seen in
