Visibility at Scale: How Figma Detects Sensitive Data Exposure¶
Summary¶
Figma describes Response Sampling, a two-phase security detection
system that inspects a configurable fraction of outbound API responses
for sensitive data exposure — authorization gaps and unintended field
leakage. Phase 1 (Permission Auditor) samples responses for file
identifiers and asynchronously re-verifies the requesting user's
permission against PermissionsV2.
Phase 2 (Sensitive Data Analyzer, "fancy Response Sampling")
generalizes the same middleware to any field flagged by FigTag,
Figma's internal column-level data classification tool, using the
banned_from_clients category as the signal that a value must not
appear in API responses under normal circumstances. Both phases run
in both staging and production as an observability layer on top of
preventive authorization controls, surfacing regressions a code
review or unit test could not catch.
Key takeaways¶
- Detection as a platform-security layer on top of prevention. Figma had strong preventive controls — PermissionsV2, negative unit tests, E2E tests, penetration testing, bug-bounty — yet chose to add continuous detection modelled on infrastructure-security techniques. Key framing: "We approached this problem with a platform-security mindset — treating our application surfaces like infrastructure and layering continuous monitoring and detection controls on top."
- Start with a narrow, high-entropy sensitive token, then generalize. File identifiers were the Phase-1 anchor because they are "high-entropy capability tokens with a known character set and consistent length" — easy to detect in text streams. The infrastructure built around them generalized once FigTag provided a systematic definition of sensitivity.
- Middleware in the application server, not a proxy. An after block in the Ruby application server beats an Envoy proxy because the application tier has "direct access to the authenticated user object, the full API response body, and our internal permissions system". A proxy would need to reconstruct user context and couldn't call the authorization engine with user-aware checks.
- Async verification + non-blocking failure mode. The after filter extracts candidate identifiers synchronously, then enqueues async jobs for permission verification. If sampling or verification fails, the request still completes — detection never blocks the user-facing path.
- FigTag as the classification substrate. FigTag annotates every DB column with a category describing sensitivity and intended usage; annotations propagate to the data warehouse. The banned_from_clients category flags fields that must not appear in API responses (security identifiers, billing details, PII). An ActiveRecord callback records loaded sensitive column values into request-local storage on sampled requests; the after filter compares the serialized JSON against the recorded values.
- Cross-service sampling via unified endpoint. LiveGraph submits its own sampling data to the same internal endpoint after producing a response, reusing the processing pipeline without adding overhead to its real-time data flow. Sampling rate is gated by configuration and rate limiting; findings share a schema and logging path across services, unified in the analytics warehouse and triage dashboards.
- Flexible allowlisting for intentional exposure. An OAuth client secret is legitimately returned from a dedicated credential-management endpoint for authorized users, but would be a critical finding if it appeared in an unrelated response. Dynamic configuration lets the team exclude known-safe cases without redeploy — critical for keeping the false-positive rate low enough that alerts stay trustworthy.
- Real findings from the rollout. Within days: file identifiers returned in responses unnecessarily (triggered better data filtering), paths where certain files bypassed permission checks entirely (closed the gaps), long-unused data fields making their way into some responses, resource lists returned without per-item access verification (enhanced permission checks).
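The Phase-1 shape described above — sample at a configured rate, scan the response body for file-key-like tokens, enqueue async verification, and never fail the request — can be sketched as follows. This is a hypothetical Ruby sketch, not Figma's code: the key pattern, sample rate, and queue shape are all assumptions (the post only says file identifiers are high-entropy tokens with a known character set and consistent length).

```ruby
# Assumed key shape: 22-char base-62 token (illustrative, not Figma's real format).
FILE_KEY_PATTERN = /\b[0-9A-Za-z]{22}\b/
SAMPLE_RATE = 0.01 # configurable fraction of responses to inspect

# Runs in an after filter: cheap synchronous extraction, async verification.
def sample_response(user_id, body, rate: SAMPLE_RATE, rng: Random.new, queue: [])
  return queue unless rng.rand < rate # skip most requests with one cheap check
  jobs = body.scan(FILE_KEY_PATTERN).uniq.map do |key|
    { user_id: user_id, file_key: key } # async job re-checks PermissionsV2 later
  end
  queue.concat(jobs)
rescue StandardError
  queue # detection must never block or fail the user-facing request
end
```

With `rate: 1.0` every response is inspected; in production a small rate plus rate limiting keeps overhead negligible, which matches the post's emphasis on tuning performance impact first.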
Operational lessons (from the post)¶
- Performance impact is the first thing to tune. Even small monitoring can add latency. Tuning sampling rates + running checks asynchronously preserved user-facing performance while still yielding meaningful visibility.
- Manage false positives or they'll manage you. High FP rate overwhelms teams and destroys trust in alerts. Dynamic allowlisting + rigorous triage workflows filter known-safe cases quickly so engineers focus on genuinely risky findings.
- Context matters. Not all sensitive-data exposures are equally problematic. Dynamic configuration tunes detection rules without redeploy.
- Layered defense across environments. Running in both staging and production gives two detection lines — early detection before release + ongoing monitoring for regressions.
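The dynamic-allowlist idea — suppress findings for known-safe endpoint/field pairs without a redeploy — reduces to a small rule match. A minimal sketch, assuming a rule set loaded from runtime configuration (the rule shape and names are illustrative; the post does not describe the actual config format):

```ruby
# Hypothetical allowlist rules, in practice loaded from dynamic configuration.
ALLOWLIST = [
  { endpoint: "/api/oauth/credentials", field: "client_secret" } # intentional exposure
].freeze

# A finding is suppressed only when a rule matches both endpoint and field,
# so client_secret appearing anywhere else still alerts.
def suppressed?(finding, rules = ALLOWLIST)
  rules.any? { |r| r[:endpoint] == finding[:endpoint] && r[:field] == finding[:field] }
end
```

Because the rules live in configuration rather than code, triage can add a known-safe case in minutes, keeping the false-positive rate low enough that remaining alerts stay trustworthy.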
Architecture shape¶
request → Ruby app server
│
├──(sync) handler produces response body
│
└──(after filter, sampled at config rate)
│
├── Phase 1 (Permission Auditor):
│ parse JSON → extract file identifiers →
│ enqueue async job → job re-checks PermissionsV2
│ for user × identifier → false positives filtered →
│ finding logged
│
└── Phase 2 (Sensitive Data Analyzer):
ActiveRecord callback records
banned_from_clients column values
into request-local storage during handler
execution → after filter compares serialized
JSON against recorded sensitive values →
finding logged if any appear
(LiveGraph posts its own sampled responses to an internal endpoint
→ same pipeline; findings unified in analytics warehouse + triage)
Caveats & gaps¶
- No disclosed sampling rate, QPS, or p50/p99 overhead numbers.
- No false-positive / true-positive rate disclosed.
- No detail on the async job system (worker pool, retry, DLQ).
- FigTag's internal schema / tagging propagation to the data warehouse is described only conceptually.
- "Rate limiting to prevent the processing pipeline from becoming overloaded" is mentioned but not quantified.
- Future work: broader PII categories, regulated data, finer-grained sampling controls, automated triage, richer trend reporting — all stated as direction, not shipped.
Source¶
- Original: https://www.figma.com/blog/visibility-at-scale-how-figma-detects-sensitive-data-exposure/
- Raw markdown:
raw/figma/2026-04-21-visibility-at-scale-how-figma-detects-sensitive-data-exposur-e8b47b43.md
Related¶
- Company: companies/figma
- Systems: systems/figma-response-sampling, systems/figtag, systems/figma-permissions-dsl, systems/livegraph, systems/envoy
- Concepts: concepts/sensitive-data-exposure, concepts/detection-in-depth, concepts/response-body-sampling, concepts/data-classification-tagging, concepts/false-positive-management
- Patterns: patterns/response-sampling-for-authz, patterns/platform-security-at-application-layer, patterns/async-middleware-inspection, patterns/dynamic-allowlist-for-safe-exposure, patterns/field-level-sensitivity-tagging
- Sibling Figma security posts: sources/2026-04-21-figma-server-side-sandboxing-virtual-machines, sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp, sources/2026-04-21-figma-rolling-out-santa-without-freezing-productivity, sources/2026-04-21-figma-enforcing-device-trust-on-code-changes, sources/2026-04-21-figma-how-we-built-a-custom-permissions-dsl