Figma

Figma Engineering blog — Figma is a browser-based collaborative design tool. Its client is a C++ application compiled to WebAssembly + a TypeScript UI layer, so engineering content skews toward client-perf and build-tooling topics (unusual for a web product) plus the usual backend / infra / live-collaboration posts.

Tier: Figma is not in AGENTS.md's formal Tier 1/2/3 lists; treat as Tier-3-equivalent — apply the Tier-3 selectivity filter (skip pure product-PR / hiring / design-trend posts). Distributed-systems-internals, build-systems, and client-perf engineering posts with numbers are on-topic.

Architectural framing: Figma's 2026-04-21 game-engine-inspiration post (sources/2026-04-21-figma-how-figma-draws-inspiration-from-the-gaming-world) is the umbrella article for how the rest of Figma's engineering content composes. The client architecture is a game-engine stack adapted for the browser (patterns/game-engine-stack-for-web-canvas) — C++ → WebAssembly for the canvas, React + TypeScript for the UI shell, Rust for the multiplayer server. Every product feature lands as a named system (concepts/game-engine-architecture) — multiplayer, collision, animation, chat/audio, auto-layout, component/variant, widget/plugin, plus Figma-specific systems (permissions DSL, parameter runtime, materializer, etc.). Composition brings cross-subsystem bug propagation — the connector/autosave cascade is the canonical disclosed instance. The [[sources/2026-04-21-figma-rendering-powered-by-webgpu|2026-04-21 WebGPU rendering deep-dive]] is the canvas-leg corroborator for the game-engine framing at the graphics-API layer: explicit per-draw state (concepts/explicit-graphics-state), compute shaders (concepts/compute-shader), and shared C++ code across browser Wasm and native via Dawn.

Key systems

  • Canvas renderer (WebGPU-capable, C++/WASM) — Figma's browser canvas is a C++ codebase compiled to WebAssembly via Emscripten; the same C++ also compiles natively (x64/arm64) for server-side rendering. Historically a WebGL renderer; as of 2026-04-21 ships a WebGPU backend as a peer via a graphics-API interface layer with encode/submit uniform batching, a custom shader translator (naga-backed for WGSL emission), telemetry-driven device blocklist, and mid-session WebGPU→WebGL fallback. Shares WebGPU implementation with native targets via Dawn. Downstream: RenderServer is the server-side rendering consumer of this same code path.
  • systems/figma-ai-search — Figma's AI-powered search feature (shipped at Config 2024) combining visual search (query by screenshot / selected frame / sketch) and semantic search (query by natural text against component names/descriptions/files even when terminology doesn't match). Originated in a June 2023 three-day AI hackathon whose most ambitious prototype was design autocomplete; user research on the prototype revealed 75% of Figma-canvas objects come from other files — search became the higher-leverage ship. Indexing policy stacks heuristics (patterns/selective-indexing-heuristics): top-level frames at common UI dimensions + non-top-level exceptions, near-duplicate collapsing, file-copy skipping, experimental ready-for-dev quality signals; plus patterns/edit-quiescence-indexing (index only after 4h of no edits — keeps WIP out, sheds load). Eval tool built on Figma's own infinite canvas + public plugin API + keyboard shortcuts (patterns/visual-eval-grading-canvas); eval set seeded from internal-designer interviews + file-browser usage analysis. Product bar: deliver across similarity tiers (exact / near-similar / broad) because users start from close matches and expand outward — "if we couldn't prove we could find the needle in the haystack, designers wouldn't trust the feature for broader exploration." Surfaced in Actions panel with peek previews + CMD+Enter full-screen drill-down. Four shipping principles: AI-for-existing-workflows / rapid iteration / systematic quality checks / cross-disciplinary teamwork. Infrastructure (2026-04-21 companion post): embeddings via open-source CLIP (multimodal — text and image into the same space); storage = DynamoDB (metadata + embeddings) + S3 (thumbnails) + OpenSearch k-NN (vector index); inference on SageMaker with batch-size sweet spot. Indexing decomposed into four discrete queued jobs (patterns/pipeline-stage-as-discrete-job) so each stage's batching + retry tunes independently. 
Enumerating indexable frames runs a headless server-side C++ build of the Figma editor — thumbnail rendering moved from GPU to CPU llvmpipe on newer instances for cost, and from Ruby+JSON to C++ for memory. Edit-quiescence (4h) quantified: cuts to 12% of data processed. Index size halved by excluding drafts, in-file duplicates, and unmodified copies. Query path is hybrid lexical + vector — two OpenSearch indexes queried simultaneously, scores min-max-normalized per index, exact-match boosted, interleaved (patterns/hybrid-lexical-vector-interleaving). concepts/vector-quantization compresses embeddings in-index. Two OpenSearch bugs reported candidly: segment-replication replica non-determinism (fixed upstream in k-NN PR #1808) and _source slimming wiping embeddings on updates (patterns/source-field-slimming-with-external-refetch — fix: re-fetch from DynamoDB).
  • systems/diwydu — Don't Include What You Don't Use: Figma's libclang-based tool that flags #include directives whose symbols are never directly referenced in the including file. Deliberately laxer than Google's systems/include-what-you-use to make retrofitting a large C++ codebase tractable. Runs in CI on feature branches.
  • systems/includes-py — pure-Python (no Clang) static transitive-byte counter over the C++ include DAG; runs in CI on every PR; warns on regressions in post-pre-processing byte count per source file. The CI gate that prevents 50-100 build-time regressions per day.
  • systems/skew — compile-to-JS language Figma cultivated for its prototype-viewer / mobile codebase (~2014–2024); migrated off to systems/typescript over 2020–2024. Static types, optimizing compiler with devirtualization. Author Evan Wallace later wrote systems/esbuild.
  • systems/figma-parameter-runtime — unified 2024–2026 substrate powering Figma's two parameter systems (component properties + variables) with a single typespace, a single binding store (invariant: at most one parameter per bound property), and a shared four-stage runtime (parameter-usage tracking / property-granular invalidation / transitive resolution / update). Unlocked component-property-to-variable binding and a speed-up on variable-mode + variable-value changes; hosts a prospective third parameter system (Figma Sites CMS) for free. Sibling in-memory bidirectional graph to QueryGraph over the same object-tree document model, indexing different edges (parameter-to-bound-property vs read+write deps between nodes).
  • systems/figma-materializer — 2026 generic client-runtime framework for maintaining derived subtrees of the document tree via feature-owned blueprints. Replaces the 2016-era Instance Updater (accreted bespoke logic for auto layout / variants / component properties / variables over a decade). Reactivity model: push-based invalidation with automatic dependency tracking — deps recorded implicitly as nodes read data during materialization. Pull-based explicitly rejected because Figma's cross-tree references + deep nesting force reconstructing large dep chains on every read. Shipped clients: component instances (ported), rich text nodes (first net-new feature built on it), slots (open beta April 2026 — composes on top rather than reintroducing bespoke reactivity). Reported canonical impact: variable-mode changes in large files 40–50% faster, "representative of broader gains." Rolled out behind months of side-by-side runtime validation against hundreds of thousands of real files (gate: matched correctness AND matched performance). Sibling-third client reactive graph alongside QueryGraph + Parameter Runtime over the same object-tree document model; the parallel runtime-orchestration unification described in the same post surfaced + eliminated "back-dirties" (concepts/back-dirty), moving client runtime toward unidirectional flow (patterns/runtime-orchestration-unidirectional-flow).
  • systems/figma-multiplayer-querygraph — Figma's real-time-collab server + in-memory bidirectional dependency graph (read + write deps) over file nodes. Foundational 2019 architecture: client-server over WebSocket, one server process per multiplayer document, client-downloads-full-state-on-open with offline-edit replay on reconnect, object-tree data model (Map<ObjectID, Map<Property, Value>>), CRDT-inspired centralized reconciliation (OT explicitly rejected). 2024 QueryGraph extension adds per-session subscribed subsets by reachability; fans edits out server-side only to sessions whose subscription set reaches the edited node. Load-path optimizations: backend preload hint (300–500 ms p75 savings) + parallel decoding via persisted raw offsets (>40% decode-time cut).
  • Compute platform — ran on ECS on EC2 through early 2023; migrated to EKS (three-active-cluster topology) in <12 months, with majority cutover by Jan 2024. Chose EKS for StatefulSets, Helm, CNCF auto-scaling (systems/karpenter / systems/keda), graceful node-drain, and the service-mesh roadmap. Single-step service-definition via a per-service Bazel config; CI generates Kubernetes YAMLs applied one-step by Figma's in-house deploy system.
  • systems/figma-commit-signature-verification — Figma's supply-chain security system: every Git commit pushed to the internal monorepo is cryptographically verified to have been S/MIME-signed with a current device-trust X.509 cert (rotates every 15 days; lives in the MacBook's macOS Keychain). Built as a GitHub App (scoped to read code + write commit status checks only — canonical concepts/least-privileged-access) + an AWS Lambda behind a Function URL webhook. Credentials held in Secrets Manager. Posts the commit-integrity-verification commit status that release-branch protection requires to merge. Bot commits pass through an author allowlist + optional diff-heuristics (e.g. fail Dependabot if it touches non-dependency files).
  • systems/smimesign-figma — Figma's minimally-modified fork of GitHub's smimesign S/MIME Git signer. Adds one flag, --get-figmate-key-id, that walks the macOS Keychain and returns the current device-trust cert key id — the dynamic-lookup primitive that bridges Git's static user.signingkey contract with Figma's 15-day rotating certs.
  • systems/figma-renderserver — Figma's C++ server-side headless editor used for thumbnailing and image / SVG export over user-supplied Figma files. Runs in two different server-side sandboxes per use case: full GPU path in nsjail (user + pid + mount + network namespaces, no network, specific mount points only, seccomp-bpf — chosen over Docker as a drop-in to avoid orchestrated-service rearchitecture), non-GPU path in seccomp-only after a source-code refactor that reorders all openat calls before any image processing so a restrictive libseccomp filter lands mid-program. Seccomp-only trade-offs disclosed honestly: easier to test / debug + significantly faster than nsjail, but locks RenderServer into single-threaded with no dynamic font / image loading. Rollout surfaced nsjail's default rlimit_fsize=1MB silently truncating outputs for large-image inputs, plus several seccomp-allowlist iterations as production hit rare codepaths.
  • systems/figcache — Figma's stateless, horizontally-scalable RESP-wire-protocol caching proxy sitting between applications and a fleet of ElastiCache Redis clusters. Unified Redis data plane: multiplexes many client connections onto a small pool of Redis connections (post-rollout: order-of-magnitude reduction in Redis cluster connection counts + dramatically less volatile during diurnal traffic); centralized multi-cluster routing via a dynamically-assembled engine tree of Router/Redis/Static primitives authored in Starlark; fanout filter engine transparently scatters read-only multi-shard pipelines as parallel scatter-gather (sidesteps Redis Cluster's CROSSSLOT); Redis Cluster emulation shim makes migration a one-line endpoint config change (patterns/protocol-compatible-drop-in-proxy); uniform metrics/logs/traces per command with workload-ownership classification → incident diagnosis hours/days → minutes. Rolled out H2 2025 for Figma's main API service → six-nines uptime on the caching layer. Built by the Storage Products team.
  • systems/respc — Go library inside FigCache's frontend providing an RPC framework over RESP. Four components (server layer, streaming RESP protocol parser, schema-driven structured command parser, implementation-agnostic command dispatch). Schema registry declaratively expresses supported command sequences with annotated arguments — the load-bearing piece that converts opaque bytes into semantically-rich typed commands, enabling every downstream guardrail / custom command / fanout resolution FigCache is built on.
  • systems/livegraph — Figma's real-time data-fetching service: GraphQL-like web API over WebSocket, clients subscribe to queries and receive a JSON tree that stays live. Backs everything non-document (comments, file lists, team membership, optimistic UI, FigJam voting) against RDS Postgres via DBProxy. Learns of changes by tailing the WAL logical replication stream (CDC). 2024–2026 "LiveGraph 100x" rebuild: split monolithic one-server design (mutation-based in-memory cache, single global replication stream) into three independently-scaled Go services (patterns/independent-scaling-tiers) — edge (session / view-query expansion), cache (read-through, sharded by hash(easy-expr), cuckoo-filter fan-out to edges), invalidator (stateless, sharded like the DB, tails WAL per shard). Core unlocks: LiveGraph traffic is driven by initial reads not live updates (so invalidation-based caching with re-query is viable), and most queries are easy to invalidate from the schema alone (so the invalidator is stateless). Query shapes assign stable IDs to un-parameterized queries; mutations pop (shape, args) tuples via substitution without a live-query table. ~700 shapes total, only 11 "hard" (range/inequality); handled via patterns/nonce-bulk-eviction (cache co-locates by hash(easy-expr), two-layer keys with a nonce; easy-expr invalidation deletes the nonce → all hard-query keys orphaned in one op). Concurrency contract: read-invalidation rendezvous — three rules (same-type coalescing, read-during-invalidation blocking, invalidation-during-read blocking) guarantee no invalidation is ever silently overwritten by a racing stale read. Validated by chaos test + online cache verification + old-vs-new convergence checker.
Future: auto-reshard invalidators, non-Postgres sources, in-cache server-side permission evaluation (connects to systems/figma-permissions-dsl).
  • systems/dbproxy-figma — Figma's Go service between the application layer and PGBouncer that makes horizontal sharding on RDS Postgres possible. Three-stage query engine (parser → logical planner → physical planner), topology library with <1s backwards-compatible updates, single-shard pushdown + scatter-gather for cross-shard queries, deliberately-restricted sharded-query subset (~90% coverage, no cross-colo joins / no joins off the shard key), feature-flagged per-table rollout gating. Load-shedding + request hedging + transaction support scoped to single shards (no atomic cross-shard transactions — product resilient to partial-commit failures). Shipped first horizontally-sharded table September 2023 with 10s partial primary availability, no replica impact.
  • RDS Postgres substrate — 2020 baseline: single Postgres on AWS's largest instance. End-of-2022: a dozen vertically-partitioned RDS Postgres databases (table-groups like "Figma files" / "Organizations") with caching + read replicas. ~100× database-stack growth since 2020. 2022 onward: horizontal sharding built on top of the vertical-partitioning substrate, keeping RDS Postgres unmodified and extending it via Postgres views as logical shards + DBProxy as the router. Explicit build-vs-buy rejection of CockroachDB / TiDB / Spanner / Vitess / NoSQL migration on 18-month runway pressure + existing operational expertise; choice flagged for future re-evaluation once runway is bought.
  • systems/figma-response-sampling — Figma's in-house sensitive-data-exposure detection system, shipped as an async middleware in the Ruby application server's after filter. Inspects a configurable uniform-random fraction of outbound API responses in both staging and production — the detection-in-depth layer atop PermissionsV2. Phase 1 (Permission Auditor): regex-matches file identifiers in JSON bodies (high-entropy capability tokens) → enqueues async PermissionsV2 re-checks of user × identifier → logs unexpected decisions. Phase 2 (Sensitive Data Analyzer): generalizes to any column tagged banned_from_clients by FigTag via an ActiveRecord callback that records loaded sensitive values into request-local storage on sampled requests; the after filter compares serialized JSON against the recorded set. Cross-service integration: LiveGraph posts sampled responses to an internal endpoint that funnels into the same analytics warehouse + triage dashboards. Non-blocking on the hot path, rate-limited pipeline, dynamic allowlist for intentional-safe exposures (patterns/dynamic-allowlist-for-safe-exposure). Architectural choice of app-server middleware over an Envoy proxy — the app tier has the authenticated user, the response body, and in-process access to PermissionsV2 in one place. Production findings within days of rollout: over-returned file IDs, legacy paths that bypassed permission checks, long-unused leaking fields, list endpoints missing per-item access verification. Canonical wiki anchor for platform-security mindset applied to application surfaces (sources/2026-04-21-figma-visibility-at-scale-sensitive-data-exposure).
  • systems/figtag — Figma's internal data-categorization tool: annotates every database column with a sensitivity category, stored in a central schema and propagated to the data warehouse — so column sensitivity is queryable at both application runtime and offline analytics time. The specific category banned_from_clients is the signal used by Response Sampling's Phase 2 to flag fields that must not appear in API responses under normal circumstances (security identifiers, billing, PII). Canonical instance of field-level sensitivity tagging + the patterns/field-level-sensitivity-tagging pattern (central schema + runtime-queryable + warehouse-propagated + consumed by many enforcement systems without per-system allowlist maintenance). Integration substrate for both application-layer enforcement (ORM callback → request-local storage → after filter comparison) and warehouse-layer controls. Internal details not disclosed: authoring UX, full category set, propagation consistency, coverage / drift mechanisms (sources/2026-04-21-figma-visibility-at-scale-sensitive-data-exposure).
  • systems/figma-permissions-dsl — Figma's in-house authorization engine (early 2021 onward), replacing a Ruby-monolith has_access? function. Three decoupled components: a policy DSL authored in TypeScript + compiled to JSON-serializable ExpressionDef (triples composed by and/or/not, field references as "table.column" strings); an ApplyEvaluator implemented per language (Ruby / TypeScript / Go) against a shared test suite, returning true / false / null (indeterminate); a DatabaseLoader owning data fetching via a context_path resource-addressing map. Evaluation uses patterns/deny-overrides-allow + patterns/progressive-data-loading — load dependencies in batches, short-circuit on determinable verdict — "more than halved" total evaluation time. Static-analysis linter in CI catches known-buggy policy shapes at PR time (patterns/policy-static-analysis-in-ci). React front-end debugger + CLI debugger built on the same evaluator. Design inspired by IAM; OPA / Zanzibar / Oso evaluated and rejected.
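The hybrid lexical + vector query path in the AI-search bullet above (per-index min-max normalization, exact-match boosting, score-ordered interleaving) can be sketched minimally. All names here (`merge_hybrid`, the boost constant, the toy hit lists) are illustrative assumptions, not Figma's code:

```python
def min_max_normalize(hits):
    """Rescale scores within one index to [0, 1] so the two indexes are comparable."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
    return [(doc, (s - lo) / span) for doc, s in hits]

def merge_hybrid(lexical_hits, vector_hits, exact_terms, boost=0.5):
    """Combine two (doc_id, score) lists; exact lexical matches get a boost."""
    combined = {}
    for doc, s in min_max_normalize(lexical_hits):
        if doc in exact_terms:
            s += boost  # exact-match boosting
        combined[doc] = max(combined.get(doc, 0.0), s)
    for doc, s in min_max_normalize(vector_hits):
        combined[doc] = max(combined.get(doc, 0.0), s)
    # Interleave the two result sets by descending normalized score.
    return sorted(combined, key=lambda d: -combined[d])

lexical = [("login-frame", 12.0), ("signup-frame", 8.0)]   # BM25-ish scores
vector = [("auth-screen", 0.91), ("login-frame", 0.88)]    # cosine similarities
print(merge_hybrid(lexical, vector, exact_terms={"login-frame"}))
# → ['login-frame', 'auth-screen', 'signup-frame']
```

The point of the per-index normalization is that raw lexical and vector scores live on incomparable scales; normalizing each list before merging is what makes a single interleaved ranking meaningful.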
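The Materializer's reactivity model described above — dependencies recorded implicitly as blueprints read document data, writes pushing invalidations to exactly the derived values that read them — can be illustrated with a toy sketch. Every class and name here is hypothetical, not Figma's C++ implementation:

```python
class Doc:
    def __init__(self):
        self.data = {}          # (node, prop) -> value
        self.deps = {}          # (node, prop) -> set of derived keys that read it
        self._tracking = None   # derived key currently materializing

    def read(self, node, prop):
        if self._tracking is not None:  # implicit dependency capture on read
            self.deps.setdefault((node, prop), set()).add(self._tracking)
        return self.data.get((node, prop))

    def write(self, node, prop, value, dirty):
        self.data[(node, prop)] = value
        dirty |= self.deps.pop((node, prop), set())  # push-based invalidation

class Materializer:
    def __init__(self, doc):
        self.doc, self.blueprints, self.derived, self.dirty = doc, {}, {}, set()

    def register(self, key, blueprint):
        self.blueprints[key] = blueprint
        self.dirty.add(key)

    def flush(self):
        while self.dirty:
            key = self.dirty.pop()
            self.doc._tracking = key        # record reads made by this blueprint
            self.derived[key] = self.blueprints[key](self.doc)
            self.doc._tracking = None

doc = Doc()
doc.data[("main", "width")] = 100
m = Materializer(doc)
m.register("instance/main", lambda d: d.read("main", "width") * 2)
m.flush()
assert m.derived["instance/main"] == 200
doc.write("main", "width", 150, m.dirty)   # dirties only the derived reader
m.flush()
print(m.derived["instance/main"])  # → 300
```

Note the contrast with the pull-based alternative the post says was rejected: here the write knows exactly which derived values to dirty, so nothing reconstructs dependency chains on read.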
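Reachability-based subscription from the QueryGraph bullet reduces to a few lines: a session's subscription set is the transitive closure of its loaded page over dependency edges, and an edit fans out only to sessions whose closure contains the edited node. A minimal sketch with illustrative node names:

```python
def closure(root, edges):
    """Transitive closure of root over dependency edges (node -> deps)."""
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(edges.get(n, []))
    return seen

edges = {"page1": ["frameA"], "frameA": ["componentX"], "page2": ["frameB"]}
sessions = {"alice": closure("page1", edges), "bob": closure("page2", edges)}

def fan_out(edited_node):
    """Server-side filter: deliver the edit only to sessions that can reach it."""
    return sorted(s for s, subs in sessions.items() if edited_node in subs)

print(fan_out("componentX"))  # → ['alice'] — bob's closure never reaches it
```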
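respc itself is Go, but the wire format its streaming RESP parser consumes is small enough to show directly. A toy Python parser for one RESP array (`*<n>\r\n` followed by n bulk strings `$<len>\r\n<bytes>\r\n`) — an illustration of the format only, not respc's design:

```python
def parse_resp_array(buf):
    """Parse a single complete RESP array of bulk strings from bytes."""
    assert buf[:1] == b"*"
    head, rest = buf.split(b"\r\n", 1)
    n = int(head[1:])                        # number of array elements
    args = []
    for _ in range(n):
        size_line, rest = rest.split(b"\r\n", 1)
        assert size_line[:1] == b"$"
        size = int(size_line[1:])            # declared bulk-string length
        args.append(rest[:size].decode())
        rest = rest[size + 2:]               # skip payload and trailing \r\n
    return args

print(parse_resp_array(b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"))  # → ['GET', 'foo']
```

The schema registry's job then starts where this ends: mapping the decoded `['GET', 'foo']` onto a typed command with annotated arguments.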
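LiveGraph's nonce-bulk-eviction trick for the 11 "hard" query shapes can be sketched as a two-layer key scheme: hard-query cache entries embed a per-easy-expr nonce in their key, so deleting the nonce orphans every hard-query key in one operation. A minimal sketch assuming a flat dict cache; not LiveGraph's implementation:

```python
import uuid

cache = {}

def get_nonce(easy_expr):
    key = ("nonce", easy_expr)
    if key not in cache:
        cache[key] = uuid.uuid4().hex  # mint a fresh nonce on first access
    return cache[key]

def put_hard(easy_expr, hard_query, rows):
    # Hard-query entries co-locate with their easy-expr and embed its nonce.
    cache[("hard", easy_expr, get_nonce(easy_expr), hard_query)] = rows

def get_hard(easy_expr, hard_query):
    return cache.get(("hard", easy_expr, get_nonce(easy_expr), hard_query))

def invalidate(easy_expr):
    cache.pop(("nonce", easy_expr), None)  # one delete orphans all hard keys

put_hard("file=42", "comments after t", ["c1", "c2"])
assert get_hard("file=42", "comments after t") == ["c1", "c2"]
invalidate("file=42")            # a new nonce is minted on the next access
print(get_hard("file=42", "comments after t"))  # → None — all hard keys orphaned
```

The orphaned entries are never enumerated; they simply become unreachable and age out, which is what makes the eviction O(1) regardless of how many hard-query results were cached.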
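The permissions-DSL bullet above can be made concrete with a minimal sketch of the ExpressionDef idea: triples composed by and/or/not, three-valued evaluation (True / False / None-indeterminate), and deny-overrides-allow resolution. The JSON field names (`op`, `exprs`, `effect`, `condition`) are guesses for illustration, not Figma's actual schema:

```python
def evaluate(expr, data):
    """Three-valued evaluator: None means undecidable from loaded data."""
    op = expr["op"]
    if op == "eq":                     # leaf triple: field, op, value
        field = data.get(expr["field"])
        return None if field is None else field == expr["value"]
    if op == "not":
        inner = evaluate(expr["expr"], data)
        return None if inner is None else not inner
    results = [evaluate(e, data) for e in expr["exprs"]]
    if op == "and":
        if False in results:           # decidable even with partial data
            return False
        return None if None in results else True
    if op == "or":
        if True in results:
            return True
        return None if None in results else False

def check(policies, data):
    """deny-overrides-allow: any matching deny wins; else any allow; else deny."""
    verdicts = [(p["effect"], evaluate(p["condition"], data)) for p in policies]
    if any(e == "deny" and v for e, v in verdicts):
        return "deny"
    if any(e == "allow" and v for e, v in verdicts):
        return "allow"
    return "deny"

policies = [
    {"effect": "allow",
     "condition": {"op": "eq", "field": "team.role", "value": "editor"}},
    {"effect": "deny",
     "condition": {"op": "eq", "field": "user.suspended", "value": True}},
]
print(check(policies, {"team.role": "editor", "user.suspended": False}))  # → allow
print(check(policies, {"team.role": "editor", "user.suspended": True}))   # → deny
```

The None path is what makes progressive data loading possible: if the verdict is already decidable from the fields loaded so far, the remaining dependency batches never load.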

Key patterns / concepts

  • concepts/conflict-free-replicated-data-type — CRDT literature (grow-only-set, LWW-register) as the well-studied foundation Multiplayer draws on; Figma relaxes the decentralization overhead because its server is the single authority per document. Canonical "CRDT-inspired but not CRDT-compliant" design.
  • concepts/operational-transform — the Google-Docs-era alternative Figma explicitly evaluated and rejected ("unnecessarily complex for our problem space"; quotes the combinatorial-state-explosion critique).
  • concepts/object-tree-document-model — DOM-like tree of objects reducible to Map<ObjectID, Map<Property, Value>>; Figma's file-schema shape and the substrate QueryGraph's dependency edges index.
  • concepts/parameter-system — "set-once-apply-across" parametrization primitive with two axes (source-of-truth location: scoped vs global; typespace: unified vs parallel); Figma's component properties and variables are the two canonical instances, Figma Sites CMS is a prospective third.
  • concepts/parameter-binding — layer-property ↔ parameter edge with the invariant "at most one parameter per bound property." Unified binding store replaces the prior parallel stores that admitted dual-binding bugs.
  • concepts/transitive-parameter-resolution — multi-hop walk through the parameter reference graph including across parameter-system boundaries (variable aliases + component-property-to-variable chains).
  • patterns/unified-typespace-consolidation — collapse parallel type definitions (e.g. VariableType.BOOLEAN vs ComponentPropType.BOOLEAN) into a single canonical typespace both subsystems reference. Structural pre-condition for cross-subsystem bindings.
  • patterns/prototype-before-production — Figma's three-client simulator playground as the research environment where the multiplayer architecture was sifted before any production-code change landed.
  • concepts/c-plus-plus-compilation-model — transitive #includes flatten into a single pre-processed mega-file per TU; compile cost proportional to that mega-file's byte count.
  • concepts/forward-declaration — declare symbol names without full definitions to break include dependencies.
  • concepts/source-map-composition — composing N single-stage source maps through a multi-stage compilation pipeline so browser breakpoints set in the first-stage source resolve correctly in the final bundle. Figma's transpiler pipeline reconstructed this across .sk → .ts → .js.
  • patterns/centralized-forward-declarations — one Fwd.h per directory with all forward declarations needed by other files in the directory; included from every header (but never from source files). Pushes forward-declaration discipline from per-author to per-directory.
  • patterns/ci-regression-budget-gate — measure a resource cost (compiled bytes, in Figma's case) in CI, warn/block on PRs that regress it. Canonically instantiated by includes.py.
  • patterns/gradual-transpiler-migration — migrate from source language A to target language B by building a transpiler, checking both in, shifting build output to B, and deleting A last. Figma's Skew → TypeScript migration is the canonical instantiation.
  • concepts/content-addressed-caching — Bazel remote cache adopted for local builds (>2 min savings when hits); framed as complementary to bytes-reduction, not a substitute.
  • concepts/write-dependency-graph — bidirectional read+write-dep graph over document nodes as the substrate for editor-capable dynamic loading. Explicit FK edges (e.g. instance → component) plus implicit edges (auto layout, frame constraints, cross-page recursive constraint/instance chains). Correctness bar: editing parity — a missing write-dep silently corrupts derived state. Figma's QueryGraph is the canonical instantiation.
  • concepts/reachability-based-subscription — session subscription set = transitive closure of the loaded page over read+write deps; edits filtered server-side by reachability; dep-edge mutations implicitly grow collaborators' subscriptions and ship newly-reachable nodes.
  • patterns/shadow-validation-dependency-graph — run a derived data structure (like QueryGraph) alongside the live authoritative path for "an extended period," reporting errors whenever the authoritative path edits a node the derived structure didn't predict. Pre-condition for flipping dynamic loading live; surfaced a cross-page recursive write-dep at Figma before production impact.
  • patterns/preload-on-request-hint — backend fires a hint to a stateful backend (Multiplayer) on the initial HTTP GET so decoding starts before the client's WebSocket connects; 300–500 ms p75 savings.
  • concepts/tight-migration-scope — change only the substrate, keep the abstraction above unchanged; two exceptions (old-behavior-match expense, one-way doors). Figma's ECS→EKS migration principle.
  • patterns/scoped-migration-with-fast-follows — tight-scope migration + explicit fast-follow list pipelined after the cutover (Keda pod-autoscaling, Vector log forwarding, Graviton, service mesh, ACK). Figma's explicit deferral-and-pipeline discipline.
  • patterns/multi-cluster-active-active-redundancy — three active EKS clusters per environment receiving real traffic; a CoreDNS destruction incident cost 1/3 of requests instead of a full outage.
  • patterns/single-source-service-definition — per-service Bazel config → CI-generated YAMLs → one-step deploy; replaced the ECS Terraform-template + separate-deploy two-step.
  • patterns/weighted-dns-traffic-shifting — per-service DNS-weight cutovers from ECS to EKS during migration.
  • patterns/load-test-at-scale — "Hello World" scaled to largest-service pod count before real workloads; surfaced Kyverno sizing as a new-pod-startup bottleneck.
  • patterns/golden-path-with-escapes — opinionated defaults with explicit customization surfaces rather than raw-YAML authoring.
  • concepts/device-trust — corporate-managed laptop holds a short-lived X.509 cert (15-day rotation at Figma) in the OS keychain; cryptographically attests hardware origin for any action it signs. The PKI posture that grounds Figma's commit-signing security model.
  • concepts/commit-signing — Git's three pluggable signer families (GPG / SSH / S/MIME-X.509). Figma uses the S/MIME path because that's how its device-trust PKI plugs into Git.
  • patterns/signed-commit-as-device-attestation — reuse device-trust X.509 certs to sign Git commits, then verify on push so only code originating from a trusted company MacBook can merge. Canonical Figma instance.
  • patterns/wrapper-script-arg-injection — tiny shell wrapper registered as Git's signer program that ignores the args Git passes and invokes the real signer with dynamic-lookup args computed at invocation time. Bridges Git's static user.signingkey contract with 15-day rotating device-trust certs. user.signingkey left deliberately blank.
  • patterns/webhook-triggered-verifier-lambda — GitHub push webhook → AWS Lambda Function URL → stateless cryptographic verification → GitHub commit status. The verification half of Figma's commit-signature system.
  • concepts/connection-multiplexing — decouple upstream Redis connection count from client-fleet elasticity by interposing a proxy tier. Absorbs asymptotic connection-ceiling pressure and thundering-herd new-connection storms during rapid client-fleet scale-ups. FigCache is the canonical Figma instance.
  • patterns/caching-proxy-tier — stateless protocol-native proxy fleet in front of a cache fleet. Responsibilities absorbed: connection multiplexing, multi-cluster routing, topology-change absorption, command-semantics-aware guardrails, inline data transformations, uniform observability, cluster-mode emulation. FigCache is the canonical Figma instance.
  • patterns/protocol-compatible-drop-in-proxy — proxy speaks the backend's native wire protocol (RESP) + cluster-mode emulation shim → migration is a one-line endpoint config change. The integration pattern that made FigCache's reversible, feature-flag-gated per-workload rollout possible.
  • patterns/starlark-configuration-dsl — Starlark program evaluated at init-time in a VM rendering a typed Protobuf config the core server consumes. FigCache uses this to let operators compose engine trees (Router + Redis + Static primitives) in pure config — no server-binary redeploys for routing/guardrail changes. Canonical Figma instance.
  • concepts/vertical-partitioning — split groups of related tables onto separate DB instances, each table still whole on its host. Figma's 2020–2022 scaling lever (≈12 vertical partitions by end of 2022). Stepping stone to horizontal sharding — the 1→1 failover machinery operated during vertical partitioning de-risks horizontal sharding's 1→N physical failover.
  • concepts/horizontal-sharding — split a single table's rows across multiple physical DBs. Figma's 2022–present effort on top of vertically-partitioned RDS Postgres. Shipped first sharded table September 2023 with 10s partial primary availability.
  • concepts/shard-key — Figma picks a small set of shard keys (UserID, FileID, OrgID) rather than forcing a single universal key; hash-of-shard-key routing for uniform distribution (trades off range-scan efficiency on the shard key).
  • concepts/logical-vs-physical-sharding — decouple serve-as-if-sharded application behavior from actual data movement; Figma's central de-risking move. Canonical instance: per-shard Postgres views + feature-flagged DBProxy rollout, seconds-rollback, before the 1→N physical failover.
  • concepts/scatter-gather-query — queries without a shard-key predicate fan out to every shard → same load as unsharded → scale cap. Figma's DBProxy deliberately restricts the sharded-query language to avoid worst-case scatter-gather complexity.
  • patterns/sharded-views-over-unsharded-db — Figma's logical-shard representation: per-shard Postgres views + per-shard connection poolers over the same unsharded physical instance. <10% worst-case view overhead validated against a production query corpus + shadow-reads framework.
  • patterns/shadow-application-readiness — run the logical planner against live production traffic (logged to Snowflake), offline-analyze query plans, pick the supported sharded-query subset covering 90% of queries without worst-case engine complexity. Canonical "API scoping from real traffic."
  • patterns/colocation-sharding — group related tables that share a shard key into a "colo" that shares physical layout and supports cross-table joins + full transactions when scoped to a single shard-key value. Narrows the router's scope while preserving relational semantics for the common case.
  • concepts/permissions-dsl — Figma built an in-house DSL rather than using OPA / Zanzibar / Oso after evaluating all three; design inspired by IAM policies (effect + action + resource + condition).
  • concepts/data-policy-separation — ApplyEvaluator + DatabaseLoader split. Policy authors name fields ("team.permission"), engine owns data fetching. Pre-DSL, entangled ActiveRecord-calls-in-policy made permissions checks ~20% of Figma's database load.
  • concepts/json-serializable-dsl — ExpressionDef as plain JSON (triples + and/or) enables 2–3-day new-language evaluators, trivial cross-platform consistency, recursive-walk static analysis, and unlocks the CI linter + debuggers.
  • concepts/three-valued-logic — true / false / null indeterminate evaluator return allows early exit when policy verdict is decidable from partial data.
  • patterns/expression-def-triples — the triples + and/or/not boolean-logic DSL shape (also seen in Elasticsearch / MongoDB query languages). Figma's ExpressionDef is the canonical production instance.
  • patterns/deny-overrides-allow — effect-resolution rule (IAM / Cedar / Figma): any deny policy matching → deny; otherwise any allow matching → allow; default deny.
  • patterns/progressive-data-loading — partition the declared dependency set by heuristic (most-commonly-determining first: file / folder / team roles), load a batch, evaluate, exit early if the three-valued verdict is non-null. "More than halved" total evaluation time.
  • patterns/policy-static-analysis-in-ci — build-time linter rejects policy filters matching known-buggy patterns (e.g. field = ref without a sibling field <> null). Explicitly chosen over runtime engine enforcement to preserve evaluator simplicity as the cross-platform invariant.
  • patterns/policy-proof-of-concept-branch — de-risk a policy-engine rewrite by porting every existing rule to the new model on a throwaway branch, running the entire legacy test suite green. Surfaces both the new model's inadequacies (Figma's first PoC did) and hidden product decisions accumulated over years.
  • patterns/consistency-checkers — SQL-based invariant tests comparing expected system state against data recorded across multiple sources of truth on a pre-defined cadence, in both dev and prod; two flavours (data-quality checks validate stored-data-matches-product-state; code-logic checks validate application-behaviour-matches-business-rules); routed structured alerts (rows + metadata) to owning team on violation. Figma Billing's framework, built for the 2025 billing-model refresh, generalised beyond Billing to product security, access/identity management, and growth teams (e.g. connected projects). Canonical wiki instance; Slack Engineering cited as cross-company prior art.
  • patterns/data-application-productization — the bespoke-analysis-to-shared-tool arc: when a cross-system derivation repeats across multiple teams, wrap it in a small durable application that encodes the single-source-of-truth interpretation. Figma's Invoice Seat Report (now "one of the most-viewed data applications at Figma", reconstructs each seat-charge narrative for Support / Order Management / enterprise specialists / billing engineers) and consistency checkers themselves both trace this arc.
  • concepts/small-map-as-sorted-vec — small, schema-bounded associative containers represented as flat sorted Vec<(K, V)> instead of balanced-tree maps. Rust BTreeMap → Vec<(u16, u64)> on Multiplayer's per-node property map (average ~60 entries, schema-bounded <200) cut large-file memory ~25% and sped up deserialization despite the asymptotic regression — cache locality dominates at small N.
  • concepts/tagged-pointer — pack a 16-bit field ID into the unused top 16 bits of x86-64 u64 pointers, collapsing Vec<(u16, u64)> to Vec<u64>. Benchmarked by Figma's Rust team as a follow-up to the flat-vector rewrite; delivered only ~5% additional RSS vs the simple vector approach (allocated-vs-RSS divergence), and Rust refcount-through-masked-pointer correctness overhead didn't justify shipping — shelved as known-feasible optionality.
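The tagged-pointer packing above can be sketched in a few lines. This is a hedged illustration, not Figma's Rust implementation — `pack`, `unpack`, and the 48-bit mask are assumptions about the x86-64 layout the post describes (current user-space pointers occupy only the low 48 bits of a u64):

```python
# Illustrative sketch of tagged-pointer packing: a 16-bit field ID stored in
# the otherwise-unused top 16 bits of a 64-bit pointer value.

PTR_BITS = 48
PTR_MASK = (1 << PTR_BITS) - 1  # low 48 bits hold the actual pointer

def pack(field_id: int, ptr: int) -> int:
    """Collapse a (u16 field ID, u64 pointer) pair into one u64."""
    assert 0 <= field_id < (1 << 16)
    assert ptr & ~PTR_MASK == 0, "pointer must fit in 48 bits"
    return (field_id << PTR_BITS) | ptr

def unpack(tagged: int) -> tuple[int, int]:
    """Recover the field ID and the original pointer (mask before dereferencing)."""
    return tagged >> PTR_BITS, tagged & PTR_MASK
```

Any code that dereferences or refcounts through the tagged value must mask first — exactly the correctness overhead that, per the post, made Figma shelve the approach.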

Recent articles

  • 2026-04-21 — sources/2026-04-21-figma-server-side-sandboxing-an-introduction (Security-engineering Part 1 of 3 — the umbrella intro to Figma's sandboxing series. Frames the why before the how: memory-unsafe C/C++ libraries processing user-supplied images / documents / SVGs are hostile-input-on-the-inside and keep shipping memory-corruption CVEs — canonical motivating example is ImageTragick (2016), an ImageMagick RCE that hit every server running it on user-supplied images. Figma's posture: "Buggy software is a fact of life … it's nearly impossible to prevent all vulnerabilities," so sandbox as defence-in-depth, accept compromise will happen, bound the blast radius. Rewriting in memory-safe languages is considered and rejected as a primary strategy ("require pulling resources away from other critical security projects … no methods are foolproof") — additive, not exclusive, to sandboxing. Introduces the three-primitive table (VMs / containers / seccomp) and the four-axis decision questionnaire (environment / security + performance / development cost + friction / maintenance + operational overhead). Client-side sandboxing (WebAssembly) is named as orthogonal. Parts 2 + 3 deep-dive the specific rows. No new wiki pages — this is the intro; all concepts / systems / patterns already exist on-wiki from the two companion ingests.)
  • 2026-04-21 — sources/2026-04-21-figma-visibility-at-scale-sensitive-data-exposure (Figma's security-engineering team describes Response Sampling — a two-phase detection system that inspects a configurable fraction of outbound Ruby API responses for sensitive-data exposure, running in both staging and production as the detection-in-depth complement to Figma's preventive authorization stack (PermissionsV2 + negative unit tests + E2E + pentest + bug bounty). Phase 1 (Permission Auditor): Ruby after filter samples at config rate, parses JSON response body, regex-matches file identifiers ("high-entropy capability tokens with a known character set and consistent length"), enqueues async jobs that re-verify user × identifier against PermissionsV2. Phase 2 (Sensitive Data Analyzer, "fancy Response Sampling"): generalizes to any column tagged banned_from_clients by FigTag — Figma's internal column-level data classification tool. Precision trick: an ActiveRecord callback on sampled requests records loaded sensitive values into request-local storage; the after filter compares the serialized JSON against those exact values (avoids coincidental-match FPs; scopes overhead to sampled requests only). Cross-service: LiveGraph submits sampled responses to the same internal endpoint — shared schema + logging path + triage dashboards across services. Explicit architectural choice: middleware in the Ruby app server, not an Envoy proxy — the app tier has the authenticated user object + full response body + in-process access to PermissionsV2; doing it at a proxy would make user-aware permission evaluation "significantly harder". Non-blocking everywhere: sampling/verification failures never fail the user request; rate-limited pipeline bounds overhead. 
Dynamic allowlisting (patterns/dynamic-allowlist-for-safe-exposure) handles intentional safe exposures (canonical example: an OAuth client secret legitimately returned by a dedicated credential-management endpoint but a critical finding anywhere else). Production findings within days: file identifiers returned unnecessarily (data-filtering fix), paths that bypassed permission checks entirely (gaps closed), long-unused fields leaking into responses (targeted fix), list endpoints that verified parent access but not per-item access (per-item checks added). Stated meta-frame — platform-security mindset applied at the application layer (patterns/platform-security-at-application-layer): "treating our application surfaces like infrastructure and layering continuous monitoring and detection controls on top. By applying techniques typically reserved for lower-level systems to our application layer, we were able to gain continuous visibility into how data moves through our products, without slowing development." Operational lessons: tune sampling rates + run async for performance; manage FPs with dynamic allowlisting + rigorous triage (concepts/false-positive-management) or they manage you; context matters (not all exposures equally severe — dynamic config tunes without redeploy); layered defense across staging + prod. Introduces systems/figma-response-sampling, systems/figtag; concepts concepts/sensitive-data-exposure, concepts/response-body-sampling, concepts/detection-in-depth, concepts/data-classification-tagging, concepts/false-positive-management; patterns patterns/response-sampling-for-authz, patterns/platform-security-at-application-layer, patterns/async-middleware-inspection, patterns/dynamic-allowlist-for-safe-exposure, patterns/field-level-sensitivity-tagging; extends systems/figma-permissions-dsl (now cited as the target of detection-layer spot-checks, not just the preventive engine) and systems/livegraph (cross-service sampling contributor). 
Numbers not disclosed: sampling rate, QPS, p50/p99 overhead, TP/FP rate, async-job substrate details, FigTag tagging coverage / propagation latency. Sibling security-engineering posts from the same 2026-04-21 batch: VM sandboxing, containers + seccomp, Santa rollout, device-trust commit signing, PermissionsV2 DSL — all six form Figma's security-engineering narrative at production scale.)
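The Phase 1 flow above (sample → parse → regex-match identifiers → async re-verify) can be sketched as follows. A minimal Python illustration, not Figma's Ruby after filter; the regex, the default sample rate, and the `enqueue_verification` callback are all assumed stand-ins, since none of those specifics are disclosed:

```python
import json
import random
import re

# Illustrative stand-in for "high-entropy capability tokens with a known
# character set and consistent length" (the real pattern is not disclosed).
FILE_IDENTIFIER_RE = re.compile(r"\b[0-9A-Za-z]{22}\b")

def sample_response(user_id, response_body, enqueue_verification, sample_rate=0.01):
    """Sampled, non-blocking inspection of an outbound JSON response.

    Nothing here may fail the user request, so everything is best-effort,
    and the actual permission re-check runs in an async job.
    """
    if random.random() >= sample_rate:
        return  # overhead scoped to sampled requests only
    try:
        normalized = json.dumps(json.loads(response_body))
    except ValueError:
        return  # unparseable body: skip silently rather than block the response
    for identifier in set(FILE_IDENTIFIER_RE.findall(normalized)):
        # Async job re-verifies user x identifier against the authz engine.
        enqueue_verification({"user": user_id, "identifier": identifier})
```

The dynamic allowlist and the request-local exact-value comparison from Phase 2 would layer on top of this same hook.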

  • 2026-04-21 — sources/2026-04-21-figma-rendering-powered-by-webgpu (Figma's year-long C++/WASM canvas renderer migration from WebGL to WebGPU while keeping WebGL as a peer backend. Five substrate-level projects: (1) graphics-API abstraction-layer redesign (patterns/graphics-api-interface-layer) — reshape the interface around WebGPU's explicit-state model; fixed latent WebGL bugs by making draw state explicit (concepts/explicit-graphics-state). (2) shader translator pipeline (patterns/shader-source-translator-pipeline) — custom in-house preprocessor (normalizes older WebGL-1 GLSL to a newer dialect + #include modularity + input metadata extraction) feeding naga to emit both newer GLSL (for WebGL) and WGSL (for WebGPU). Single shader source, two backends, zero drift. (3) uniform buffer batching — encode/submit split in the graphics interface amortizes WebGPU's per-uniform GPU-memory allocation cost; naïve per-uniform mapping would have regressed performance. Uniform buffers are the WebGPU requirement that forced the redesign. (4) shared C++ across browser + native via Dawn — Emscripten compiles the C++ renderer to Wasm for browsers (migrating to Dawn's emdawnwebgpu), and the same code compiles natively (x64/arm64) for server-side rendering via direct Dawn linkage. One graphics-API surface, two targets. RenderServer / thumbnail-generation path is the named server-side consumer. (5) production rollout with dynamic fallback — telemetry-driven device blocklist (seeded from compatibility probes and expanded from fallback-rate telemetry) plus mid-session WebGPU→WebGL fallback (extends existing context-loss / device-lost handlers to swap backends rather than recreate the same one) closed the rollout loop after mid-session WebGPU failures appeared on Windows. 
Quirks: sync-vs-async readback — WebGL's sync readPixels powered Figma's load-time compatibility probes; WebGPU's async-only readback would have added "hundreds of milliseconds" to startup, forcing a shift to non-load-blocking post-session probing. Outcome: "performance improvement when using WebGPU on some classes of devices, and more neutral results on others, but no regressions." Compute shaders (concepts/compute-shader), MSAA, and RenderBundles are named as future-work wins that WebGL simply could not provide — the actual rationale for the migration. Sibling posts in the 2026-04-21 batch: game-engine framing (the C++/WASM canvas leg this deep-dives) and Rust multiplayer-server memory work (the Rust server leg, a different leg of the same three-language stack).)

  • 2026-04-21 — sources/2026-04-21-figma-how-figma-draws-inspiration-from-the-gaming-world (Figma engineering framing piece positioning the client architecture as a game-engine stack adapted for the browser rather than a web stack — canonical umbrella post for the 2026-04-21 batch. Three-language split (patterns/game-engine-stack-for-web-canvas): C++ compiled to WebAssembly for the canvas (rendering, object graph, layout, physics — the 2017 migration cut load time 3× per the linked post), React + TypeScript for the UI shell (explicit rationale: "C++ does not come with great libraries for delightful and responsive user interface engineering"), Rust for the multiplayer server ("better developer ergonomics than C++"). Systems-as-building-blocks vocabulary (concepts/game-engine-architecture) borrowed from game engines — every feature lands as a named system (multiplayer, collision, animation, chat/audio, auto-layout, component/variant, widget/plugin) with a defined boundary. The architecturally load-bearing production story is a disclosed cross-subsystem cascade (concepts/interdependent-systems): a six-month-old PR in the layout subsystem corrupted FigJam connector attachment state → connectors oscillated across collaborators → produced "a huge number of multiplayer messages, which overloaded the multiplayer and autosave systems." Code audit of the connector subsystem revealed nothing; debug-message instrumentation three subsystems away found the root cause. Canonical instance of "one part of the code causes a bug in an entirely different part of the codebase" landing on a real-time collaboration server (systems/figma-multiplayer-querygraph extended with the Cross-system failure mode section). 
Tables-in-FigJam multiplayer UX introduced patterns/observer-vs-actor-animation: render the same edit twice — live feedback tied to the initiating user's mouse (snappy, no animation) plus animation for observing users (smoothed canvas transitions so remote edits don't jump); plus rubber-band drag-limits explicitly analogized to games' invisible walls. Designer prototype (Jakub Świadek) + engineer framework build (Tim Babb). Collaborative-first product constraint explicitly rejects the lock-during-edit fallback. Keyboard-navigation accessibility layer (partnered with Figma's accessibility team) framed as a game-engine-style control-system concern (parallel to game-controller support including the Xbox Adaptive Controller). Introduces concepts/game-engine-architecture, concepts/web-assembly, concepts/interdependent-systems, patterns/game-engine-stack-for-web-canvas, patterns/observer-vs-actor-animation; extends systems/figma-multiplayer-querygraph (new Cross-system failure mode + Game-engine framing sections), systems/react, systems/typescript (stack-split rationale), concepts/object-tree-document-model (substrate for the cross-node implicit coupling). No latency / QPS / memory / cost / rollout numbers disclosed — narrative / retrospective / hiring-adjacent. Sits as the architectural umbrella under which the other 2026-04-21 Figma posts — LiveGraph 100x, FigCache, [[systems/figma-materializer|Materializer]], AI Search infra, Permissions DSL, etc. — all sit as named systems in the game-engine sense.)
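The observer-vs-actor idea above reduces to rendering the same edit two ways. A minimal sketch under stated assumptions — hypothetical names, and exponential smoothing chosen here only as a representative easing technique; Figma's animation framework is not disclosed:

```python
import math

def apply_edit(node, target_x, is_actor, dt, rate=12.0):
    """Render the same edit twice: snap for the actor, smooth for observers."""
    if is_actor:
        # Live feedback tied to the initiating user's mouse: no animation.
        node["x"] = target_x
    else:
        # Ease observers toward the remote edit so it doesn't jump across
        # their canvases; `rate` controls how snappy the smoothing feels.
        alpha = 1.0 - math.exp(-rate * dt)
        node["x"] += (target_x - node["x"]) * alpha
```

Called every frame with the frame's `dt`, the observer path converges on the actor's position without the discontinuous jumps the post describes.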

  • 2026-04-21 — sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma (Infrastructure companion to the earlier "How We Built AI Search" post. Embeddings via open-source CLIP (multimodal text+image, same space → one index serves both query modes). Pipeline = four discrete queued jobs over DynamoDB (metadata + embeddings) + S3 (thumbnails) + SageMaker (batched inference) + OpenSearch k-NN (vector index). Cost optimisations that dominated: Ruby → C++ rewrite of frame enumeration + thumbnailing; GPU → CPU llvmpipe software rendering on newer instances; edit-quiescence (4h) cuts to 12% of data; corpus halved by excluding drafts + in-file dups + unmodified copies; vector quantization (concepts/vector-quantization) shrinks in-memory k-NN. Query = hybrid lexical+vector, min-max normalized per index, exact-match boosted, interleaved (patterns/hybrid-lexical-vector-interleaving). Two candidly-reported OpenSearch bugs: segment-replication replica non-determinism (upstream fix in k-NN PR #1808); _source slimming wiped embeddings on update, fix = re-fetch from DynamoDB (patterns/source-field-slimming-with-external-refetch). Scale driver: "small percentage of users onboarded → convergent full-fleet indexing" because teams are small and numerous. Introduces systems/clip-embedding-model, concepts/vector-quantization, patterns/pipeline-stage-as-discrete-job, patterns/hybrid-lexical-vector-interleaving, patterns/source-field-slimming-with-external-refetch; extends systems/figma-ai-search with a full Infrastructure section.)
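The query-side combination above (min-max normalize per index, then interleave the two rankings) might look like this sketch. Assumed shapes only — exact-match boosting is omitted, and this is not Figma's OpenSearch implementation:

```python
from itertools import zip_longest

def min_max_normalize(scores):
    """Normalize one index's scores to [0, 1] so the two indexes are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against a single-score index
    return {doc: (s - lo) / span for doc, s in scores.items()}

def interleave(lexical_scores, vector_scores, k=10):
    """Alternate picks from the lexical and vector rankings, deduplicating."""
    def rank(scores):
        normalized = min_max_normalize(scores)
        return sorted(normalized, key=normalized.get, reverse=True)

    merged, seen = [], set()
    for lex_doc, vec_doc in zip_longest(rank(lexical_scores), rank(vector_scores)):
        for doc in (lex_doc, vec_doc):
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:k]
```

Per-index normalization matters because BM25-style lexical scores and cosine-style vector scores live on incompatible scales; interleaving then avoids having to pick a single weighted blend.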

  • 2026-04-21 — sources/2026-04-21-figma-supporting-faster-file-load-times-memory-optimizations-rust (Rust team's server-side memory optimizations on Multiplayer after dynamic page loading drove ~30% more server-side file decode volume. Hot data structure: per-node Map<u16_property_id, u64_pointer> — memory profiling showed it was >60% of per-file memory despite storing only metadata. Fix: replace Rust's BTreeMap with a flat sorted Vec<(u16, u64)>. Schema-bounded key domain (<200 fields, average ~60 per node, entries arrive sorted on the wire) makes the vector's asymptotic O(n) insert never trigger on the load path, and cache-locality wins flip the theoretically-worse container into a practically-faster one: ~25% memory drop on large files, deserialization faster despite the Big-O regression. Second experiment — pointer tagging packing 16-bit field IDs into the unused top 16 bits of x86-64 pointers — delivered marginally faster benchmarks + ~5% lower RSS vs simple Vec (not 20% — RSS-vs-allocated divergence, same lesson as Datadog Go 1.24 runtime/metrics-vs-/proc/smaps); not productionized because refcount-through-masked-pointer correctness wasn't worth the win. Net fleet outcome from shipping just the flat-vector change: +20% p99 file deserialization performance, ~20% memory-cost reduction across the entire Multiplayer fleet. Companion to the 2024 dynamic-page-loading post — that one cut client memory ~70% and slow-file p99 ~33%; this one fixes the server that feeds the client.)
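The BTreeMap → flat-sorted-vector swap can be sketched as a tiny Python analogue (the real code is Rust; `FlatMap` is an illustrative name): binary search over one contiguous array, with inserts kept sorted.

```python
import bisect

class FlatMap:
    """Small schema-bounded map as a flat sorted list of (key, value) pairs.

    At ~60 entries, binary search over one contiguous array tends to beat
    pointer-chasing a balanced tree, despite the asymptotically worse insert.
    """
    def __init__(self):
        self._pairs = []  # always sorted by key

    def insert(self, key, value):
        i = bisect.bisect_left(self._pairs, (key,))  # first pair with pair[0] >= key
        if i < len(self._pairs) and self._pairs[i][0] == key:
            self._pairs[i] = (key, value)            # overwrite existing key
        else:
            self._pairs.insert(i, (key, value))      # O(n) shift; cheap at small n,
                                                     # and append-only when keys
                                                     # arrive sorted on the wire
    def get(self, key, default=None):
        i = bisect.bisect_left(self._pairs, (key,))
        if i < len(self._pairs) and self._pairs[i][0] == key:
            return self._pairs[i][1]
        return default
```

The load-path property from the post is visible here: when entries arrive already sorted, every `insert` lands at the end and never triggers the O(n) shift.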
  • 2026-04-21 — sources/2026-04-21-figma-redefining-impact-as-a-data-scientist (Figma Billing DS team's impact-framing post with two architecturally-real outputs: consistency checkers — SQL-based invariant tests (patterns/consistency-checkers) in two flavours (data-quality checks validate stored-data-reflects-product-state; code-logic checks validate application-behaviour-matches-business-rules), run in both dev and prod on a pre-defined cadence across unified product-log + billing-state + payment-processing + CRM data, routing structured alerts (rows + metadata) on violation; built for the 2025 billing-model refresh, since "adopted beyond the billing team, powering data-integrity and code-logic checks across product security, access and identity management, and other growth teams" (connected projects the named concrete reuse); Slack Engineering cited as cross-company prior art. Invoice Seat Report — a data application (patterns/data-application-productization) reconstructing seat-charge narratives by pulling product events + contract metadata + billing rules + past state transitions into plain-language explanations; now "one of the most-viewed data applications at Figma," shared across Support / Order Management / enterprise specialists / billing + monetisation engineers. Rest of the article is DS-role / "redefining impact" framing (pie charts traditional-vs-actual work mix) — ingested narrowly on the two-tool architectural substance. No numbers disclosed.)
  • 2019-10-11 — sources/2019-figma-how-figmas-multiplayer-technology-works (Resurfaced 2025-08-16, HN 176. Foundational description of Multiplayer architecture: client-server over WebSocket, one server process per multiplayer document as the single authority that lets Figma simplify CRDTs, client download-on-open + offline-edit replay on reconnect, documents-only-in-Multiplayer (comments/users/teams in Postgres with separate sync system), object-tree document data model reducible to Map<ObjectID, Map<Property, Value>> — the substrate QueryGraph's dependency edges later index. Explicit rejection of Operational Transforms ("unnecessarily complex for our problem space"; quotes state-explosion critique). CRDT-inspired but not CRDT-compliant: grow-only-set
    + LWW-register as building blocks, decentralization overhead stripped because server is authoritative. Methodology: three-client simulator playground prototyped the architecture before any production-code change. Raw capture truncates before the algorithm-details sections.)
  • 2024-04-27 — sources/2024-04-27-figma-speeding-up-c-build-times (C++ cold build times cut ~50%: custom AST tool DIWYDU catches unused includes, includes.py runs in CI as a transitive-byte regression gate, Fwd.h per directory formalizes forward-declaration discipline; 50-100 potential slowdowns prevented daily)
  • 2024-05-03 — sources/2024-05-03-figma-typescript-migration (Skew → TypeScript migration via custom transpiler + gradual rollout; source-map composition across two compile stages; three load-bearing language-semantic differences — JS array-destructuring perf, devirtualization divergence, TS init-order — drove targeted transpiler patches; bundler defines + DCE replaced Skew's compile-time conditional compilation)
  • 2024-05-22 — sources/2024-05-22-figma-dynamic-page-loading (Dynamic page loading extended from viewers to editors via QueryGraph — a bidirectional read+write dependency graph held in-memory by Multiplayer. Per-session subscribed subset = transitive closure of the initial page; edits fan out to collaborators only where reachable. Shadow-mode validation for "an extended period" flushed out a cross-page recursive write-dep. Server-side decoding (now critical path) paid for by backend preload hint (300–500 ms p75) + parallel decoding via persisted raw offsets (>40% decode-time cut). Client lazily materializes instance sublayers touching dozens of subsystems. Six-month A/B rollout: 33% speed-up on slowest loads despite files +18% YoY, 70% fewer nodes in client memory, 33% fewer out-of-memory errors)
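The subscription model in the dynamic-page-loading entry — per-session subscribed subset as the transitive closure of the initial page, edits fanned out only where reachable — reduces to plain graph traversal. A hedged sketch with illustrative names; the real QueryGraph is a bidirectional read+write graph held in-memory by Multiplayer:

```python
def subscribed_subset(deps, roots):
    """Per-session subscribed subset: transitive closure over dependency edges."""
    seen, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))  # follow outgoing dependency edges
    return seen

def fan_out(edited_node, sessions):
    """Deliver an edit only to collaborators whose subscribed subset reaches it."""
    return [sid for sid, subset in sessions.items() if edited_node in subset]
```

The shadow-mode validation the post describes would compare this reachability answer against the full-document behaviour — which is exactly how a cross-page recursive write-dependency (an edge missing from `deps`) gets flushed out.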
  • 2026-04-21 — sources/2026-04-21-figma-parameter-architecture-unification (Retrospective on unifying Figma's two parameter systems — component properties (2022, scoped) + variables (2023, global) — at the data-model and runtime levels. Parallel type definitions and parallel binding-storage had produced a user-visible dual-binding bug, doubled cost per new type, and a ceiling against cross-system bindings. Fix consolidates to a single typespace + single binding store (invariant: at most one parameter per bound property) + single four-stage runtime pipeline (tracking / invalidation / transitive resolution / update). Unlocked component-property-to-variable binding plus a speed-up on variable-mode and variable-value changes attributed to stricter property-granular invalidation; makes the upcoming Figma Sites CMS a free third parameter system on the same substrate. Sibling in-memory reactive graph to QueryGraph — both bidirectional indexes over the same object-tree document model, indexing different edge types.)
  • 2026-04-21 — sources/2026-04-21-figma-enforcing-device-trust-on-code-changes (Figma security team enforces device-trust on every Git commit merging to release branches. Every company MacBook holds a 15-day-rotating X.509 device-trust cert in the macOS Keychain; commits are S/MIME-signed with it via a Figma-modified smimesign fork (adds --get-figmate-key-id) called through a one-line wrapper bash script that ignores Git's static user.signingkey and substitutes a Keychain lookup. Verification is a GitHub App (read code + write commit status only, canonical concepts/least-privileged-access) + an AWS Lambda behind a Function URL webhook; credentials in Secrets Manager; verification uses smimesign/ietf-cms against Figma's internal CA; posts commit-integrity-verification status that branch protection requires to merge — canonical patterns/webhook-triggered-verifier-lambda. Bot commits (Dependabot + other external Apps, signed with GitHub's web-flow GPG key) pass via an author allowlist with optional diff heuristics (Dependabot touching non-dependency paths → fail). Engineer experience is just a green status check — no extra toil. Introduces concepts/device-trust, concepts/commit-signing, patterns/signed-commit-as-device-attestation, patterns/wrapper-script-arg-injection, patterns/webhook-triggered-verifier-lambda; systems systems/smimesign, systems/smimesign-figma, systems/figma-commit-signature-verification, systems/github-apps.)
  • 2026-04-21 — sources/2026-04-21-figma-rolling-out-santa-without-freezing-productivity (Figma Endpoint Security team rolled out Santa — the Google-originated macOS binary authorization tool — to 100% of laptops over ~3 months without freezing productivity. Four load-bearing design decisions: (1) monitoring-mode first — run Santa in passive mode, mine UNKNOWN-event telemetry to build SigningID + TeamID-dominated allowlist from real fleet execution before any block; (2) self-service Slack approval — block event → sync-server malware check (ReversingLabs + risk signals) → Slack app on Figma-managed devices offers approve/ignore/flag-as-malware; approval creates machine-specific rule (not fleet); >90% of steady-state blocks self-resolve; MDM-triggered santactl sync cuts enforcement latency 60s → 3s; (3) Package Rule auto-generation — config-as-code {package_type: homebrew, package: vim} — 30-min workflow on macOS runners fetches current SHA-256 from official source → ~200 Package rules expand to ~80,000 Binary rules; (4) cohort percentage rollout 10% → 25% → 50% → 70% → 98% → 100% with per-cohort inclusion criteria, final engineers/data-scientists 30% held for group-scoped permissive rules addressing Anaconda ad-hoc codesign per-machine-unique hashes; /santa disable escape hatch reverts single machine to monitoring mode during rollout, retired at 100%. Sync server is a fork of Airbnb's Rudolph. FAA (file access authz — locking browser cookies to the browser binary) shipped first as zero-workflow-impact win before lockdown. 80K-rule initial sync timeout mitigation: static allowlist for MDM/Chrome/Slack/Zoom. Steady state: ~150 allowlist + ~50 blocklist + ~80K Package-generated + ~50 PathRegex
    + ~10 Compiler + median 3 personal per user; P95 3–4 blocks per user per week. Rule-type trade-off noted (SigningID precision vs TeamID breadth, LogMeIn Team-ID example); known limitation on Compiler rules with go run race condition; TCC-permission regression defence via separate osquery-based auto-unset. Introduces systems/santa, systems/rudolph, concepts/binary-authorization, patterns/data-driven-allowlist-monitoring-mode, patterns/self-service-block-approval, patterns/package-rule-auto-generation, patterns/cohort-percentage-rollout, patterns/rollout-escape-hatch, patterns/static-allowlist-for-critical-rules.)
  • 2026-04-21 — sources/2026-04-21-figma-server-side-sandboxing-virtual-machines (Security-engineering Part 2 of 3 — how Figma thinks about server-side sandboxing (a.k.a. workload isolation) via virtual machines, and why their production instance is AWS Lambda backed by Firecracker micro-VMs. Core frame: a sandbox must answer two questions, not one — can the sandbox escape (hypervisor boundary), and if it can't, what can a compromise do with the VM's own capabilities (network egress + IAM + credential + VM lifetime)? The latter is the pattern patterns/minimize-vm-permissions. Figma's link-metadata fetcher (FigJam link previews)
    + canvas image fetcher run ImageMagick on third-party URLs inside Lambda; they sit outside the Figma production VPC with no IAM pivot into internal services, so an ImageMagick or fetch-logic exploit grants no path to Figma internals. Latency/isolation trade-off surfaced at the tenant level: AWS reuses a tenant's Firecracker VM across requests because "Firecracker offers reasonably quick VM boot times, but the overheads are still too high to pay on many core workflows." Figma accepts this as a reasonable trade for their use case. Gotchas called out: localhost Lambda runtime API is an SSRF hazard (leaks triggering request + accepts forged responses) — Figma blocks application code from making localhost requests; Lambda isn't "raw compute" (easy to over-privilege via VPC placement or IAM); reserved concurrency is a shared account+region quota. Only quantitative datapoint: first un-warmed call took up to 10 seconds before tuning. Also argues VMs are the heavyweight primitive (trade-off profile: ✅ compatibility / full-OS workloads ; ❌ cold-start and orchestration cost ; ⚖ debugging fine, cluster ops hard) and names the four sandbox-choice axes — environment, security + performance, development friction, operational overhead. Entity pages pre-existed from prior partial work and already anchor to this source; this pass completes source-page plumbing + company + index + log cross-referencing. Sibling posts: Part 1 intro ingested — see sources/2026-04-21-figma-server-side-sandboxing-an-introduction; Part 3 containers + seccomp ingested — see sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp.)
  • 2026-04-21 — sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp (Security-engineering Part 3 of 3 — containers + seccomp rows of the sandboxing primitive table. Frames container escape along three axes: kernel vulnerability (Dirty COW / Dirty Pipe family), runtime implementation bug (systems/runc / systems/docker internals), runtime configuration (operator choices — the axis VMs don't expose). See concepts/container-escape. Introduces seccomp as the narrowest isolation primitive — a syscall allowlist the process runs under — used at scale by Android / Chrome / Firefox; key limitation is that it can't dereference pointer arguments so it can't filter openat by path. Production exemplar RenderServer (C++ headless Figma editor for thumbnailing / SVG export) runs in two sandboxes by path: full GPU path in nsjail (user + pid + mount + network namespaces, no network, specific mounts only, seccomp-bpf — chosen over Docker as a drop-in alternative that avoided building an orchestrated service); non-GPU path in seccomp-only after a source-code refactor that reorders all file opens before any image processing, letting a restrictive libseccomp filter land mid-program. Seccomp-only trade-offs disclosed honestly: ✅ easier to test/debug ; ✅ significantly faster than nsjail ; ❌ locks RenderServer into single-threaded ; ❌ cannot dynamically load fonts or images later in runtime. Figma's disclosed allowlist: write to already-open fds, exit, memory allocation, current time. Rollout foot-guns: nsjail default rlimit_fsize = 1 MB silently truncated outputs for large-image inputs; seccomp allowlist needed several iterations as rare codepaths hit in production weren't exercised in testing (kernel logs only name the failing syscall, no other context). gVisor named as the middle-option technology that reduces container-attack-surface by interposing a user-space reimplemented kernel between the container and the host kernel. 
Introduces concepts/container-escape, concepts/seccomp, concepts/syscall-allowlist, concepts/linux-namespaces, concepts/kernel-attack-surface; systems systems/nsjail, systems/firejail, systems/docker, systems/runc, systems/gvisor, systems/figma-renderserver; patterns patterns/refactor-for-seccomp-filter, patterns/seccomp-bpf-container-composition.)
  • 2026-04-21 — sources/2026-04-21-figma-figcache-next-generation-data-caching-platform (Storage Products team built FigCache — a stateless, horizontally-scalable, RESP-wire-protocol proxy between Figma's applications and a fleet of ElastiCache Redis clusters — plus first-party client wrappers in Go / Ruby / TypeScript. Rolled out H2 2025 for Figma's main API service → six-nines uptime on the caching layer. Architecture: ResPC (systems/respc) streaming RESP parser + schema-driven structured command parser + implementation-agnostic dispatch in the frontend; dynamically-assembled engine tree of data engines (Redis) and filter engines (Router / Static / fanout) as the backend; entire engine tree expressed in Starlark (patterns/starlark-configuration-dsl) evaluated at init-time to render a typed Protobuf config. Core win: connection multiplexing decouples Redis connection load from client-fleet elasticity — order-of-magnitude reduction in Redis cluster connection counts post-rollout + thundering-herd scale-up class eliminated. Protocol-compatible drop-in proxy: cluster-mode emulation shim + interface-compatible client wrappers → migration is a one-line endpoint config change, gated reversibly by feature flags. Fanout filter engine transparently resolves read-only multi-shard pipelines as parallel scatter-gather (sidesteps CROSSSLOT). Uniform metrics/logs/traces per command with workload-ownership classification → incident diagnosis hours/days → minutes; formal caching-platform SLO now possible; ElastiCache operational events (node failovers, cluster scaling, transient errors) downgraded to zero-downtime background ops — shard failovers now run liberally and frequently as live resiliency exercises. Latency risk controls: weekly production stress test at ≥10× organic peak, zonal traffic colocation (patterns/zone-affinity-routing), per-PR CI CPU/mem profile
  • synthetic-benchmark gates against golden baseline. Build-vs- buy rationale: OSS Redis proxies shipped "rudimentary RPC servers" lacking structured argument extraction → blocked semantics-aware guardrails + custom commands; forks "difficult to keep in sync with upstream." No end-to-end latency numbers, throughput numbers, or cost breakdown disclosed. Introduces systems/figcache, systems/respc, concepts/connection-multiplexing, patterns/caching-proxy-tier, patterns/protocol-compatible-drop-in-proxy, patterns/starlark-configuration-dsl; extends systems/redis, systems/aws-elasticache, concepts/control-plane-data-plane-separation, patterns/zone-affinity-routing.)
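ResPC's internals aren't published beyond "streaming RESP parser"; as orientation, a minimal sketch of what parsing one RESP array-of-bulk-strings command (the wire shape of Redis requests like GET/SET) looks like, with the incomplete-buffer return a streaming frontend needs — all names hypothetical:

```python
def parse_resp_command(buf: bytes):
    """Parse one RESP array-of-bulk-strings command from a byte buffer.

    Returns (args, bytes_consumed), or (None, 0) if the buffer is
    incomplete, so a streaming frontend can wait for more bytes.
    """
    def read_line(pos):
        end = buf.find(b"\r\n", pos)
        return (None, pos) if end == -1 else (buf[pos:end], end + 2)

    line, pos = read_line(0)
    if line is None or not line.startswith(b"*"):
        return None, 0
    n = int(line[1:])                      # number of arguments
    args = []
    for _ in range(n):
        header, pos = read_line(pos)
        if header is None or not header.startswith(b"$"):
            return None, 0
        length = int(header[1:])
        if len(buf) < pos + length + 2:
            return None, 0                 # bulk string not fully buffered yet
        args.append(buf[pos:pos + length])
        pos += length + 2
    return args, pos

# "GET foo" on the wire:
wire = b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"
print(parse_resp_command(wire))  # ([b'GET', b'foo'], 22)
```

Once arguments are extracted structurally like this (rather than treated as an opaque byte stream), the semantics-aware guardrails and custom commands the post cites as the build-vs-buy motivation become possible.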
  • 2026-04-21 — sources/2026-04-21-figma-how-we-built-a-custom-permissions-dsl (Figma's engineering team rebuilt authorization from a Ruby-monolith has_access? function into a custom cross-platform declarative permissions DSL (systems/figma-permissions-dsl) starting early 2021. Four named forcing functions: (a) monolithic has_access? "a bug could leak access to every single file"; (b) hierarchical integer permission levels + boolean escape-flags produced a non-hierarchical matrix pretending to be hierarchical; (c) permissions checks were ~20% of the database load because ActiveRecord calls and policy logic were entangled in one function (concepts/data-policy-separation); (d) cross-platform drift between Sinatra (Ruby) and LiveGraph (TypeScript) permissions code was a chronic bug source. Design inspired by IAM policies (effect + action + resource + condition) but OPA, Zanzibar, and Oso were all evaluated and rejected. First PoC was a Ruby AccessControlPolicy with an imperative apply? — ported every existing rule onto a green-CI branch ([[patterns/policy-proof-of-concept-branch]]), which surfaced that attached_through was clumsy and cross-platform AST parsing of apply? was unreliable. Pivot: JSON-serializable ExpressionDef of the shape [field, op, value|ref] composed by and/or/not (patterns/expression-def-triples, concepts/json-serializable-dsl), authored in TypeScript with types/enums/composable helpers, compiling to plain JSON; three evaluator implementations (Ruby / TypeScript / Go) under a shared test suite; separate DatabaseLoader owning data fetching; a context_path map resolving which rows to query given the input (resource, user). Evaluation: [[patterns/deny-overrides-allow]] + patterns/progressive-data-loading using concepts/three-valued-logic (true / false / null indeterminate) — load dependency batches in heuristic order, short-circuit as soon as the result is determined.
Reported result: "more than halved the total execution time of our permissions evaluation." Additional ecosystem built on the simple evaluator: React front-end debugger rendering per-node truth + data in an expandable boolean tree, CLI debugger with the same output, and CI linter that walks every ExpressionDef and flags e.g. field = ref comparisons without a sibling <> null guard (explicitly chosen over runtime engine enforcement to preserve evaluator simplicity as a cross-platform invariant). Stated outcome: "we all but eliminated incidents and bugs caused by drifts in the logic between our Ruby and LiveGraph codebase." No numbers disclosed for policy count, evaluations/sec, latency distribution, or post-DSL database-load share.)
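The ExpressionDef shape plus three-valued short-circuiting is concrete enough to sketch. A hypothetical evaluator (field names, operators, and the policy itself are illustrative, not Figma's schema): triples `[field, op, value]` composed with `and`/`or`/`not`, where an unloaded field evaluates to `None` (indeterminate), so a policy can resolve before all data batches load:

```python
def evaluate(expr, ctx):
    """Three-valued evaluation of an ExpressionDef-style tree.

    ctx maps field -> loaded value; a missing field means "not loaded yet"
    and yields None (indeterminate) rather than an error, letting the
    caller load further data batches only when the result is undetermined.
    """
    op = expr[0]
    if op == "and":
        results = [evaluate(e, ctx) for e in expr[1:]]
        if False in results:
            return False                  # a definite False decides, even with unknowns
        return None if None in results else True
    if op == "or":
        results = [evaluate(e, ctx) for e in expr[1:]]
        if True in results:
            return True                   # a definite True decides
        return None if None in results else False
    if op == "not":
        r = evaluate(expr[1], ctx)
        return None if r is None else not r
    field, cmp, value = expr              # leaf triple: [field, op, value]
    if field not in ctx:
        return None                       # data not loaded -> indeterminate
    return ctx[field] == value if cmp == "=" else ctx[field] != value

policy = ["or",
          ["role", "=", "owner"],
          ["and", ["role", "=", "editor"], ["file_deleted", "=", False]]]

print(evaluate(policy, {"role": "owner"}))                          # True
print(evaluate(policy, {"role": "editor"}))                         # None
print(evaluate(policy, {"role": "editor", "file_deleted": False}))  # True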
  • 2026-04-21 — sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale (Figma's Databases team retrospective on scaling RDS Postgres ~100× since 2020. 2020: single Postgres on AWS's largest instance. End of 2022: ≈12 vertically-partitioned RDS Postgres instances + caching + read replicas. Vertical partitioning exhausted by (a) vacuum reliability impact on several-TB tables, (b) per-instance RDS IOPS ceiling approaching on highest-write tables, (c) CPU on hottest partitions. Late 2022: start in-house horizontal sharding. Build-vs-buy: explicit rejection of CockroachDB / TiDB / Spanner / Vitess / NoSQL on 18-month runway pressure + deep RDS operational expertise + complex relational model. Architecture: colos (tables sharing a shard key — UserID, FileID, OrgID — grouped with shared physical layout; cross-table joins + full transactions work when scoped to a single shard-key value), hash-of-shard-key routing (Snowflake-prefixed IDs would hotspot; hash trades range-scan efficiency for uniform distribution), logical sharding decoupled from physical sharding via per-shard Postgres views over a single unsharded instance (<10% worst-case view overhead; percentage rollout feature-flag-gated; seconds-rollback); DBProxy as the Go router between application + PGBouncer (query parser / logical planner / physical planner / scatter-gather / load-shedding / request hedging / transaction support scoped to single shards); shadow application readiness runs the logical planner against live production traffic to pick a sharded-query subset covering 90% of queries without worst-case scatter-gather engine complexity (all range-scans + point-queries allowed; joins only within the same colo on the shard key); full (not filtered) logical replication during reshards. First horizontally-sharded table September 2023: 10 seconds partial primary availability, no replica impact, no latency/availability regressions. 9 months end-to-end for that first table. Open work: horizontally-sharded schema updates, globally-unique ID generation for sharded PKs, atomic cross-shard transactions for business-critical paths, distributed globally-unique indexes, ORM compatible with horizontal sharding, fully-automated one-click reshards. Explicit future-scope: re-evaluate in-house RDS horizontal sharding vs NewSQL / managed alternatives once runway is bought — the choice was shaped by the deadline, not by long-term preference. No latency / throughput / cost / shard-count numbers disclosed.)
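The routing and decoupling choices can be sketched in a few lines — shard counts, hash function, and names are illustrative assumptions, not Figma's implementation:

```python
import hashlib

NUM_LOGICAL_SHARDS = 64   # fixed logical shard count (illustrative)

def logical_shard(shard_key) -> int:
    # Hash the shard key (e.g. FileID) so that time-ordered,
    # Snowflake-style IDs spread uniformly instead of hotspotting the
    # newest shard. The trade: a range scan over the key no longer
    # lands on one shard.
    digest = hashlib.md5(str(shard_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS

# Logical→physical decoupling: many logical shards map onto a few
# physical databases; a reshard only rewrites this mapping table while
# the hash function (and every application-visible key) stays fixed.
physical_of_logical = {ls: ls % 4 for ls in range(NUM_LOGICAL_SHARDS)}

def route(shard_key):
    ls = logical_shard(shard_key)
    return ls, physical_of_logical[ls]
```

This is the shape that lets the per-shard-Postgres-views stage run: all 64 logical shards can initially map to one physical instance, then be peeled off instance by instance.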
  • 2026-04-21 — sources/2026-04-21-figma-how-we-built-ai-powered-search-in-figma (Figma's AI-powered search (shipped at Config 2024) combining visual search (query by screenshot / selected frame / sketch — reverse-image-search lineage) and semantic search (natural text against component names/descriptions/files even when terminology doesn't match). Origin story: June 2023 three-day AI hackathon produced 20 projects including a working design autocomplete prototype. RAG argument framed search as the prerequisite ("we can improve AI outputs with examples from search"). User research on the autocomplete prototype revealed 75% of Figma-canvas objects come from other files — search became the higher-leverage ship. Three product use cases: frame lookup (exact), frame variations (near-similar), broad inspiration (diverse). Indexing policy: patterns/selective-indexing-heuristics stacks (a) UI-shape dimensions filter (top-level frames that look like UI), (b) non-top-level exception for frames meeting the right conditions, (c) near-duplicate collapsing, (d) file-copy skipping, plus experimental ready-for-development quality signals — framed as "we couldn't index everything, it would be too costly." Plus [[patterns/edit-quiescence-indexing|4h no-edit quiescence window]] before indexing (WIP exclusion + load reduction). Quality bar: deliver across similarity tiers simultaneously because users start from close matches — "if we couldn't prove we could find the needle in the haystack, designers wouldn't trust the feature for broader exploration." Eval tool built on Figma's own public plugin API + infinite canvas + keyboard shortcuts for rapid correct/incorrect marking + historical run-to-run comparison (patterns/visual-eval-grading-canvas); eval set seeded from internal-designer interviews + file-browser usage analysis. 
Surfaced in Actions panel (narrower width) with peek previews + CMD+Enter full-screen drill-down; "rabbit holing" deeper-dive interaction explored and scrapped for simplicity. Shipping principles: AI-for-existing-workflows / rapid iteration / systematic quality checks / cross-disciplinary teamwork. Future work: bring to Figma Community; design autocomplete ship. No architecture / infra / embedding-model / vector-store / latency / NDCG numbers disclosed — product-led post, not systems-led. Introduces systems/figma-ai-search, patterns/selective-indexing-heuristics, patterns/edit-quiescence-indexing, patterns/visual-eval-grading-canvas, concepts/similarity-tier-retrieval; extends concepts/vector-embedding, concepts/vector-similarity-search, concepts/relevance-labeling, patterns/hackathon-to-platform, patterns/prototype-before-production.)
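The stacked indexing filters read naturally as a predicate pipeline. A hedged sketch — the 4h quiescence window is from the post, but the dimension threshold, field names, and duplicate-detection mechanism are all hypothetical stand-ins:

```python
from datetime import datetime, timedelta

QUIESCENCE = timedelta(hours=4)   # no-edit window from the post
MIN_W, MIN_H = 200, 200           # "looks like UI" size filter (made-up numbers)

def should_index(frame, seen_hashes, now):
    """Stack the selective-indexing filters in the order the post lists them:
    UI-shape dimension filter, near-duplicate/file-copy collapsing, then the
    edit-quiescence window to exclude work-in-progress."""
    if frame["top_level"] and (frame["w"] < MIN_W or frame["h"] < MIN_H):
        return False                       # doesn't look like a UI screen
    if frame["content_hash"] in seen_hashes:
        return False                       # near-duplicate or file copy
    if now - frame["last_edited"] < QUIESCENCE:
        return False                       # still being edited — skip for now
    seen_hashes.add(frame["content_hash"])
    return True
```

Each filter cheaply shrinks the candidate set before the expensive embedding step — the "we couldn't index everything" cost argument in executable form.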
  • 2024-08-08 — sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months (Core compute platform migrated from AWS ECS to EKS in under 12 months, Q1 2023 plan → majority-cutover Jan 2024. Principles: tight migration scope (swap substrate, preserve abstraction) + explicit fast-follows (KEDA pod-autoscaling, Vector log forwarding, Graviton, service mesh) + three-active-cluster blast-radius reduction topology + single-step Bazel-config service-definition replacing Terraform-template + deploy two-step + load-test-at-scale "Hello World" at largest-service pod count + weighted-DNS service-by-service traffic cutovers + golden-path-with-escapes UX. Post-migration tooling-UX regression from 3-cluster + RBAC addressed by auto-inferring cluster+role. Enabling condition: Figma's small service count — "not a microservices company." CoreDNS destruction incident on one cluster cost 1/3 of requests instead of full outage.)
  • 2026-04-21 — sources/2026-04-21-figma-rebuilt-foundations-of-component-instances (Year-long client-architecture rewrite replacing the 2016-era Instance Updater with Materializer — a generic framework for maintaining derived subtrees of the document tree from feature-owned blueprints. Component instances become one blueprint; rich text nodes are the first net-new feature built on it; slots (open beta April 2026) composes on top rather than re-implementing reactivity. Reactivity model: explicit choice of push-based invalidation + automatic dependency tracking; pull-based rejected because cross-tree references + deep nesting force reconstructing dep chains on every read. A parallel runtime-orchestration effort unified layout / variable / instance / constraint subsystems under a common framework in predictable execution order, surfacing hidden feedback loops Figma calls "back-dirties"; making them explicit let many be eliminated, moving the client toward unidirectional flow (patterns/runtime-orchestration-unidirectional-flow). Rolled out behind months of side-by-side runtime validation against hundreds of thousands of real production files — compared data model + rendered output + performance, gate: both correctness and performance matched before flip. Canonical reported impact: variable-mode changes in large files 40–50% faster, "representative of broader gains." Perhaps biggest return framed as developer velocity: rich text + slots + other in-progress features ship on the shared framework instead of each reimplementing reactivity. Third Figma client reactive graph over the same object-tree document model after QueryGraph (node deps) and Parameter Runtime (parameter-to-bound-property edges) — Materializer indexes source-of-truth → derived-subtree edges with automatic dep tracking.)
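Push-based invalidation with automatic dependency tracking is a general pattern worth pinning down. A minimal sketch (names and structure are illustrative — Materializer itself is C++ over the document tree, not this): reads performed while running a blueprint are recorded as dependencies, and a later write pushes a dirty mark to dependents instead of forcing re-derivation on every read:

```python
class Store:
    """Tiny push-based reactivity sketch: blueprints' reads are auto-tracked,
    and writes push invalidations to the derived values that read them."""
    def __init__(self):
        self.values = {}
        self.dependents = {}      # key -> set of derived names that read it
        self.dirty = set()
        self._tracking = None

    def read(self, key):
        if self._tracking is not None:
            # Automatic dependency tracking: record who is reading.
            self.dependents.setdefault(key, set()).add(self._tracking)
        return self.values[key]

    def write(self, key, value):
        self.values[key] = value
        # Push-based invalidation: dirty every dependent derived subtree.
        self.dirty |= self.dependents.get(key, set())

    def materialize(self, name, blueprint):
        self._tracking = name     # track reads made by this blueprint
        try:
            result = blueprint(self)
        finally:
            self._tracking = None
        self.dirty.discard(name)  # freshly derived, no longer dirty
        return result

store = Store()
store.write("base/width", 100)
store.materialize("instance/width", lambda s: s.read("base/width") * 2)
store.write("base/width", 150)
print(store.dirty)   # {'instance/width'} — pushed, not recomputed on read
```

The pull-based alternative the post rejects would instead walk the dependency chain on every read of `instance/width` — exactly the cost that cross-tree references and deep nesting make prohibitive.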
  • 2026-04-21 — sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale (LiveGraph 100x: re-architecture of Figma's real-time GraphQL-like data-fetching service to absorb ~100× growth in sessions + DB-update volume. Sessions tripled since 2021; view requests 5× in the last year. Five named structural failures of the old one-server-with-in-memory-mutation-based-cache design: excessive fan-out, excessive fan-in, tight coupling of reads + updates, fragmented caches, large blast radius from transient shard failures (global-order assumption meant one slow shard stalled all optimistic updates across the product). Two insights unlocked the new design: (1) most LiveGraph traffic is initial reads, not live updates — so invalidation-based caching (re-query on change) is viable; (2) given Figma's schema, most queries are easy to invalidate from the mutation alone — so the invalidator can be stateless. New architecture = three independently-scaling Go services (patterns/independent-scaling-tiers): edge (sessions / view-query expansion / cache subscription / refetch on invalidation), cache (read-through, sharded by hash(easy-expr), cuckoo-filter fan-out of invalidations to edges, hot replicas on standby, deploy decoupled from edge deploy — eliminates thundering herd class), invalidator (sharded like physical DB, tails WAL logical replication stream per shard as CDC, no per-query state). Query shapes: un-parameterized queries have stable IDs; a live query = (shape_id, args); mutations substitute column values into shapes to pop affected queries mechanically. 
~700 shapes, only 11 "hard" (range / inequality predicates — potentially infinite fan-out); handled via patterns/nonce-bulk-eviction (co-locate all hard queries with the same easy-expr on one cache shard via hash(easy-expr) sharding; two-layer keys {easy-expr}→nonce + {easy-expr}-{nonce}-{hard-expr}→results; invalidate by deleting the nonce → all hard-query keys orphaned atomically; TTL reaps orphans; edge re-queries only active session hard-queries). Schema discipline: all queries normalize to (easy-expr) AND (hard-expr). Concurrency correctness via read-invalidation rendezvous: (1) same-type coalescing of reads/invalidations, (2) inflight reads interrupted by invalidation must not allow new readers to coalesce onto the stale result, (3) inflight invalidations block racing reads from setting the cache. Validated via chaos test + online cache verification (random-sample cache vs primary DB) + convergence checker against old engine (old engine often seconds slower — required fine-grained tuning). Migration targets the least-scalable cache tier first; rollout mechanics deferred to Braden Walker's Systems@Scale talk (not ingested). No numbers disclosed beyond shape-count (700 / 11) and growth multiples (3× sessions / 5× views / 100× target). Future projects named: automatic invalidator re-sharding, non-Postgres source resolution in the cache, first-class server-side computation like permission evaluation in the cache (crosses paths with systems/figma-permissions-dsl). 
Introduces systems/livegraph, concepts/invalidation-based-cache, concepts/query-shape, concepts/read-invalidation-rendezvous, concepts/thundering-herd, patterns/stateless-invalidator, patterns/nonce-bulk-eviction, patterns/independent-scaling-tiers; extends systems/dbproxy-figma, systems/postgresql, systems/figma-multiplayer-querygraph (sibling real-time system over a very different data shape), concepts/push-based-invalidation (server-tier instance at DB-row granularity), concepts/change-data-capture (WAL-driven CDC consumer), concepts/wal-write-ahead-logging (Postgres logical replication tap).)
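The nonce-bulk-eviction key scheme is mechanical enough to sketch directly from the post's description (a dict stands in for the cache shard; key formats and names are illustrative):

```python
import uuid

# Two-layer key scheme for "hard" (range/inequality) queries: all hard
# queries sharing an easy-expr live on one cache shard, keyed
# {easy-expr}-{nonce}-{hard-expr}. Invalidation deletes only the nonce,
# atomically orphaning every hard-query entry; a TTL would reap orphans.
cache = {}

def nonce_key(easy):
    return f"{easy}→nonce"

def get_hard(easy, hard, compute):
    nonce = cache.setdefault(nonce_key(easy), uuid.uuid4().hex)
    key = f"{easy}-{nonce}-{hard}"
    if key not in cache:
        cache[key] = compute()            # cache miss under current nonce
    return cache[key]

def invalidate(easy):
    # One delete orphans all hard entries for this easy-expr — no need to
    # enumerate the potentially unbounded set of hard predicates.
    cache.pop(nonce_key(easy), None)

r1 = get_hard("file_id=42", "updated_at>100", lambda: ["rowA"])
invalidate("file_id=42")
r2 = get_hard("file_id=42", "updated_at>100", lambda: ["rowB"])
print(r1, r2)   # ['rowA'] ['rowB'] — second read recomputed under a new nonce
```

The hash(easy-expr) sharding mentioned in the notes is what makes this sound: the nonce and all its hard-query entries are guaranteed to live on the same shard, so the single-key delete is atomic with respect to them.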

  • 2026-04-21 — sources/2026-04-21-figma-the-search-for-speed-in-figma (performance retrospective on Figma's traditional full-text search after migrating from Elasticsearch to managed OpenSearch in late 2023. Three months of debugging yielded ~60% API-latency reduction, ≥50% max-QPS headroom, >50% cost cut. Key findings: (1) the DataDog "average search = 8 ms" was per-shard, not per-query — with up to ~500 per-shard queries fanning out per user query, coordinator-view latency was actually 150 ms avg / 200–400 ms p99 (canonical concepts/metric-granularity-mismatch; fix = publish the took response-body field as a custom metric). (2) Pre-/post-processing (permissions filter build + per-result permission re-check) was >70% of total time; Ruby runtime-type-safety checks in the permissions path were a real cost. (3) Thread-local DB connection-pool starvation was eating tens of ms per query across all of Figma — fix unlocked previously-abandoned parallel-DB-read experiments retroactively. (4) Index data was bloated, not queries: trim 50% then additional 90% of unused fields with no relevancy impact; the win was fitting the live set in the OS disk cache (concepts/cache-locality at the page-cache tier). (5) 450 → 180 shards (−60%) increased max QPS ≥50% and decreased P50 — documented as patterns/fewer-larger-shards-for-latency; AWS's log-workload sizing guidance doesn't fit latency-sensitive document search with effective pre-filters. (6) Node mix swap: 1/3 CPU + 25% more RAM at ≈1/2 price — CPU was idle, RAM was the constraint. (7) opensearch-benchmark was unusable (vendor-regression-testing tool, client-side latency only); a custom Go harness written in an afternoon produced consistent server-side-took measurements and drove the shard sweep (patterns/custom-benchmarking-harness). (8) Neutral-to-negative: zstd compression was a wash; concurrent segment search added latency even at low QPS. Sibling to the 2026-04-21 AI-search post — same OpenSearch substrate, different query shape.)
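The metric-granularity mismatch in finding (1) has a simple mechanism: a user query completes at its *slowest* shard, so coordinator latency tracks the max over the fan-out, while the dashboard averaged over individual shard queries. A synthetic simulation (latency values invented, not Figma's) shows how a single-digit per-shard mean coexists with a tail-dominated per-query mean:

```python
import random

random.seed(0)

def shard_latency():
    # Mostly-fast shards with an occasional slow one (made-up heavy tail, ms).
    return random.choice([5, 6, 7, 8, 9, 10] * 20 + [80, 150])

per_shard, queries = [], []
for _ in range(500):
    fanout = [shard_latency() for _ in range(500)]  # ~500 per-shard queries
    per_shard.extend(fanout)                        # what the dashboard averaged
    queries.append(max(fanout))                     # what the user experienced

per_shard_avg = sum(per_shard) / len(per_shard)     # single-digit ms
per_query_avg = sum(queries) / len(queries)         # dominated by the slowest shard
print(round(per_shard_avg), round(per_query_avg))
```

With ~500-way fan-out, almost every query hits at least one slow shard, so the per-query average sits an order of magnitude above the per-shard average — which is why publishing the server-side `took` value per query, rather than per shard, was the fix.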

Last updated · 200 distilled / 1,178 read