

LiveGraph

What it is

LiveGraph is Figma's real-time data-fetching service: a GraphQL-like web API in which clients subscribe to queries and receive a JSON tree that stays up-to-date as the underlying data changes. It is not the same system as Figma's document-sync layer (Multiplayer + QueryGraph, which handles the file canvas / editor object tree); LiveGraph powers everything else: comments, FigJam voting sessions, file lists, team membership, optimistic UI for user actions. A custom React hook re-renders the front end automatically on each update, so engineers write no subscription or diffing code.

LiveGraph learns about changes by tailing the PostgreSQL WAL logical replication stream — the same stream that keeps replicas up-to-date. It is therefore a CDC consumer by construction.

Two eras

Old architecture (pre-100x; through ~2023)

  • One server with a large in-memory query cache.
  • Cache was mutation-based: every row change flowing through the replication stream was applied directly to the cached query result (no re-query).
  • Tailing a single globally-ordered replication stream — reasonable when Figma ran on one primary Postgres instance.
  • Simple, but it broke at scale:
      • Sessions tripled since 2021; view requests grew 5× in the last year.
      • Database moved to vertical partitioning (≈12 RDS shards) — global order broke. Stopgap: artificially combine all replication streams into one ordered stream, preserving the mutation-based-cache assumption.
      • Database now moving to horizontal sharding via DBProxy — the stopgap doesn't scale there.

Five named structural failures (sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale):

  1. Excessive fan-out — every mutation broadcast to every server.
  2. Excessive fan-in — every server processed every shard's updates.
  3. Tight coupling of read traffic and update traffic (one scaling lever: fleet size).
  4. Fragmented caches — clients of the same view hit different servers; deploys wiped in-server caches → thundering herd on the DB on every release.
  5. Large blast radius from transient shard failures — one slow shard stalled all optimistic updates across the product (comments unavailable for everyone even though only one shard was slow), because the global stream could advance only if every shard was producing updates.

New architecture ("LiveGraph 100x"; 2024–2026, rolling out)

Three Go services, each horizontally scalable on a different axis (patterns/independent-scaling-tiers):

1. Edge

  • Client-facing WebSocket endpoint.
  • Expands client view requests into multiple underlying queries.
  • Subscribes those queries to the cache.
  • On invalidation, re-fetches from the cache (which in turn hits the DB if needed), reassembles the view, and pushes it back to the client.
  • Holds per-session subscription state (active queries).
  • Deployed separately from the cache (see "no more thundering herd").
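
The expand-subscribe-reassemble loop the bullets describe can be sketched as follows. This is a shape sketch only: the view name and underlying query shapes are invented for illustration, not Figma's actual schema or Edge API.

```go
package main

import "fmt"

// Query is a hypothetical underlying query: a parameterized shape plus
// its bound parameter value.
type Query struct{ Shape, Param string }

// expandView turns one client view request into its underlying queries,
// each of which Edge would subscribe to the cache tier. The view and
// shape names here are invented examples.
func expandView(view, fileID string) []Query {
	switch view {
	case "commentThread":
		return []Query{
			{Shape: "comments_by_file", Param: fileID},
			{Shape: "reactions_by_file", Param: fileID},
		}
	}
	return nil
}

func main() {
	for _, q := range expandView("commentThread", "file123") {
		fmt.Printf("subscribe %s(%s)\n", q.Shape, q.Param)
	}
}
```

Edge holds this expansion (the per-session subscription state) so that on an invalidation it knows exactly which queries to re-fetch before reassembling the view.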

2. Cache

  • Read-through query cache, storing database query results.
  • Sharded by hash(easy-expr) (see query shapes and nonce-bulk-eviction) — not by query hash alone — so all "hard" queries with the same easy expression land on the same node and can be invalidated together.
  • Agnostic to DB topology — does not know how the DB is sharded; only cares about its own hash range.
  • On invalidation: evicts affected entries, forwards the invalidation to subscribed edges via a cuckoo filter (probabilistic filter, cheaper than tracking every active subscription per edge).
  • Hot replicas on standby + cache-deploy decoupled from edge-deploy → eliminates the old thundering-herd-on-deploy class of incident.
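
A minimal sketch of the sharding rule: routing keys on the easy expression alone, so every hard variant of the same easy expression co-locates on one cache node. The FNV hash is an assumption; the post does not name the hash function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor routes a query to a cache shard by hashing only its easy
// expression. Queries that differ only in their hard expression
// (e.g. different created_at ranges) therefore land on the same node
// and can be invalidated together.
func shardFor(easyExpr string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(easyExpr))
	return int(h.Sum32()) % numShards
}

func main() {
	// Both "hard" variants of file_id=123 route identically, because
	// the hard expression never enters the routing key.
	fmt.Println(shardFor("file_id=123", 16) == shardFor("file_id=123", 16))
}
```

This is also why the cache can stay agnostic to DB topology: its only routing input is its own hash range over easy expressions.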

3. Invalidator

  • Stateless (patterns/stateless-invalidator) — does not track which queries are currently subscribed. All it needs is the schema and the stream of row mutations.
  • Sharded the same way as the physical DBs — the only LiveGraph service that knows DB topology.
  • Tails a single WAL replication stream per DB shard.
  • For each row mutation's pre/post image, iterates query shapes in the schema, substitutes column values into each shape's parameters, and emits invalidation messages to the relevant cache shards (never broadcasts).
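
The substitution step can be sketched like this. The Shape and RowMutation types are assumptions about shape, not Figma's code; the point is that shapes come from the schema, not from live subscriptions, which is what keeps the invalidator stateless. Note that the pre-image matters: a row that moves between files must invalidate queries for both the old and the new file_id.

```go
package main

import "fmt"

// Shape is a parameterized easy-expression query from the schema,
// e.g. "comments where file_id = ?".
type Shape struct {
	Name   string
	Table  string
	Column string // the equality-predicate column
}

// RowMutation carries the pre- and post-images read from the WAL.
type RowMutation struct {
	Table     string
	Pre, Post map[string]string
}

// invalidations substitutes column values from both images into every
// matching shape, deduplicating so an in-place update emits one key.
func invalidations(shapes []Shape, m RowMutation) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range shapes {
		if s.Table != m.Table {
			continue
		}
		for _, img := range []map[string]string{m.Pre, m.Post} {
			if v, ok := img[s.Column]; ok {
				key := fmt.Sprintf("%s(%s=%s)", s.Name, s.Column, v)
				if !seen[key] {
					seen[key] = true
					out = append(out, key)
				}
			}
		}
	}
	return out
}

func main() {
	shapes := []Shape{{Name: "commentsByFile", Table: "comments", Column: "file_id"}}
	m := RowMutation{
		Table: "comments",
		Pre:   map[string]string{"file_id": "A"},
		Post:  map[string]string{"file_id": "B"},
	}
	fmt.Println(invalidations(shapes, m)) // [commentsByFile(file_id=A) commentsByFile(file_id=B)]
}
```

Each emitted key then routes to exactly one cache shard via the easy-expression hash, so nothing is broadcast.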

The two key insights that unlocked the design

  1. "LiveGraph traffic is driven by initial reads." When instrumented, the majority of bytes shipped to clients came from initial loads, not from live updates. Invalidation-based caching (fetch-on-change) was therefore viable — re-querying on invalidation is rare in practice. This was non-obvious given LiveGraph's raison d'être is real-time updates.

  2. "Most LiveGraph queries are easy to invalidate." A schema inspection tool found that in almost all cases, given a row mutation, you can determine which queries to refetch from the schema alone, without knowing which queries are subscribed. Specifically, equality-predicate queries ("easy") substitute mutation column values into the shape → exactly the affected parameterized queries pop out. This is what makes the invalidator stateless — and is contrasted with Asana's Worldstore, which is designed quite differently because its query structure doesn't admit the same shortcut.

Easy vs hard queries

(query-shape expands on this.)

All LiveGraph queries are forced to normalize to:

(easy-expr) AND (hard-expr)
  • Easy expression — equality predicates only (file_id = ?, author_id = ?). Exactly one parameterized query per mutation. Invalidate directly.
  • Hard expression — ranges / inequalities (created_at > ?, score BETWEEN ? AND ?). Infinite fan-out in principle: any cached range covering the mutated value could be affected (e.g. every "before the new created_at" query). Only ~11 of ~700 query shapes in Figma's schema are hard.
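
A toy classifier for the normalization rule. The Predicate representation is invented; the only point it illustrates is that the easy/hard split is decidable from operators alone, which is what a schema tool can check statically.

```go
package main

import "fmt"

// Predicate is a single WHERE-clause condition.
type Predicate struct {
	Column, Op string // Op: "=", ">", "<", "BETWEEN", ...
}

// splitEasyHard normalizes a query's predicates into the
// (easy-expr) AND (hard-expr) form: equality predicates are easy,
// everything else is hard.
func splitEasyHard(preds []Predicate) (easy, hard []Predicate) {
	for _, p := range preds {
		if p.Op == "=" {
			easy = append(easy, p)
		} else {
			hard = append(hard, p)
		}
	}
	return
}

func main() {
	easy, hard := splitEasyHard([]Predicate{
		{"file_id", "="},     // easy: invalidate directly
		{"created_at", ">"},  // hard: handled via nonce indirection
	})
	fmt.Println(len(easy), len(hard)) // 1 1
}
```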

Nonce-indirection trick for hard queries

(patterns/nonce-bulk-eviction)

Cache sharded by hash(easy-expr) (not hash(query)) so all hard queries with the same easy-expr live on the same cache instance. Two-layer key:

  • Top-level key: {easy-expr} → stores a nonce (UUID).
  • Actual key: {easy-expr}-{nonce}-{hard-expr} → stores the DB result.

Invalidation semantics:

  • Invalidator emits invalidations for easy expressions only.
  • Cache handles the invalidation by deleting the top-level key → the nonce is gone → all hard queries whose keys embed that nonce are orphaned in one atomic delete, even though there might be many of them.
  • Future reads fetch a new nonce; old entries aged out by TTL.
  • Edge re-queries only its session's active hard queries — bounding the fan-out to active subscribers, not the theoretically-infinite hard-query space.

Trade-off: over-invalidation of hard queries (all of them, even ones that mathematically weren't affected). Cheap because active hard-query counts are small and invalidations are rare.
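
A minimal in-process sketch of the two-layer keying, with plain maps standing in for the cache and a counter standing in for the UUID nonce (for determinism). It shows the one property that matters: deleting the top-level key orphans every hard-query entry at once.

```go
package main

import "fmt"

// NonceCache implements the two-layer key scheme:
// top-level key {easy-expr} -> nonce, result key {easy-expr}-{nonce}-{hard-expr}.
type NonceCache struct {
	nonces  map[string]int    // easy-expr -> current nonce
	results map[string]string // full key -> cached DB result
	next    int
}

func NewNonceCache() *NonceCache {
	return &NonceCache{nonces: map[string]int{}, results: map[string]string{}}
}

// key builds the full result key, allocating a fresh nonce on first use
// (or after an invalidation deleted the old one).
func (c *NonceCache) key(easy, hard string) string {
	n, ok := c.nonces[easy]
	if !ok {
		c.next++
		n = c.next
		c.nonces[easy] = n
	}
	return fmt.Sprintf("%s-%d-%s", easy, n, hard)
}

func (c *NonceCache) Set(easy, hard, result string) { c.results[c.key(easy, hard)] = result }

func (c *NonceCache) Get(easy, hard string) (string, bool) {
	r, ok := c.results[c.key(easy, hard)]
	return r, ok
}

// Invalidate deletes only the top-level key. Entries embedding the old
// nonce become unreachable; the real system ages them out by TTL.
func (c *NonceCache) Invalidate(easy string) { delete(c.nonces, easy) }

func main() {
	c := NewNonceCache()
	c.Set("file_id=1", "created_at>100", "rows...")
	_, hit := c.Get("file_id=1", "created_at>100")
	fmt.Println(hit) // true
	c.Invalidate("file_id=1")
	_, hit = c.Get("file_id=1", "created_at>100")
	fmt.Println(hit) // false: new nonce, old entry orphaned
}
```

The over-invalidation trade-off is visible here: every hard entry under the easy expression is orphaned, affected or not.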

Read-invalidation rendezvous

(concepts/read-invalidation-rendezvous)

Correctness contract: invalidations are never skipped, and reads must converge on post-invalidation data eventually.

Problem: an in-flight cache read concurrent with an invalidation is ambiguous — its result could be from before or after the invalidation.

Rules enforced by a synchronization layer above the in-memory cache:

  1. Same-type operations can coalesce (two concurrent reads on the same key share one DB round-trip; two invalidations dedupe). Cuts capacity spikes.
  2. Reads interrupted by an invalidation are marked invalidated; ongoing cache-set completes; new readers caused by the invalidation must not coalesce with the invalidated reader (else the invalidation is effectively skipped upstream).
  3. Invalidations in flight block concurrent reads from setting the cache (reads might race the delete and leave stale values behind).
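
One way to enforce rules 2 and 3 is a per-key generation counter: an invalidation bumps the generation, and a read may install its result only if the generation it started under is still current. This is an assumed mechanism for illustration; the post does not publish the rendezvous algorithm.

```go
package main

import (
	"fmt"
	"sync"
)

// Entry is one cache slot guarded by a generation counter.
type Entry struct {
	mu    sync.Mutex
	gen   int
	value string
	valid bool
}

// BeginRead snapshots the generation before the DB fetch starts.
func (e *Entry) BeginRead() int {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.gen
}

// CompleteRead installs the fetched value only if no invalidation
// arrived while the read was in flight; a stale fill is dropped, so the
// invalidation is never effectively skipped.
func (e *Entry) CompleteRead(gen int, value string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if gen != e.gen {
		return false // read was invalidated mid-flight; discard
	}
	e.value, e.valid = value, true
	return true
}

// Invalidate evicts the value and bumps the generation, cutting off any
// in-flight reads that started earlier.
func (e *Entry) Invalidate() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.gen++
	e.valid = false
}

func main() {
	var e Entry
	g := e.BeginRead()              // read starts against the DB
	e.Invalidate()                  // concurrent invalidation lands
	ok := e.CompleteRead(g, "stale rows")
	fmt.Println(ok, e.valid)        // false false: stale fill was blocked
}
```

Coalescing (rule 1) would layer on top: readers sharing a generation share one fetch, and a generation bump is exactly the signal that new readers must not join the old one.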

Validation

  • Chaos test — many threads, small key set, high concurrency; runs pre-ship.
  • Online cache verification — periodic random samples compared against the primary DB; reports whether an invalidation was skipped.
  • Convergence checker between old and new engines during migration — side-by-side validation; old-engine queries sometimes seconds slower, requiring fine-grained tuning.

Why this architecture eliminates the old failures

Old failure → fix:

  • Excessive fan-out → invalidations are targeted shard-to-shard; edges filter via cuckoo filter.
  • Excessive fan-in → each cache shard consumes only its hash-range invalidations; each invalidator shard sees only its DB shard's WAL.
  • Tight coupling of reads/updates → three independent scaling axes (patterns/independent-scaling-tiers).
  • Fragmented caches + deploy thundering herd → cache is now global and hash-sharded; hot replicas; cache-deploy decoupled from edge-deploy.
  • Transient-shard blast radius → no global-order requirement; a slow shard stalls only its own invalidations.

What LiveGraph integrates with

  • Underlying DB — RDS Postgres via DBProxy, vertically partitioned and horizontally sharded. LiveGraph learns of changes by tailing WAL-based logical replication.
  • Permissions — server-side permission checks cross into systems/figma-permissions-dsl's LiveGraph-TypeScript evaluator (historic source of cross-platform drift, solved by the DSL; see sources/2026-04-21-figma-how-we-built-a-custom-permissions-dsl). Future work names "first-class support for server-side computation like permissions" in LiveGraph cache itself.
  • Front-end — custom React Hook auto-re-renders on subscription updates; engineers don't write subscription/diff code.

Relation to Multiplayer + QueryGraph

  • LiveGraph = data-fetching service for non-document state (comments, file lists, voting, etc.) backed by Postgres.
  • Multiplayer + QueryGraph = real-time-collaboration server for document canvas state (vectors, layers, auto-layout) backed by in-memory per-process state.

They are structurally different systems for different data shapes. Multiplayer is a one-process-per-document, CRDT-inspired, memory-resident file graph. LiveGraph is DB-backed GraphQL-over-WebSocket with shared cache/invalidator tiers. Both exhibit push-based invalidation, but at very different granularities and with different source-of-truth contracts.

What the post does not disclose

  • Latency / throughput / cost numbers.
  • Cache hit rate numbers.
  • Invalidation rate / invalidation fan-out per mutation.
  • Rendezvous algorithm pseudocode.
  • Cuckoo-filter parameters (target false-positive rate, filter size).
  • Invalidator restart / WAL re-tailing protocol.
  • Edge-subscription fan-out and per-edge memory footprint.
  • Migration mechanics (percentage rollout / feature flags / etc. — deferred to conference talk).
  • How the (easy-expr) AND (hard-expr) schema rule is enforced going forward (CI? code review? schema compiler?).
  • How schema evolution (new query shapes) propagates to services.

Future work (named in the post)

  • Automatic re-sharding of invalidators.
  • Resolving queries from non-Postgres sources in the cache.
  • First-class server-side computation (e.g. permissions) inside the cache tier.
