Keeping It 100(x) With Real-time Data At Scale¶
Summary¶
Figma re-architected LiveGraph — its real-time, GraphQL-like data-fetching service — as a multi-tier system ("LiveGraph 100x") to absorb 100× growth in client sessions and database-update volume. The old design was a single tier of Go servers, each holding an in-memory mutation-based cache and tailing one globally-ordered Postgres replication stream; it worked at one-primary-Postgres scale but broke under vertical partitioning (no global order) and the approach of horizontal sharding (DBProxy era; see sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale). The new design separates three services — edge (client sessions + query expansion), cache (read-through, sharded by query hash), and invalidator (stateless, sharded like the database; tails the replication streams and generates invalidations) — and rests on two insights: most LiveGraph traffic is initial reads (live updates are a small tail), and most queries are easy to invalidate from row mutations alone, without knowing which queries are currently subscribed. Together these allowed an invalidation-based cache with a stateless invalidator: cleanly decoupled scaling axes, no more thundering herd on deploys, and native support for DB sharding.
Key takeaways¶
- Growth profile: client sessions tripled since 2021; view requests grew 5× in the last year alone. The database beneath LiveGraph shifted from a single Postgres to many vertical + horizontal shards — LiveGraph had to follow.
- The old architecture's five fundamental limits (explicitly enumerated in the post):
  - Excessive fan-out (mutations broadcast to every server).
  - Excessive fan-in (every server processes every shard's updates).
  - Tight coupling of reads and updates (the only scaling lever was a bigger fleet).
  - Fragmented caches (clients of the same view hit different servers; caches wiped on deploy → thundering herd on the DB).
  - Large blast radius from transient shard failures — the global-order assumption meant a single slow shard stalled all optimistic updates across the product, even for unrelated shards.
- Two key insights unlocked the new design (Source: sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale):
  - "LiveGraph traffic is driven by initial reads" — measured: most result bytes come from initial loads, not live updates, and live updates on active queries are infrequent → LiveGraph can afford invalidation-based re-query on change without overloading the DB.
  - "Most LiveGraph queries are easy to invalidate" — given a row mutation, the affected query shapes can be computed from the schema alone, without knowing which queries are actively subscribed → the invalidator can be stateless.
- New architecture (three services, all Go):
  - Edge — client-facing; expands view requests into queries; subscribes to the cache; re-fetches on invalidation; reconstructs loaded views for the client.
  - Cache — read-through; sharded by query hash; agnostic to DB topology; on invalidation, evicts and forwards the invalidation upstream, filtered through a cuckoo filter (a probabilistic, Bloom-filter-like set structure); hot replicas kept on standby; decoupling cache deploys from edge deploys eliminates the thundering herd.
  - Invalidator — the only service that knows DB topology; sharded the same way as the physical DBs; tails a single replication stream per shard (WAL-based logical replication, CDC); generates invalidations and sends them only to the relevant cache shards.
- Query shapes: each un-parameterized SQL query has a unique ID; a live query = (shape_id, arg_values). Given a row mutation's pre/post image, the invalidator substitutes column values into each schema query shape → emits invalidations for the affected parameterized queries. No active-query table needed.
- Easy vs. hard queries: equality predicates → "easy" (exactly one query invalidated per mutation); range predicates, greater-than, etc. → "hard" (potentially infinite affected queries). Figma's schema has ~700 query shapes; only 11 are hard. Hard queries are handled via nonce indirection: caches shard by hash(easy-expr), co-locating all hard queries that share an easy expression, and cache at two layers — a top-level {easy-expr} → nonce entry and {easy-expr}-{nonce}-{hard-expr} → results entries. Invalidation deletes the top-level nonce key → all hard queries with that easy expression are orphaned in one op. Edge re-queries only its session's active hard queries; TTL reaps orphans.
- Read-invalidation rendezvous: invalidations can never be skipped, and an invalidation arriving mid-read creates an ambiguity — was the read result produced before or after the invalidation? Three rules enforce eventual consistency:
  - Operations of the same type can coalesce (two concurrent reads join a single DB query; two invalidations dedupe).
  - Inflight reads interrupted by an invalidation are marked invalidated and wait for the existing cache-set to finish; new readers triggered by the invalidation must NOT coalesce onto the invalidated reader.
  - Inflight invalidations block in-progress reads from setting the cache (which would stamp stale results over a just-cleared entry).
  Validated by: (a) a chaos test (many threads, small key set, high concurrency), (b) online cache verification (random samples compared to the primary DB, reporting skipped-invalidation cases), (c) a convergence checker between the old and new engines for side-by-side validation during migration.
- Schema discipline — all queries must normalize to (easy-expr) AND (hard-expr). Hard expressions are not invalidated directly; caches shard by the easy-expr hash → all hard queries with the same easy-expr live on the same cache shard and are evicted together via the nonce trick. The cost: over-invalidation of hard queries — fine because active hard-query counts are small and invalidations are rare.
- Independent scaling axes (Source: sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale):
  - Invalidator scales with DB shard count + update rate.
  - Edge scales with client-session / view-request volume.
  - Cache scales with active-query count / hit-rate needs.
  Each axis can be adjusted without disproportionate fan-in/fan-out.
- Migration: targeting the old cache tier first (the least scalable piece). A convergence checker gated cutover — both old and new engines ran each query side by side, and traffic switched only after byte-exact matches. The authors note the old engine was often seconds slower than the new one, complicating convergence tuning.
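The nonce-indirection scheme for hard queries described above can be sketched in Go. This is an illustrative model only — `hardQueryCache`, its key format, and the in-process maps are hypothetical stand-ins for Figma's cache, not their implementation:

```go
package main

import "fmt"

// hardQueryCache models the two-layer scheme: a top-level {easy-expr} -> nonce
// entry, plus {easy-expr}-{nonce}-{hard-expr} -> results entries.
type hardQueryCache struct {
	nonces    map[string]int    // {easy-expr} -> current nonce
	results   map[string]string // "{easy-expr}-{nonce}-{hard-expr}" -> results
	nextNonce int
}

func newHardQueryCache() *hardQueryCache {
	return &hardQueryCache{nonces: map[string]int{}, results: map[string]string{}}
}

func (c *hardQueryCache) key(easy string, nonce int, hard string) string {
	return fmt.Sprintf("%s-%d-%s", easy, nonce, hard)
}

// Get returns a cached result only if it is still reachable via the
// current nonce for its easy expression.
func (c *hardQueryCache) Get(easy, hard string) (string, bool) {
	nonce, ok := c.nonces[easy]
	if !ok {
		return "", false
	}
	v, ok := c.results[c.key(easy, nonce, hard)]
	return v, ok
}

// Set stores a result under the current nonce, allocating one if needed.
func (c *hardQueryCache) Set(easy, hard, result string) {
	nonce, ok := c.nonces[easy]
	if !ok {
		c.nextNonce++
		nonce = c.nextNonce
		c.nonces[easy] = nonce
	}
	c.results[c.key(easy, nonce, hard)] = result
}

// Invalidate deletes only the top-level nonce key: every hard query sharing
// this easy expression is orphaned in one operation. In a real cache the
// orphaned result entries would be reaped by TTL.
func (c *hardQueryCache) Invalidate(easy string) {
	delete(c.nonces, easy)
}

func main() {
	c := newHardQueryCache()
	c.Set("file_id = 42", "updated_at > $1", "rows...")
	v, ok := c.Get("file_id = 42", "updated_at > $1")
	fmt.Println(v, ok) // rows... true

	c.Invalidate("file_id = 42") // one delete orphans all hard queries
	_, ok = c.Get("file_id = 42", "updated_at > $1")
	fmt.Println(ok) // false: edge must re-query
}
```

The design choice this illustrates: the invalidator never enumerates hard queries; a single top-level delete makes every dependent entry unreachable at once.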
Architecture diagrams¶
Old architecture (pre-100x)¶
Client ──WebSocket──▶ LiveGraph Server (N instances)
│ │
│ └─ In-memory mutation-based query cache
│ (fragmented across N servers; blown
│ away on deploy)
│
▼
Combined replication stream
(artificial global order across
vertical shards — stopgap)
│
▼
RDS Postgres (vertical shards;
approaching horizontal)
Every server tails the combined stream and applies every row mutation to its cache. Fan-in = Σ(all shards' updates) per server. Fan-out = every server receives every mutation.
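The fan-in cost can be made concrete with illustrative numbers — the shard count, update rate, and fleet size below are assumptions for the arithmetic, not figures from the post:

```go
package main

import "fmt"

// oldFanIn computes per-server and fleet-wide mutation-apply rates for the
// pre-100x design, where every server tails the combined stream.
func oldFanIn(shards, updatesPerShard, servers int) (perServer, fleetWide int) {
	perServer = shards * updatesPerShard // each server sees every shard's updates
	fleetWide = servers * perServer      // the whole fleet repeats that work
	return
}

func main() {
	// Illustrative numbers only — the post discloses no rates.
	perServer, fleetWide := oldFanIn(20, 1000, 100)
	fmt.Println("old design, per-server applies/sec:", perServer) // 20000
	fmt.Println("old design, fleet-wide applies/sec:", fleetWide) // 2000000
	// New design: each invalidator shard tails only its own DB shard's
	// stream, so total invalidator work equals the total update rate.
	fmt.Println("new design, total applies/sec:", 20*1000) // 20000
}
```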
New architecture (100x)¶
Client ──WebSocket──▶ Edge (many, session-affined)
│ ▲
queries │ invalidations (filtered via
│ │ cuckoo filter on subscribed set)
▼ │
Cache (sharded by hash(easy-expr))
│ ▲
query SQL │ invalidations
│ │ (only to the right cache shard)
▼ │
RDS Postgres shards Invalidator
│ (stateless;
WAL/ sharded like DB;
logical knows query shapes;
replication ────────▶ substitutes pre/post
image into shape → emits
invalidations for affected
query parameterizations)
Every message goes shard-to-shard, never fleet-broadcast. The invalidator is the only service with DB topology knowledge — edge and cache are DB-topology-agnostic.
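The invalidator's substitution step can be sketched as follows. `shape`, `RowImage`, the pipe-delimited key format, and the shard-routing function are all hypothetical illustrations of the idea (equality-predicate shapes only), not Figma's types:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// RowImage is a mutation's pre- or post-image: column name -> value.
type RowImage map[string]string

// shape is an "easy" query shape: a table plus its equality-predicate
// columns. E.g. {"files", ["team_id"]} stands for
// "SELECT ... FROM files WHERE team_id = $1".
type shape struct {
	table string
	cols  []string
}

// invalidationsFor substitutes a mutation's pre/post images into every
// schema shape on the mutated table, yielding the cache keys to invalidate.
// No table of active queries is consulted — this is what keeps the
// invalidator stateless.
func invalidationsFor(table string, pre, post RowImage, shapes []shape) []string {
	seen := map[string]bool{}
	var keys []string
	for _, s := range shapes {
		if s.table != table {
			continue
		}
		// The pre-image catches rows moving out of a result set,
		// the post-image rows moving in.
		for _, img := range []RowImage{pre, post} {
			if img == nil {
				continue // inserts have no pre-image; deletes no post-image
			}
			key := s.table
			for _, c := range s.cols {
				key += "|" + c + "=" + img[c]
			}
			if !seen[key] {
				seen[key] = true
				keys = append(keys, key)
			}
		}
	}
	return keys
}

// cacheShardFor routes an invalidation to exactly one cache shard, assuming
// caches are addressed by key hash mod shard count.
func cacheShardFor(key string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	shapes := []shape{{table: "files", cols: []string{"team_id"}}}
	// A file moves from team 1 to team 2: both teams' queries must refresh.
	pre := RowImage{"team_id": "1"}
	post := RowImage{"team_id": "2"}
	for _, k := range invalidationsFor("files", pre, post, shapes) {
		fmt.Printf("invalidate %q on cache shard %d\n", k, cacheShardFor(k, 8))
	}
}
```

Note how each emitted key targets a single cache shard — shard-to-shard messaging, never a fleet broadcast.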
Numbers disclosed¶
- Sessions tripled since 2021.
- View requests 5× in the last year.
- ~700 query shapes in the schema; only 11 are hard (≈1.6%).
- Migration timeframe: "over the last year and half."
- No latency / throughput / cost / RPS numbers.
- No cache-hit-rate numbers.
- No invalidation-rate numbers.
Architectural levers named by the post¶
- Initial-reads dominate live-updates → invalidation cache OK.
- Schema-local invalidation computability → stateless invalidator.
- (easy-expr) AND (hard-expr) schema normalization → nonce indirection for hard queries.
- Deploy cache separately from edge → no more thundering herd.
- Hot cache replicas on standby → deploys don't cold-start the cache.
- Cuckoo filters for invalidation forwarding → bounded fan-out even with many subscribers per shard.
- Invalidations may over-fire on unaffected queries, but an invalidation must never be lost → read-invalidation rendezvous semantics.
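The rendezvous rule that an inflight invalidation must veto a stale cache-set can be sketched with a per-key generation counter. This is a minimal single-key illustration under assumed names (`entry`, `beginRead`, etc.), not Figma's implementation, and it omits read coalescing:

```go
package main

import (
	"fmt"
	"sync"
)

// entry sketches one cache key's read/invalidation rendezvous.
type entry struct {
	mu    sync.Mutex
	gen   uint64 // bumped by every invalidation
	value string
	valid bool
}

// beginRead reports a cache hit, or hands back the generation token a DB
// fetch must present before it may populate the cache.
func (e *entry) beginRead() (value string, hit bool, token uint64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.value, e.valid, e.gen
}

// completeRead installs a fetched value only if no invalidation arrived
// while the read was inflight — a stale result must never overwrite a
// just-cleared entry.
func (e *entry) completeRead(token uint64, v string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if token != e.gen {
		return false // invalidated mid-read: discard, caller re-queries
	}
	e.value, e.valid = v, true
	return true
}

// invalidate clears the entry and vetoes any inflight read's cache-set.
func (e *entry) invalidate() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.gen++
	e.valid = false
}

func main() {
	var e entry
	_, hit, token := e.beginRead() // miss: go fetch from the DB
	fmt.Println("hit:", hit)

	e.invalidate() // a row mutation lands while the read is inflight

	fmt.Println("stale set accepted:", e.completeRead(token, "old rows")) // false

	_, _, token = e.beginRead() // re-query with a fresh token
	fmt.Println("fresh set accepted:", e.completeRead(token, "new rows")) // true
}
```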
Caveats / not disclosed¶
- No numbers: latency p50/p99, RPS, cache memory footprint, invalidation rate, hard-query invalidation fan-out.
- The "probabilistic filter" (cuckoo filter) is named once with a link but its role and false-positive trade-off isn't detailed.
- Rendezvous algorithm is described at the rule level, not as pseudocode — production implementation details omitted.
- Hard-query schema rule ((easy-expr) AND (hard-expr) normalization) is enforced going forward — the mechanism is not stated (CI lint? code review? schema tooling?).
- Migration mechanics (percentage rollout? feature flags? traffic shifting?) deferred to a conference talk link (Braden Walker, Systems@Scale) not ingested here.
- No discussion of how an invalidator shard crash / restart re-tails WAL without replaying.
- Re-sharding of invalidators named as a "future project" — not solved yet.
Future work called out¶
- Automatic re-sharding of invalidators.
- Resolving queries from non-Postgres sources in the cache.
- First-class support for server-side computation (e.g. permissions evaluation — crosses paths with systems/figma-permissions-dsl).
Source¶
- Figma Engineering — Keeping it 100(x) with real-time data at scale (2026-04-21)
- Raw: raw/figma/2026-04-21-keeping-it-100x-with-real-time-data-at-scale-01c2a64d.md
- Talk referenced: Braden Walker, Systems@Scale — https://www.youtube.com/watch?v=bnvF-IsQaUE (not ingested).
- Predecessor post: GraphQL, meet LiveGraph: a real-time data system at scale (not in raw/ yet).
- Team credit: Bereket Abraham, Braden Walker, Cynthia Vu, Deepan Saravanan, Elliot Lynde, Jon Emerson, Julian Li, Leslie Tu, Lin Xu, Matthew Chiang, Paul Langton, Tahmid Haque. Authors: Rudi Chen, Slava Kim.