Keeping It 100(x) With Real-time Data At Scale¶
Summary¶
Figma re-architected LiveGraph — its real-time, GraphQL-like data-fetching service — as a multi-tier system ("LiveGraph 100x") to absorb 100× growth in client sessions and database-update volume. The old design was a single tier of Go servers, each holding an in-memory mutation-based cache and tailing one globally-ordered Postgres replication stream; it worked at one-primary-Postgres scale but broke under vertical partitioning (no global order) and the approach of horizontal sharding (DBProxy era; see sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale). The new design separates three services — edge (client sessions + query expansion), cache (read-through, sharded by query hash), and invalidator (stateless, sharded like the database; tails the replication streams and generates invalidations) — and rests on two insights: most LiveGraph traffic is initial reads (live updates are a small tail), and most queries are easy to invalidate from row mutations alone, without knowing which queries are currently subscribed. Together these allowed an invalidation-based cache with a stateless invalidator: cleanly decoupled scaling axes, no more thundering herd on deploys, and native support for DB sharding.
Key takeaways¶
- Growth profile: client sessions tripled since 2021; view requests grew 5× in the last year alone. The database beneath LiveGraph shifted from a single Postgres to many vertical + horizontal shards — LiveGraph had to follow.
- The old architecture's five fundamental limits (explicitly enumerated in the post):
  - Excessive fan-out (mutations broadcast to every server).
  - Excessive fan-in (every server processes every shard's updates).
  - Tight coupling of reads and updates (the only scaling lever was a bigger fleet).
  - Fragmented caches (clients of the same view hit different servers; caches wiped on deploy → thundering herd on the DB).
  - Large blast radius from transient shard failures — the global-order assumption meant a single slow shard stalled all optimistic updates across the product, even for unrelated shards.
- Two key insights unlocked the new design (Source: sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale):
  - "LiveGraph traffic is driven by initial reads" — measured: most result bytes come from initial loads, not live updates, and live updates on active queries are infrequent → LiveGraph can afford invalidation-based re-query on change without overloading the DB.
  - "Most LiveGraph queries are easy to invalidate" — given a row mutation, the affected query shapes can be computed from the schema alone, without knowing which queries are actively subscribed → the invalidator can be stateless.
- New architecture (three services, all Go):
  - Edge — client-facing; expands view requests into queries; subscribes to the cache; re-fetches on invalidation; reconstructs loaded views for the client.
  - Cache — read-through; sharded by query hash; agnostic to DB topology; on invalidation, evicts and forwards the invalidation upstream, filtered through a cuckoo filter (a probabilistic, Bloom-filter-like set structure); hot replicas kept on standby; decoupling cache deploys from edge deploys eliminates the thundering herd.
  - Invalidator — the only service that knows DB topology; sharded the same way as the physical DBs; tails a single replication stream per shard (WAL-based logical replication, CDC); generates invalidations and sends them only to the relevant cache shards.
- Query shapes: each un-parameterized SQL query has a unique ID; a live query = (shape_id, arg_values). Given a row mutation's pre/post image, the invalidator substitutes column values into each schema query shape → emits invalidations for the affected parameterized queries. No active-query table needed.
- Easy vs. hard queries: equality predicates → "easy" (exactly one query invalidated per mutation); range predicates, greater-than, etc. → "hard" (potentially infinite affected queries). Figma's schema has ~700 query shapes; only 11 are hard. Hard queries are handled via nonce indirection: caches shard by hash(easy-expr), co-locating all hard queries that share an easy expression, and cache at two layers — a top-level {easy-expr} → nonce entry and {easy-expr}-{nonce}-{hard-expr} → results entries. Invalidation deletes the top-level nonce key → all hard queries with that easy expression are orphaned in one op. Edge re-queries only its session's active hard queries; TTL reaps orphans.
- Read-invalidation rendezvous: invalidations can never be skipped, and an invalidation arriving mid-read creates an ambiguity — was the read result produced before or after the invalidation? Three rules enforce eventual consistency:
  - Operations of the same type can coalesce (two concurrent reads join a single DB query; two invalidations dedupe).
  - Inflight reads interrupted by an invalidation are marked invalidated and wait for the existing cache-set to finish; new readers triggered by the invalidation must NOT coalesce onto the invalidated reader.
  - Inflight invalidations block in-progress reads from setting the cache (which would stamp stale results over a just-cleared entry).
  Validated by: (a) a chaos test (many threads, small key set, high concurrency), (b) online cache verification (random samples compared to the primary DB, reporting skipped-invalidation cases), (c) a convergence checker between the old and new engines for side-by-side validation during migration.
- Schema discipline — all queries must normalize to (easy-expr) AND (hard-expr). Hard expressions are not invalidated directly; caches shard by the easy-expr hash → all hard queries with the same easy-expr live on the same cache shard and are evicted together via the nonce trick. The cost: over-invalidation of hard queries — fine because active hard-query counts are small and invalidations are rare.
- Independent scaling axes (Source: sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale):
  - Invalidator scales with DB shard count + update rate.
  - Edge scales with client-session / view-request volume.
  - Cache scales with active-query count / hit-rate needs.
  Each axis can be adjusted without disproportionate fan-in/fan-out.
- Migration: targeting the old cache tier first (the least scalable piece). A convergence checker gated cutover — both old and new engines ran each query side by side, and traffic switched only after byte-exact matches. The authors note the old engine was often seconds slower than the new one, complicating convergence tuning.
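The nonce-indirection scheme for hard queries described above can be sketched in Go. This is an illustrative model only — `hardQueryCache`, its key format, and the in-process maps are hypothetical stand-ins for Figma's cache, not their implementation:

```go
package main

import "fmt"

// hardQueryCache models the two-layer scheme: a top-level {easy-expr} -> nonce
// entry, plus {easy-expr}-{nonce}-{hard-expr} -> results entries.
type hardQueryCache struct {
	nonces    map[string]int    // {easy-expr} -> current nonce
	results   map[string]string // "{easy-expr}-{nonce}-{hard-expr}" -> results
	nextNonce int
}

func newHardQueryCache() *hardQueryCache {
	return &hardQueryCache{nonces: map[string]int{}, results: map[string]string{}}
}

func (c *hardQueryCache) key(easy string, nonce int, hard string) string {
	return fmt.Sprintf("%s-%d-%s", easy, nonce, hard)
}

// Get returns a cached result only if it is still reachable via the
// current nonce for its easy expression.
func (c *hardQueryCache) Get(easy, hard string) (string, bool) {
	nonce, ok := c.nonces[easy]
	if !ok {
		return "", false
	}
	v, ok := c.results[c.key(easy, nonce, hard)]
	return v, ok
}

// Set stores a result under the current nonce, allocating one if needed.
func (c *hardQueryCache) Set(easy, hard, result string) {
	nonce, ok := c.nonces[easy]
	if !ok {
		c.nextNonce++
		nonce = c.nextNonce
		c.nonces[easy] = nonce
	}
	c.results[c.key(easy, nonce, hard)] = result
}

// Invalidate deletes only the top-level nonce key: every hard query sharing
// this easy expression is orphaned in one operation. In a real cache the
// orphaned result entries would be reaped by TTL.
func (c *hardQueryCache) Invalidate(easy string) {
	delete(c.nonces, easy)
}

func main() {
	c := newHardQueryCache()
	c.Set("file_id = 42", "updated_at > $1", "rows...")
	v, ok := c.Get("file_id = 42", "updated_at > $1")
	fmt.Println(v, ok) // rows... true

	c.Invalidate("file_id = 42") // one delete orphans all hard queries
	_, ok = c.Get("file_id = 42", "updated_at > $1")
	fmt.Println(ok) // false: edge must re-query
}
```

The design choice this illustrates: the invalidator never enumerates hard queries; a single top-level delete makes every dependent entry unreachable at once.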
Architecture diagrams¶
Old architecture (pre-100x)¶
Client ──WebSocket──▶ LiveGraph Server (N instances)
│ │
│ └─ In-memory mutation-based query cache
│ (fragmented across N servers; blown
│ away on deploy)
│
▼
Combined replication stream
(artificial global order across
vertical shards — stopgap)
│
▼
RDS Postgres (vertical shards;
approaching horizontal)
Every server tails the combined stream and applies every row mutation to its cache. Fan-in = Σ(all shards' updates) per server. Fan-out = every server receives every mutation.
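The fan-in cost can be made concrete with illustrative numbers — the shard count, update rate, and fleet size below are assumptions for the arithmetic, not figures from the post:

```go
package main

import "fmt"

// oldFanIn computes per-server and fleet-wide mutation-apply rates for the
// pre-100x design, where every server tails the combined stream.
func oldFanIn(shards, updatesPerShard, servers int) (perServer, fleetWide int) {
	perServer = shards * updatesPerShard // each server sees every shard's updates
	fleetWide = servers * perServer      // the whole fleet repeats that work
	return
}

func main() {
	// Illustrative numbers only — the post discloses no rates.
	perServer, fleetWide := oldFanIn(20, 1000, 100)
	fmt.Println("old design, per-server applies/sec:", perServer) // 20000
	fmt.Println("old design, fleet-wide applies/sec:", fleetWide) // 2000000
	// New design: each invalidator shard tails only its own DB shard's
	// stream, so total invalidator work equals the total update rate.
	fmt.Println("new design, total applies/sec:", 20*1000) // 20000
}
```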
New architecture (100x)¶
Client ──WebSocket──▶ Edge (many, session-affined)
│ ▲
queries │ invalidations (filtered via
│ │ cuckoo filter on subscribed set)
▼ │
Cache (sharded by hash(easy-expr))
│ ▲
query SQL │ invalidations
│ │ (only to the right cache shard)
▼ │
RDS Postgres shards Invalidator
│ (stateless;
WAL/ sharded like DB;
logical knows query shapes;
replication ────────▶ substitutes pre/post
image into shape → emits
invalidations for affected
query parameterizations)
Every message goes shard-to-shard, never fleet-broadcast. The invalidator is the only service with DB topology knowledge — edge and cache are DB-topology-agnostic.
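The invalidator's substitution step can be sketched as follows. `shape`, `RowImage`, the pipe-delimited key format, and the shard-routing function are all hypothetical illustrations of the idea (equality-predicate shapes only), not Figma's types:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// RowImage is a mutation's pre- or post-image: column name -> value.
type RowImage map[string]string

// shape is an "easy" query shape: a table plus its equality-predicate
// columns. E.g. {"files", ["team_id"]} stands for
// "SELECT ... FROM files WHERE team_id = $1".
type shape struct {
	table string
	cols  []string
}

// invalidationsFor substitutes a mutation's pre/post images into every
// schema shape on the mutated table, yielding the cache keys to invalidate.
// No table of active queries is consulted — this is what keeps the
// invalidator stateless.
func invalidationsFor(table string, pre, post RowImage, shapes []shape) []string {
	seen := map[string]bool{}
	var keys []string
	for _, s := range shapes {
		if s.table != table {
			continue
		}
		// The pre-image catches rows moving out of a result set,
		// the post-image rows moving in.
		for _, img := range []RowImage{pre, post} {
			if img == nil {
				continue // inserts have no pre-image; deletes no post-image
			}
			key := s.table
			for _, c := range s.cols {
				key += "|" + c + "=" + img[c]
			}
			if !seen[key] {
				seen[key] = true
				keys = append(keys, key)
			}
		}
	}
	return keys
}

// cacheShardFor routes an invalidation to exactly one cache shard, assuming
// caches are addressed by key hash mod shard count.
func cacheShardFor(key string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	shapes := []shape{{table: "files", cols: []string{"team_id"}}}
	// A file moves from team 1 to team 2: both teams' queries must refresh.
	pre := RowImage{"team_id": "1"}
	post := RowImage{"team_id": "2"}
	for _, k := range invalidationsFor("files", pre, post, shapes) {
		fmt.Printf("invalidate %q on cache shard %d\n", k, cacheShardFor(k, 8))
	}
}
```

Note how each emitted key targets a single cache shard — shard-to-shard messaging, never a fleet broadcast.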
Numbers disclosed¶
- Sessions tripled since 2021.
- View requests 5× in the last year.
- ~700 query shapes in the schema; only 11 are hard (≈1.6%).
- Migration timeframe: "over the last year and half."
- No latency / throughput / cost / RPS numbers.
- No cache-hit-rate numbers.
- No invalidation-rate numbers.
Architectural levers named by the post¶
- Initial-reads dominate live-updates → invalidation cache OK.
- Schema-local invalidation computability → stateless invalidator.
- (easy-expr) AND (hard-expr) schema normalization → nonce indirection for hard queries.
- Deploy cache separately from edge → no more thundering herd.
- Hot cache replicas on standby → deploys don't cold-start the cache.
- Cuckoo filters for invalidation forwarding → bounded fan-out even with many subscribers per shard.
- Invalidations may over-fire on unaffected queries, but an invalidation must never be lost → read-invalidation rendezvous semantics.
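The rendezvous rule that an inflight invalidation must veto a stale cache-set can be sketched with a per-key generation counter. This is a minimal single-key illustration under assumed names (`entry`, `beginRead`, etc.), not Figma's implementation, and it omits read coalescing:

```go
package main

import (
	"fmt"
	"sync"
)

// entry sketches one cache key's read/invalidation rendezvous.
type entry struct {
	mu    sync.Mutex
	gen   uint64 // bumped by every invalidation
	value string
	valid bool
}

// beginRead reports a cache hit, or hands back the generation token a DB
// fetch must present before it may populate the cache.
func (e *entry) beginRead() (value string, hit bool, token uint64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.value, e.valid, e.gen
}

// completeRead installs a fetched value only if no invalidation arrived
// while the read was inflight — a stale result must never overwrite a
// just-cleared entry.
func (e *entry) completeRead(token uint64, v string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if token != e.gen {
		return false // invalidated mid-read: discard, caller re-queries
	}
	e.value, e.valid = v, true
	return true
}

// invalidate clears the entry and vetoes any inflight read's cache-set.
func (e *entry) invalidate() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.gen++
	e.valid = false
}

func main() {
	var e entry
	_, hit, token := e.beginRead() // miss: go fetch from the DB
	fmt.Println("hit:", hit)

	e.invalidate() // a row mutation lands while the read is inflight

	fmt.Println("stale set accepted:", e.completeRead(token, "old rows")) // false

	_, _, token = e.beginRead() // re-query with a fresh token
	fmt.Println("fresh set accepted:", e.completeRead(token, "new rows")) // true
}
```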
Caveats / not disclosed¶
- No numbers: latency p50/p99, RPS, cache memory footprint, invalidation rate, hard-query invalidation fan-out.
- The "probabilistic filter" (cuckoo filter) is named once with a link but its role and false-positive trade-off isn't detailed.
- Rendezvous algorithm is described at the rule level, not as pseudocode — production implementation details omitted.
- Hard-query schema rule ((easy-expr) AND (hard-expr) normalization) is enforced going forward — the mechanism is not stated (CI lint? code review? schema tooling?).
- Migration mechanics (percentage rollout? feature flags? traffic shifting?) deferred to a conference talk link (Braden Walker, Systems@Scale) not ingested here.
- No discussion of how an invalidator shard crash / restart re-tails WAL without replaying.
- Re-sharding of invalidators named as a "future project" — not solved yet.
Future work called out¶
- Automatic re-sharding of invalidators.
- Resolving queries from non-Postgres sources in the cache.
- First-class support for server-side computation (e.g. permissions evaluation — crosses paths with systems/figma-permissions-dsl).
Source¶
- Figma Engineering — Keeping it 100(x) with real-time data at scale (2026-04-21)
- Raw: raw/figma/2026-04-21-keeping-it-100x-with-real-time-data-at-scale-01c2a64d.md
- Talk referenced: Braden Walker, Systems@Scale — https://www.youtube.com/watch?v=bnvF-IsQaUE (not ingested).
- Predecessor post: GraphQL, meet LiveGraph: a real-time data system at scale (not in raw/ yet).
- Team credit: Bereket Abraham, Braden Walker, Cynthia Vu, Deepan Saravanan, Elliot Lynde, Jon Emerson, Julian Li, Leslie Tu, Lin Xu, Matthew Chiang, Paul Langton, Tahmid Haque. Authors: Rudi Chen, Slava Kim.