FIGMA 2026-04-21 Tier 3

Keeping It 100(x) With Real-time Data At Scale

Summary

Figma re-architected LiveGraph — its real-time GraphQL-like data-fetching service — as a multi-tier system ("LiveGraph 100x") to absorb 100× growth in client sessions and database-update volume. The old design was a single Go service whose instances each held an in-memory mutation-based cache and tailed a single globally-ordered Postgres replication stream; it worked at one-primary-Postgres scale but broke under vertical partitioning (no global order) and approaching horizontal sharding (DBProxy-era, see sources/2026-04-21-figma-how-figmas-databases-team-lived-to-tell-the-scale). The new design separates three services — edge (client session + query expansion), cache (read-through, sharded by query hash), and invalidator (stateless, sharded like the database, tails the replication stream and generates invalidations) — connected by two insights: most LiveGraph traffic is initial reads (live updates are a small tail), and most queries are easy to invalidate from row mutations without knowing which queries are currently subscribed. Together these allowed an invalidation-based cache with a stateless invalidator — cleanly decoupled scaling axes, no more thundering herd on deploys, and native support for DB sharding.

Key takeaways

  1. Growth profile: client sessions have tripled since 2021; view requests grew 5× in the last year alone. The database beneath LiveGraph shifted from a single Postgres to many vertical and horizontal shards — LiveGraph had to follow.

  2. The old architecture's five fundamental limits (explicitly enumerated in the post):

     • Excessive fan-out (mutations broadcast to every server).
     • Excessive fan-in (every server processes every shard's updates).
     • Tight coupling of reads and updates (the only scaling lever is a bigger fleet).
     • Fragmented caches (clients of the same view hit different servers; caches are wiped on deploy → thundering herd on the DB).
     • Large blast radius from transient shard failures — the global-order assumption meant a single slow shard stalled all optimistic updates across the product, even for unrelated shards.

  3. Two key insights that unlocked the new design (Source: sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale):

     • "LiveGraph traffic is driven by initial reads" — measured: most result bytes come from initial loads, not live updates, and live updates on active queries are infrequent. → LiveGraph can afford invalidation-based re-querying on change without overloading the DB.
     • "Most LiveGraph queries are easy to invalidate" — given a row mutation, the affected query shapes can be computed from the schema alone, without knowing which queries are actively subscribed. → the invalidator can be stateless.

  4. New architecture (three services, all Go):

     • Edge — client-facing; expands view requests into queries; subscribes to the cache; re-fetches on invalidation; reconstructs loaded views for the client.
     • Cache — read-through; sharded by query hash; agnostic to DB topology; on invalidation, evicts and forwards the invalidation upstream, filtered through a cuckoo filter (a probabilistic, Bloom-filter-like structure); hot replicas are kept on standby; decoupling cache deploys from edge deploys eliminates the thundering herd.
     • Invalidator — the only service that knows DB topology; sharded the same way as the physical DBs; tails a single replication stream per shard (WAL-based logical replication, CDC); generates invalidations and sends them to the relevant cache shards only.

  5. Query shapes: each un-parameterized SQL query has a unique ID; a live query = (shape_id, arg_values). Given a row mutation's pre/post image, the invalidator substitutes column values into each schema query shape → it emits invalidations for the affected parameterized queries. No active-query table is needed.
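The shape-substitution step can be sketched as follows. This is a minimal sketch, not Figma's implementation — the shape `files_by_project`, the key format, and all type and field names are invented for illustration:

```go
package main

import "fmt"

// Shape is a query shape: an un-parameterized query plus the row
// columns that feed its parameters.
type Shape struct {
	ID    string   // e.g. "files_by_project"
	Table string   // table the shape reads
	Cols  []string // columns whose values parameterize the shape
}

// Mutation carries the row images before and after a change
// (nil pre-image = insert, nil post-image = delete).
type Mutation struct {
	Table     string
	Pre, Post map[string]string
}

// invalidations derives the (shape_id, args) keys affected by a
// mutation from the schema alone — no knowledge of live
// subscriptions, which is what lets the invalidator stay stateless.
func invalidations(shapes []Shape, m Mutation) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range shapes {
		if s.Table != m.Table {
			continue
		}
		// Substitute both images: a moved row invalidates the
		// queries for its old AND new parameter values.
		for _, img := range []map[string]string{m.Pre, m.Post} {
			if img == nil {
				continue
			}
			key := s.ID
			for _, c := range s.Cols {
				key += "/" + img[c]
			}
			if !seen[key] {
				seen[key] = true
				out = append(out, key)
			}
		}
	}
	return out
}

func main() {
	shapes := []Shape{{ID: "files_by_project", Table: "files", Cols: []string{"project_id"}}}
	// A file moves between projects: both project queries are stale.
	m := Mutation{
		Table: "files",
		Pre:   map[string]string{"id": "f1", "project_id": "p1"},
		Post:  map[string]string{"id": "f1", "project_id": "p2"},
	}
	fmt.Println(invalidations(shapes, m))
	// An update that keeps project_id unchanged dedupes to one key.
}
```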

  6. Easy vs hard queries: equality predicates → "easy" (exactly one query invalidated per mutation). Range predicates (greater-than, etc.) → "hard" (potentially infinite affected queries). Figma's schema has ~700 query shapes; only 11 are hard. Hard queries are handled via nonce indirection: the cache shards by hash(easy-expr), co-locating all hard queries that share an easy expression, and caches at two layers — a top-level {easy-expr}→nonce entry and {easy-expr}-{nonce}-{hard-expr}→results entries. Invalidation deletes the top-level nonce key → all hard queries under that easy expression are orphaned in one operation. Edge re-queries only its session's active hard queries; a TTL reaps orphans.
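A toy version of the two-layer nonce scheme, assuming invented key formats and helper names (the real cache is a sharded network service, not an in-process map):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// NonceCache illustrates nonce indirection for hard queries.
type NonceCache struct {
	nextNonce int
	nonces    map[string]int    // {easy-expr} -> current nonce
	results   map[string]string // {easy-expr}/{nonce}/{hard-expr} -> result
}

func NewNonceCache() *NonceCache {
	return &NonceCache{nonces: map[string]int{}, results: map[string]string{}}
}

// shardFor shows why hard queries sharing an easy expression
// co-locate: the shard is chosen from the easy expression alone.
func shardFor(easy string, shards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(easy))
	return h.Sum32() % shards
}

// key builds the second-layer key through the current nonce,
// allocating a fresh nonce if the top-level entry is missing.
func (c *NonceCache) key(easy, hard string) string {
	n, ok := c.nonces[easy]
	if !ok {
		c.nextNonce++
		n = c.nextNonce
		c.nonces[easy] = n
	}
	return fmt.Sprintf("%s/%d/%s", easy, n, hard)
}

func (c *NonceCache) Set(easy, hard, result string) { c.results[c.key(easy, hard)] = result }

func (c *NonceCache) Get(easy, hard string) (string, bool) {
	r, ok := c.results[c.key(easy, hard)]
	return r, ok
}

// Invalidate deletes only the top-level nonce: every hard query under
// this easy expression is orphaned in one operation. The real system
// reaps orphans by TTL; here they just linger.
func (c *NonceCache) Invalidate(easy string) { delete(c.nonces, easy) }

func main() {
	c := NewNonceCache()
	c.Set("project_id=p1", "updated_at>t0", "rows-v1")
	c.Set("project_id=p1", "updated_at>t9", "rows-v2")

	_, hit := c.Get("project_id=p1", "updated_at>t0")
	fmt.Println("before invalidation, hit:", hit) // true

	c.Invalidate("project_id=p1") // one delete orphans both hard queries
	_, hit = c.Get("project_id=p1", "updated_at>t0")
	fmt.Println("after invalidation, hit:", hit) // false
}
```

Deleting one key per mutation is what keeps "hard" invalidation O(1), at the cost of over-invalidating every hard query under that easy expression.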

  7. Read-invalidation rendezvous: invalidations cannot be skipped, and an invalidation arriving mid-read creates ambiguity — is the read result pre- or post-invalidation? Three rules enforce eventual consistency:

     • Operations of the same type can coalesce (two concurrent reads join a single DB query; two invalidations dedupe).
     • An inflight read interrupted by an invalidation is marked invalidated and waits for any existing cache-set to finish; new readers triggered by the invalidation must NOT coalesce onto the invalidated reader.
     • An inflight invalidation blocks in-progress reads from setting the cache (they would stamp stale results over a just-cleared entry).

Validated by: (a) a chaos test (many threads, a small key set, high concurrency), (b) online cache verification (random samples compared to the primary DB, reporting skipped-invalidation cases), and (c) a convergence checker between the old and new engines for side-by-side validation during migration.
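The three rules can be sketched as a single-threaded state machine for one cache key. This is an illustrative reduction with invented names — the production version is concurrent and, per the post, validated by chaos testing:

```go
package main

import "fmt"

// entry tracks read/invalidation rendezvous state for one cache key.
type entry struct {
	value       string
	valid       bool
	inflight    bool // a DB read is in progress
	invalidated bool // an invalidation arrived during that read
	waiters     int  // readers coalesced onto the inflight read
}

// startRead: rule 1 — concurrent reads of the same key coalesce onto
// one DB query, but never onto a read already marked invalidated
// (rule 2), which must not serve post-invalidation readers.
// Returns true if this reader coalesced, false if it issued a fresh read.
func (e *entry) startRead() bool {
	if e.inflight && !e.invalidated {
		e.waiters++
		return true
	}
	e.inflight, e.invalidated, e.waiters = true, false, 1
	return false
}

// invalidate evicts the entry and marks any inflight read so its
// result is treated as possibly stale.
func (e *entry) invalidate() {
	e.valid = false
	if e.inflight {
		e.invalidated = true
	}
}

// finishRead: rule 3 — a read overlapped by an invalidation must not
// set the cache (it could stamp stale bytes over a just-cleared
// entry). Returns whether the result was cached.
func (e *entry) finishRead(result string) bool {
	stale := e.invalidated
	e.inflight, e.invalidated, e.waiters = false, false, 0
	if stale {
		return false
	}
	e.value, e.valid = result, true
	return true
}

func main() {
	var e entry
	e.startRead()                      // reader A issues the DB query
	fmt.Println(e.startRead())         // rule 1: reader B coalesces -> true
	e.invalidate()                     // invalidation lands mid-read
	fmt.Println(e.finishRead("stale")) // rule 3: result not cached -> false
	fmt.Println(e.startRead())         // rule 2: fresh read, no coalescing -> false
	fmt.Println(e.finishRead("fresh")) // clean read caches -> true
}
```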

  8. Schema discipline — all queries must normalize to (easy-expr) AND (hard-expr). Hard expressions are not invalidated directly; caches shard by easy-expr hash → all hard queries with the same easy-expr live on the same cache shard and are evicted together via the nonce trick. The cost is over-invalidation of hard queries — acceptable because active hard-query counts are small and invalidations are rare.

  9. Independent scaling axes (Source: sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale):

     • Invalidator scales with DB shard count + update rate.
     • Edge scales with client-session / view-request volume.
     • Cache scales with active-query count / hit-rate needs.
     • Each axis is adjusted without disproportionate fan-in/fan-out.

  10. Migration: targeting the old cache tier first (the least scalable piece). A convergence checker gated cutover — both engines serve each query side by side, and queries switch only after a byte-exact match. The authors note the old engine was often seconds slower than the new one, complicating convergence tuning.

Architecture diagrams

Old architecture (pre-100x)

Client ──WebSocket──▶ LiveGraph Server (N instances)
                         │   │
                         │   └─ In-memory mutation-based query cache
                         │        (fragmented across N servers; blown
                         │         away on deploy)
                         ▲
                   Combined replication stream
                   (artificial global order across
                    vertical shards — stopgap)
                         ▲
                   RDS Postgres (vertical shards;
                                 approaching horizontal)

Every server tails the combined stream and applies every row mutation to its cache. Fan-in = Σ(all shards' updates) per server. Fan-out = every server receives every mutation.

New architecture (100x)

Client ──WebSocket──▶ Edge (many, session-affined)
                         │     ▲
                     queries   │ invalidations (filtered via
                         │     │ cuckoo filter on subscribed set)
                         ▼     │
                      Cache (sharded by hash(easy-expr))
                         │     ▲
                     query SQL │ invalidations
                         │     │ (only to the right cache shard)
                         ▼     │
                   RDS Postgres shards     Invalidator
                         │                  (stateless;
                     WAL/                    sharded like DB;
                     logical                 knows query shapes;
                     replication ────────▶   substitutes pre/post
                                              image into shape → emits
                                              invalidations for affected
                                              query parameterizations)

Every message goes shard-to-shard, never fleet-broadcast. The invalidator is the only service with DB topology knowledge — edge and cache are DB-topology-agnostic.

Numbers disclosed

  • Sessions tripled since 2021.
  • View requests 5× in the last year.
  • ~700 query shapes in the schema; only 11 are hard (≈1.6%).
  • Migration timeframe: "over the last year and half."
  • No latency / throughput / cost / RPS numbers.
  • No cache-hit-rate numbers.
  • No invalidation-rate numbers.

Architectural levers named by the post

  • Initial-reads dominate live-updates → invalidation cache OK.
  • Schema-local invalidation computability → stateless invalidator.
  • (easy-expr) AND (hard-expr) schema normalization → nonce indirection for hard queries.
  • Deploy cache separately from edge → no more thundering herd.
  • Hot cache replicas on standby → deploys don't cold-start the cache.
  • Cuckoo filters for invalidation forwarding → bounded fan-out even with many subscribers per shard.
  • Invalidations may over-fire on queries, but must never be lost → read-invalidation rendezvous semantics.
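The filtered-forwarding lever can be sketched as below. The post names cuckoo filters; this sketch substitutes a simple Bloom filter, which makes the same forwarding decision but lacks the deletion support a cuckoo filter adds (useful when an edge unsubscribes queries). Sizes, hash counts, and key names are illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter over string keys.
type bloom struct {
	bits []bool
	k    int // number of hash probes per key
}

func newBloom(m, k int) *bloom { return &bloom{bits: make([]bool, m), k: k} }

// idx derives the i-th probe position by salting the key with i.
func (b *bloom) idx(key string, i int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, key)
	return int(h.Sum64() % uint64(len(b.bits)))
}

func (b *bloom) Add(key string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.idx(key, i)] = true
	}
}

// MayContain can return a false positive (a harmless extra re-fetch)
// but never a false negative — a lost invalidation would break
// eventual consistency.
func (b *bloom) MayContain(key string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.idx(key, i)] {
			return false
		}
	}
	return true
}

func main() {
	// Each edge registers its subscribed query keys with the cache.
	edgeA, edgeB := newBloom(1024, 3), newBloom(1024, 3)
	edgeA.Add("files_by_project/p1")
	edgeB.Add("files_by_project/p2")

	// On invalidation, the cache forwards only to edges whose filter
	// matches — bounded fan-out instead of fleet broadcast.
	inv := "files_by_project/p1"
	fmt.Println("forward to A:", edgeA.MayContain(inv)) // always true: no false negatives
	fmt.Println("forward to B:", edgeB.MayContain(inv))
}
```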

Caveats / not disclosed

  • No numbers: latency p50/p99, RPS, cache memory footprint, invalidation rate, hard-query invalidation fan-out.
  • The "probabilistic filter" (cuckoo filter) is named once with a link, but its role and false-positive trade-off aren't detailed.
  • Rendezvous algorithm is described at the rule level, not as pseudocode — production implementation details omitted.
  • Hard-query schema rule ((easy-expr) AND (hard-expr) normalization) is enforced going forward — mechanism not stated (CI lint? code review? schema tooling?).
  • Migration mechanics (percentage rollout? feature flags? traffic shifting?) deferred to a conference talk link (Braden Walker, Systems@Scale) not ingested here.
  • No discussion of how an invalidator shard crash / restart re-tails WAL without replaying.
  • Re-sharding of invalidators named as a "future project" — not solved yet.

Future work called out

  • Automatic re-sharding of invalidators.
  • Resolving queries from non-Postgres sources in the cache.
  • First-class support for server-side computation (e.g. permissions evaluation — crosses paths with systems/figma-permissions-dsl).
