FigCache — Figma's Next-Generation Data Caching Platform

Summary

Figma's Storage Products team built FigCache — a stateless, RESP-wire-protocol proxy service sitting in front of AWS ElastiCache Redis clusters — plus a suite of first-party client libraries in Go, Ruby, and TypeScript. Rolled out through H2 2025 for Figma's main API service, the caching layer reached six-nines uptime post-cutover. The core win is decoupling Redis connection volume from client fleet size: FigCache is a connection multiplexer that isolates Redis from thundering herds of new connections when client services scale up rapidly. Secondary wins: language-agnostic, command-level observability (availability, throughput, latency, payload size, command cardinality, connection distribution) slicing operational metrics across hundreds of unique workloads by tier / durability / consistency; topology-change and failover events downgraded from high-severity incidents to zero-downtime background operations; and a pluggable engine-tree command pipeline, configured in Starlark, that lets operators express complex runtime behaviors (command routing, key-prefix filtering, multi-cluster dispatch, fanout cross-shard pipelines) in config rather than in server binaries. ResPC — the in-house Go library that parses raw RESP byte streams into semantically rich structured commands via a schema registry — is the entry point to the proxy's processing engine and the extensibility substrate the article credits for avoiding the maintenance burden of forking an open-source proxy.

Key takeaways

  1. Connection multiplexing is the architectural center. At Figma scale, Redis Cluster connection limits + fleet-elasticity-driven thundering herds became a critical-path site-availability risk. FigCache multiplexes orders-of-magnitude more client connections onto a finite set of outbound Redis connections; post-rollout, "connection counts on Redis clusters dropped by an order of magnitude across the board, and became significantly less volatile despite an unchanged, diurnal site traffic pattern." This is the canonical shape of a connection-multiplexing proxy tier.
  2. Protocol-compatible drop-in proxy. FigCache speaks RESP (Redis's native wire protocol), so migrating a client is a one-line endpoint configuration change. A "Redis Cluster mode emulation layer" exposes the proxy as a fake cluster to cluster-aware clients, transparently handling the fragmented permutations of TLS, cluster-awareness, and connection parameters across client libraries. Canonical instance of patterns/protocol-compatible-drop-in-proxy.
  3. Build-vs-buy: custom proxy won on protocol extensibility. Existing OSS Redis proxies shipped "rudimentary RPC servers" unable to extract full annotated arguments from arbitrary inbound commands — blocking generic, command-semantics-aware guardrails and custom commands. Examples cited: language-agnostic multi-cluster distributed locking over Redlock; protocol-native graceful connection draining; priority-aware QoS backpressure; inline data encryption / compression; multi-upstream traffic mirroring; customizable command usage restrictions. Extending an OSS proxy meant "maintenance of a source code fork that would be difficult to keep in sync with upstream."
  4. ResPC: schema-driven RESP RPC framework. Four components: a server layer (connection accept + in-memory client state + network I/O), a streaming RESP protocol parser (incremental parsing / serialization), a schema-driven structured command parser (a schema registry declaratively expresses supported command sequences with annotated arguments — the enabler for command-semantics-aware processing), and an implementation-agnostic command dispatch layer. RESP + RPC = ResPC. Canonical example of making a wire protocol into a structured-RPC surface.
  5. Engine tree: pluggable command pipeline configured in Starlark. Backend is a dynamically-assembled tree of engine nodes; each node accepts a structured ResPC command and produces a structured ResPC reply. Leaves are data engines (execute against Redis); intermediate nodes are filter engines (route / block / modify commands inline). Processing = evaluating the tree from the root. The tree is expressed entirely in configuration (a Starlark program, evaluated at init-time in a VM, rendering a Protobuf config definition); operators compose Router + Redis + Static primitives to express command-type splitting, key-prefix-based routing / rejection, and multi-cluster dispatch without server-binary redeploys. Canonical instance of patterns/starlark-configuration-dsl.
  6. Fanout engine for cross-shard read-only pipelines. Redis cluster-mode returns CROSSSLOT on multi-key operations spanning different hash slots. FigCache's fanout filter engine intercepts eligible multi-shard read-only pipelines and transparently resolves them as a parallelized scatter-gather — client sees normal pipeline semantics, FigCache dispatches per-shard commands in parallel and aggregates the responses. Hides a first-class Redis-Cluster-protocol limitation behind the proxy.
  7. Three-phase delivery + zonal traffic colocation. Phase 1: migrate all services to first-party client wrappers (interface-compatible with existing OSS clients — minimal application lift) without otherwise changing endpoints. Phase 2: build and productionize the proxy. Phase 3: gradual reversible migration of applications to the proxy with carefully-sequenced rollout (per-service + per-workload domain + feature-flag-gated for runtime reversibility). Latency mitigations: routing config probabilistically prefers zonal traffic colocation across client → FigCache LBs → FigCache instances (cross-AZ hop is "as much as a few milliseconds"); weekly distributed production stress test at >10× peak; per-PR CI CPU/mem profiles + synthetic benchmarks vs a golden baseline.
  8. Observability as platform guarantee, not application concern. Post-rollout: full-stack metrics/logs/traces on every Redis command; incident-diagnosis time went from "hours or days to minutes." FigCache's routing-layer traffic classification engine ascribes ownership metadata to every inbound command — slicing operational metrics across hundreds of unique workloads by tier / durability / consistency / etc. Enabled a formal caching-platform SLO. Standardizing as a universal Redis access tier also let Figma downgrade ElastiCache operational events (node failovers, scaling, transient connectivity errors) to zero-downtime background ops; "shard failovers now require zero operator intervention, and are executed liberally and frequently across our entire Redis footprint — partially to serve as a regular, production-environment live exercise of the system's built-in resiliency."

Numbers

  • Six nines uptime (99.9999%) on Figma's caching layer since H2 2025 rollout for the main API service.
  • Order-of-magnitude reduction in Redis cluster connection counts post-rollout for 100% of main-API Redis traffic; significantly reduced volatility.
  • Weekly production stress test at ≥10× typical organic peak throughput.
  • Cross-AZ latency penalty: up to a few milliseconds per hop.
  • Hundreds of unique application workloads sliced in operational metrics.

Architectural mechanics

  • RESP (REdis Serialization Protocol) is the text-framed, binary-safe wire format between Redis clients and servers. FigCache speaks RESP unmodified to client applications. ResPC parses the byte stream incrementally; a schema registry drives structured argument extraction; the command is then routed through the engine tree.
  • Engine-tree primitives illustrated in the post (Starlark): Router with match rules (command schema match, key prefix match, passthrough); Redis (execute against a specific cluster); Static (return a hard-coded reply — used for rejection messages). Composition expresses command-type splitting + key-prefix-based routing + per-cluster dispatch + rejection in pure config.
  • Frontend / backend split: frontend = client interaction (RESP RPC, network I/O, connection management, protocol-aware command parsing). Backend = command processing / manipulation + connection multiplexing to storage backends + physical command execution.
  • Routing-level zonal colocation is the latency control: routing config is tuned to prefer zonal colocation across client → FigCache LB → FigCache instance; this is the patterns/zone-affinity-routing shape applied to a caching proxy tier. "Probabilistic" preference (not hard affinity) so the penalty is predictable rather than creating local brownouts.
  • Figma Redis footprint shape: many clusters with varying degrees of isolation requirements, durability expectations, criticality characteristics, and traffic volumes. Centralized routing mediates cluster partitioning so application-side complexity of configuring multiple endpoints disappears.

Caveats

  • Latency overhead is implicitly accepted, not quantified. The post acknowledges "introducing a proxy tier necessarily adds latency due to additional critical-path network hops and layers of I/O" and lists its mitigations (zonal traffic colocation, a weekly production stress test, and per-PR CI CPU/mem profiling with synthetic benchmarks against a golden baseline). But no end-to-end p50 / p99 / p999 numbers are disclosed, so the absolute cost of the proxy hop is publicly unquantified.
  • Connection-count reduction is the only reliability metric given. Six-nines uptime is claimed as the post-rollout outcome but not attributed to a specific mechanism with its own numbers.
  • Build-vs-buy evaluation lists concrete OSS limitations but doesn't name the OSS proxies evaluated (plausible candidates: Envoy's Redis proxy filter, Twemproxy, Dynomite, redis-cluster-proxy).
  • No throughput / QPS / payload-size distribution disclosed.
  • No cost breakdown — FigCache fleet size, per-instance connection fan-in ratio, cost vs OSS-proxy estimate, etc.
  • Pluggability claims for "alternative backend storage systems" and features like priority-aware QoS backpressure / inline encryption / multi-upstream mirroring are forward-looking — the post frames them as "forward-looking optionality" the architecture enables, not production features.
  • No SRE-model / on-call-handoff detail for the new universal access tier; "fleet-wide guarantees about client-side state correctness during failovers" claim is asserted rather than dissected.
  • ResPC is described as an in-house Go library; no mention of open-sourcing.
