
FigCache

FigCache is Figma's in-house stateless, horizontally scalable, RESP-wire-protocol proxy service that sits between Figma's applications and a fleet of AWS ElastiCache Redis clusters. It acts as a unified Redis data plane: clients see a Redis-compatible endpoint while FigCache routes commands to the right upstream cluster, multiplexes client connections onto a much smaller pool of outbound Redis connections, and layers observability, guardrails, and cluster-topology handling beneath thick client libraries. Rolled out in H2 2025 for Figma's main API service; post-rollout, Figma's caching layer hit six-nines uptime and Redis cluster connection counts dropped by an order of magnitude.

Why it exists

Before FigCache, Figma's Redis footprint showed several structural failure modes at scale:

  • Connection volume approaching Redis hard limits on critical clusters.
  • Thundering-herd connection establishment whenever client services scaled out quickly — bottlenecking I/O, degrading availability.
  • Fragmented client ecosystem — inconsistent observability across Go / Ruby / TypeScript libraries; inconsistent Redis Cluster awareness / TLS config across apps; no fleet-wide guarantee of client-side state correctness during failovers or topology changes.
  • No centralized traffic management — applications could pollute / corrupt data across clusters; cluster-partitioning decisions were duplicated in every application.

Localized remedies (removing Redis dependencies from some API paths; building a bespoke client-side connection pool) isolated outages but didn't close the structural gap. FigCache is the strategic response.

Architecture

Frontend / backend split

  • Frontend layer — client interaction: RESP-based RPC, network I/O, connection management, protocol-aware structured command parsing. Implemented via ResPC (systems/respc).
  • Backend layer — command processing and manipulation, connection multiplexing to storage backends, physical command execution.
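The "protocol-aware structured command parsing" in the frontend layer means decoding RESP frames into structured commands before any routing happens. A minimal sketch of RESP array parsing (function name and shape are illustrative, not FigCache's actual ResPC API):

```python
# Minimal sketch of parsing a RESP (Redis Serialization Protocol) command:
# a client command arrives as an array of bulk strings on the wire.
def parse_resp_array(data: bytes) -> list[bytes]:
    """Parse one RESP array of bulk strings, e.g. an inbound client command."""
    lines = data.split(b"\r\n")
    assert lines[0].startswith(b"*"), "expected RESP array header"
    n = int(lines[0][1:])          # number of arguments
    args, i = [], 1
    for _ in range(n):
        assert lines[i].startswith(b"$"), "expected bulk-string length header"
        length = int(lines[i][1:])
        arg = lines[i + 1]
        assert len(arg) == length, "bulk string length mismatch"
        args.append(arg)
        i += 2
    return args

# "SET mykey hello" as it appears on the wire:
wire = b"*3\r\n$3\r\nSET\r\n$5\r\nmykey\r\n$5\r\nhello\r\n"
print(parse_resp_array(wire))  # [b'SET', b'mykey', b'hello']
```

Once commands exist as structured argument lists rather than opaque byte streams, the backend can apply per-command routing, guardrails, and metrics.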

Engine tree (backend)

Commands execute through a dynamically-assembled tree of engine nodes:

  • Data engines (leaves) execute commands against Redis.
  • Filter engines (intermediate nodes) route / block / modify commands inline before passing execution to child engines.
  • Processing a command = evaluating this directed graph from the root, with the structured ResPC command as input and a structured ResPC reply as output.

The entire engine tree is expressed in configuration and assembled at runtime during server init. See patterns/starlark-configuration-dsl for the configuration substrate.

Illustrated primitives (from the blog):

  • Router(rules=[Rule(command|prefix, engine)]) — split execution among child engines by command-schema match or key-pattern match.
  • Redis(...) — execute against a specific upstream cluster.
  • Static(reply=...) — return a hard-coded reply (e.g. rejection).

Composition expresses command-type splitting + key-prefix-based routing / rejection + multi-cluster dispatch in pure config, without server-binary redeploys.
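A hypothetical sketch of how these primitives might compose, with Python standing in for the Starlark-like config language (the class names follow the blog's primitives, but the shapes and fields here are illustrative, not FigCache's actual DSL):

```python
from dataclasses import dataclass

@dataclass
class Redis:                      # leaf: execute against one upstream cluster
    cluster: str
    def execute(self, cmd):
        return f"-> {self.cluster}: {cmd}"

@dataclass
class Static:                     # leaf: hard-coded reply (e.g. a rejection)
    reply: str
    def execute(self, cmd):
        return self.reply

@dataclass
class Rule:                       # key-prefix match -> child engine
    prefix: str
    engine: object

@dataclass
class Router:                     # filter node: split execution among children
    rules: list
    default: object
    def execute(self, cmd):
        key = cmd[1] if len(cmd) > 1 else ""
        for rule in self.rules:
            if key.startswith(rule.prefix):
                return rule.engine.execute(cmd)
        return self.default.execute(cmd)

# Key-prefix routing + rejection + multi-cluster dispatch in pure config:
root = Router(
    rules=[
        Rule("session:", Redis(cluster="sessions")),
        Rule("deprecated:", Static(reply="-ERR key namespace retired")),
    ],
    default=Redis(cluster="general"),
)

print(root.execute(["GET", "session:abc"]))
```

Evaluating `root.execute(...)` walks the tree from the root exactly as the document describes: filter engines route inline, data engines (the leaves) hit Redis. Changing the routing is a config edit, not a binary redeploy.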

Fanout filter engine

Redis Cluster mode rejects multi-key pipelines / transactions spanning different hash slots with CROSSSLOT. FigCache's fanout filter engine intercepts eligible read-only multi-shard pipelines and transparently resolves them as a parallelized scatter-gather — dispatching individual per-shard commands, aggregating responses, returning a normal pipeline reply to the client. Hides a first-class Redis-Cluster protocol limitation behind the proxy.
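The scatter-gather idea can be sketched as follows; the slot and shard functions are simplified stand-ins for Redis Cluster's CRC16-based hash-slot mapping, and none of these names come from FigCache itself:

```python
from concurrent.futures import ThreadPoolExecutor

def slot_of(key: str, num_shards: int = 3) -> int:
    return hash(key) % num_shards      # stand-in for CRC16(key) % 16384

def shard_mget(shard: int, keys: list) -> dict:
    # Placeholder for a real per-shard MGET against that shard's primary.
    return {k: f"value({k})@shard{shard}" for k in keys}

def fanout_mget(keys: list) -> list:
    # 1) Group keys by owning shard.
    groups: dict[int, list] = {}
    for k in keys:
        groups.setdefault(slot_of(k), []).append(k)
    # 2) Dispatch per-shard commands in parallel, 3) aggregate replies.
    merged: dict = {}
    with ThreadPoolExecutor() as pool:
        for result in pool.map(lambda sk: shard_mget(*sk), groups.items()):
            merged.update(result)
    # 4) Return one normal reply in the client's original key order.
    return [merged[k] for k in keys]

print(fanout_mget(["a", "b", "c"]))
```

The client sees a single pipeline reply; the CROSSSLOT restriction never surfaces.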

Cluster-mode emulation

A Redis Cluster emulation shim in the RPC layer exposes FigCache as a fake cluster to cluster-aware clients. Handles the fragmented permutations of cluster-aware / non-aware clients with different TLS and connection-param shapes — making drop-in migration trivial (one-line endpoint change in the simplest case).
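One way such a shim can work, sketched under assumption (this is illustrative, not FigCache's code): when a cluster-aware client probes topology, answer as a single-node "cluster" whose only node is the proxy itself, so every hash slot routes back through the proxy.

```python
PROXY_HOST, PROXY_PORT = "figcache.internal", 6379   # hypothetical endpoint

def handle_command(args: list):
    if [a.upper() for a in args[:2]] == ["CLUSTER", "SLOTS"]:
        # Claim the entire slot range 0..16383, owned by the proxy "node",
        # so cluster-aware clients send all traffic here.
        return [[0, 16383, [PROXY_HOST, PROXY_PORT, "proxy-node-id"]]]
    if args[0].upper() == "CLUSTER":
        return "-ERR unsupported CLUSTER subcommand"
    return forward_to_backend(args)     # normal data-path handling

def forward_to_backend(args):
    return f"(forwarded {args[0]})"     # placeholder for engine-tree execution

print(handle_command(["CLUSTER", "SLOTS"]))
```

Cluster-aware and non-aware clients then behave identically from the proxy's perspective, which is what makes the one-line endpoint change sufficient in the simplest migrations.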

Stateless + horizontally scalable

FigCache itself holds no authoritative state — all data is in the backing Redis clusters. The fleet scales horizontally with client load; connection pooling to upstream Redis isolates Redis from fleet churn.

Key capabilities the architecture enables

  • Connection multiplexing — orders-of-magnitude more client connections fan into a finite pool of outbound connections; Redis sees stable connection counts independent of client-fleet size (concepts/connection-multiplexing).
  • Centralized traffic routing via the engine tree — one source of truth for cluster-partitioning, key-prefix → cluster mapping, rejection policies.
  • Language-agnostic, command-semantics-aware observability — availability / throughput / latency / payload size / command cardinality / connection-distribution metrics on every Redis command. Routing-layer traffic classification ascribes ownership metadata (tier / durability / consistency) to every inbound command → hundreds of workloads sliced independently.
  • Custom commands and protocol extensions — the schema-driven command parser lets FigCache implement commands that would otherwise be duplicated in every client library. Article cites a language-agnostic multi-cluster distributed locking abstraction over Redlock, and a protocol-native graceful connection draining mechanism for continuous deployments.
  • Operational event absorption — node failovers, cluster scaling, transient connectivity errors become zero-downtime background events. Shard failovers run liberally and frequently across the entire Redis footprint as live resiliency exercises.
  • Forward-looking pluggability — alternative backend storage technologies behind the same protocol (e.g. durable stores), inline data encryption / compression, priority-aware QoS backpressure, multi-upstream traffic mirroring, customizable command usage restrictions — the article frames these as optionality the architecture enables, not shipped features.
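The first capability, connection multiplexing, can be sketched minimally: many logical client sessions borrow from a small fixed pool of outbound connections, so the backend sees `pool_size` connections regardless of client-fleet size (names here are illustrative, not FigCache's implementation):

```python
import queue, threading

class MultiplexedPool:
    def __init__(self, pool_size: int):
        self._conns = queue.Queue()
        for i in range(pool_size):
            self._conns.put(f"outbound-conn-{i}")   # stand-in for a socket
        self.backend_connections = pool_size        # fixed, by construction

    def execute(self, command: str) -> str:
        conn = self._conns.get()        # borrow (blocks if pool exhausted)
        try:
            return f"{command} via {conn}"
        finally:
            self._conns.put(conn)       # return connection to the pool

pool = MultiplexedPool(pool_size=4)

# 100 concurrent "client connections" worth of traffic; backend sees 4 conns.
results = []
threads = [threading.Thread(target=lambda: results.append(pool.execute("GET k")))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(pool.backend_connections)
```

Because the outbound pool is sized independently of the client fleet, scaling clients out (or a thundering herd of reconnects) never changes what Redis sees.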

Rollout strategy

  1. Migrate all services to first-party FigCache client wrappers in Go / Ruby / TypeScript. Wrappers sit over existing OSS Redis clients (no proprietary protocol / client rewrite) → interface-compatible, minimal application lift. Original Redis endpoints retained.
  2. Productionize FigCache — tackle all scalability / reliability / operability / observability at the proxy layer.
  3. Gradual, reversible application migration. Per-service opt-in; for large workloads (main API), traffic shifted incrementally across multiple independent domains (never all-or-nothing); feature flags gate the cutover for emergency runtime rollback without code changes or binary deployments.

Latency risk controls

  • Extensive performance evaluations with OSS + internal Redis benchmark tools.
  • Weekly distributed production stress test at ≥10× Figma's typical organic peak throughput — automated ongoing validation of end-to-end capacity under excessive load.
  • Zonal traffic colocation tuned probabilistically across client → FigCache LB → FigCache instance (the patterns/zone-affinity-routing shape applied to a caching proxy tier; cross-AZ hop "up to a few milliseconds").
  • Per-PR CI CPU + memory profiles on critical hot paths + hermetic synthetic benchmarks vs a golden performance baseline; blocks regressions before merge.
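The zonal colocation control above can be sketched as probabilistic zone-affinity routing: prefer same-AZ FigCache instances with high probability, spilling a small fraction cross-AZ to keep load balanced. The weight and fleet shape here are illustrative assumptions, not Figma's actual tuning:

```python
import random

def pick_instance(client_az: str, instances: dict, same_az_weight: float = 0.9):
    """instances maps AZ name -> list of FigCache instance names."""
    local = instances.get(client_az, [])
    remote = [i for az, xs in instances.items() if az != client_az for i in xs]
    if local and (not remote or random.random() < same_az_weight):
        return random.choice(local)     # same-AZ: avoids the cross-AZ hop
    return random.choice(remote)        # cross-AZ: "up to a few milliseconds"

fleet = {"us-east-1a": ["fc-a1", "fc-a2"], "us-east-1b": ["fc-b1"]}
print(pick_instance("us-east-1a", fleet))
```

Keeping the weight below 1.0 trades a small amount of cross-AZ latency for resilience when a zone's local instances are overloaded or absent.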

Build-vs-buy rationale

FigCache was built in-house despite "classic infrastructure problems, for which solutions exist in open source," because:

  • OSS Redis proxies shipped rudimentary RPC servers unable to extract full annotated arguments from arbitrary RESP commands — blocking generic command-semantics-aware guardrails + custom command interception.
  • Fragmented Figma client ecosystem (mixed cluster-awareness / TLS / connection-param permutations) needed RPC-layer shims that upstream proxies don't offer.
  • Extending OSS proxies with custom business logic required maintaining a source-code fork difficult to keep in sync with upstream.
  • Figma wanted internally composable extensibility — priority-aware QoS, inline encryption / compression, multi-upstream mirroring, customizable restrictions — without forking.

Results

  • Six-nines uptime on the caching layer since H2 2025 rollout for 100% of main-API Redis traffic.
  • Order-of-magnitude reduction in Redis cluster connection counts; volatility sharply reduced despite unchanged diurnal site traffic pattern.
  • Incident-diagnosis time: hours/days → minutes, enabled by end-to-end metrics/logs/traces per command.
  • Cross-team coordination for hardware rotations / cluster topology mods / OS upgrades / security updates downgraded from high-sev incidents to routine zero-downtime background ops.
  • Figma can now formally define a caching-platform SLO and quantify aggregate Redis reliability at Figma.

Team

Built by the Storage Products team in collaboration with many Figma Infrastructure engineers. Named contributors: Justin Palpant, Indy Prentice, Pratik Agarwal, Yichao Zhao (primary); Devang Sampat, Mehant Baid, Alex Sosa, Ankita Shankar, Anna Saplitski, Can Berk Guder, Jim Myers, Lihao He, Manish Jain, Ping-Min Lin.
