
FigCache

FigCache is Figma's in-house stateless, horizontally scalable, RESP-wire-protocol proxy service that sits between Figma's applications and a fleet of AWS ElastiCache Redis clusters. It acts as a unified Redis data plane: clients see a Redis-compatible endpoint while FigCache routes commands to the right upstream cluster, multiplexes client connections onto a much smaller pool of outbound Redis connections, and layers observability, guardrails, and cluster-topology handling beneath thick client libraries. Rolled out in H2 2025 for Figma's main API service; post-rollout, Figma's caching layer hit six-nines uptime and Redis cluster connection counts dropped by an order of magnitude.

Why it exists

Before FigCache, Figma's Redis footprint showed several structural failure modes at scale:

  • Connection volume approaching Redis hard limits on critical clusters.
  • Thundering-herd connection establishment whenever client services scaled out quickly — bottlenecking I/O, degrading availability.
  • Fragmented client ecosystem — inconsistent observability across Go / Ruby / TypeScript libraries; inconsistent Redis Cluster awareness / TLS config across apps; no fleet-wide guarantee of client-side state correctness during failovers or topology changes.
  • No centralized traffic management — applications could pollute / corrupt data across clusters; cluster-partitioning decisions were duplicated in every application.

Localized remedies (removing Redis dependencies from some API paths; building a bespoke client-side connection pool) isolated outages but didn't close the structural gap. FigCache is the strategic response.

Architecture

Frontend / backend split

  • Frontend layer — client interaction: RESP-based RPC, network I/O, connection management, protocol-aware structured command parsing. Implemented via ResPC (systems/respc).
  • Backend layer — command processing and manipulation, connection multiplexing to storage backends, physical command execution.
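The "protocol-aware structured command parsing" in the frontend layer means decoding RESP frames into structured commands before any routing happens. A minimal sketch of RESP array parsing (function name and shape are illustrative, not FigCache's actual ResPC API):

```python
# Minimal sketch of parsing a RESP (Redis Serialization Protocol) command:
# a client command arrives as an array of bulk strings on the wire.
def parse_resp_array(data: bytes) -> list[bytes]:
    """Parse one RESP array of bulk strings, e.g. an inbound client command."""
    lines = data.split(b"\r\n")
    assert lines[0].startswith(b"*"), "expected RESP array header"
    n = int(lines[0][1:])          # number of arguments
    args, i = [], 1
    for _ in range(n):
        assert lines[i].startswith(b"$"), "expected bulk-string length header"
        length = int(lines[i][1:])
        arg = lines[i + 1]
        assert len(arg) == length, "bulk string length mismatch"
        args.append(arg)
        i += 2
    return args

# "SET mykey hello" as it appears on the wire:
wire = b"*3\r\n$3\r\nSET\r\n$5\r\nmykey\r\n$5\r\nhello\r\n"
print(parse_resp_array(wire))  # [b'SET', b'mykey', b'hello']
```

Once commands exist as structured argument lists rather than opaque byte streams, the backend can apply per-command routing, guardrails, and metrics.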

Engine tree (backend)

Commands execute through a dynamically-assembled tree of engine nodes:

  • Data engines (leaves) execute commands against Redis.
  • Filter engines (intermediate nodes) route / block / modify commands inline before passing execution to child engines.
  • Processing a command = evaluating this directed graph from the root, with the structured ResPC command as input and a structured ResPC reply as output.

The entire engine tree is expressed in configuration and assembled at runtime during server init. See patterns/starlark-configuration-dsl for the configuration substrate.

Illustrated primitives (from the blog):

  • Router(rules=[Rule(command|prefix, engine)]) — split execution among child engines by command-schema match or key-pattern match.
  • Redis(...) — execute against a specific upstream cluster.
  • Static(reply=...) — return a hard-coded reply (e.g. rejection).

Composition expresses command-type splitting + key-prefix-based routing / rejection + multi-cluster dispatch in pure config, without server-binary redeploys.
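A hypothetical sketch of how these primitives might compose, with Python standing in for the Starlark-like config language (the class names follow the blog's primitives, but the shapes and fields here are illustrative, not FigCache's actual DSL):

```python
from dataclasses import dataclass

@dataclass
class Redis:                      # leaf: execute against one upstream cluster
    cluster: str
    def execute(self, cmd):
        return f"-> {self.cluster}: {cmd}"

@dataclass
class Static:                     # leaf: hard-coded reply (e.g. a rejection)
    reply: str
    def execute(self, cmd):
        return self.reply

@dataclass
class Rule:                       # key-prefix match -> child engine
    prefix: str
    engine: object

@dataclass
class Router:                     # filter node: split execution among children
    rules: list
    default: object
    def execute(self, cmd):
        key = cmd[1] if len(cmd) > 1 else ""
        for rule in self.rules:
            if key.startswith(rule.prefix):
                return rule.engine.execute(cmd)
        return self.default.execute(cmd)

# Key-prefix routing + rejection + multi-cluster dispatch in pure config:
root = Router(
    rules=[
        Rule("session:", Redis(cluster="sessions")),
        Rule("deprecated:", Static(reply="-ERR key namespace retired")),
    ],
    default=Redis(cluster="general"),
)

print(root.execute(["GET", "session:abc"]))
```

Evaluating `root.execute(...)` walks the tree from the root exactly as the document describes: filter engines route inline, data engines (the leaves) hit Redis. Changing the routing is a config edit, not a binary redeploy.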

Fanout filter engine

Redis Cluster mode rejects multi-key pipelines / transactions spanning different hash slots with CROSSSLOT. FigCache's fanout filter engine intercepts eligible read-only multi-shard pipelines and transparently resolves them as a parallelized scatter-gather — dispatching individual per-shard commands, aggregating responses, returning a normal pipeline reply to the client. Hides a first-class Redis-Cluster protocol limitation behind the proxy.
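The scatter-gather idea can be sketched as follows; the slot and shard functions are simplified stand-ins for Redis Cluster's CRC16-based hash-slot mapping, and none of these names come from FigCache itself:

```python
from concurrent.futures import ThreadPoolExecutor

def slot_of(key: str, num_shards: int = 3) -> int:
    return hash(key) % num_shards      # stand-in for CRC16(key) % 16384

def shard_mget(shard: int, keys: list) -> dict:
    # Placeholder for a real per-shard MGET against that shard's primary.
    return {k: f"value({k})@shard{shard}" for k in keys}

def fanout_mget(keys: list) -> list:
    # 1) Group keys by owning shard.
    groups: dict[int, list] = {}
    for k in keys:
        groups.setdefault(slot_of(k), []).append(k)
    # 2) Dispatch per-shard commands in parallel, 3) aggregate replies.
    merged: dict = {}
    with ThreadPoolExecutor() as pool:
        for result in pool.map(lambda sk: shard_mget(*sk), groups.items()):
            merged.update(result)
    # 4) Return one normal reply in the client's original key order.
    return [merged[k] for k in keys]

print(fanout_mget(["a", "b", "c"]))
```

The client sees a single pipeline reply; the CROSSSLOT restriction never surfaces.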

Cluster-mode emulation

A Redis Cluster emulation shim in the RPC layer exposes FigCache as a fake cluster to cluster-aware clients. Handles the fragmented permutations of cluster-aware / non-aware clients with different TLS and connection-param shapes — making drop-in migration trivial (one-line endpoint change in the simplest case).
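One way such a shim can work, sketched under assumption (this is illustrative, not FigCache's code): when a cluster-aware client probes topology, answer as a single-node "cluster" whose only node is the proxy itself, so every hash slot routes back through the proxy.

```python
PROXY_HOST, PROXY_PORT = "figcache.internal", 6379   # hypothetical endpoint

def handle_command(args: list):
    if [a.upper() for a in args[:2]] == ["CLUSTER", "SLOTS"]:
        # Claim the entire slot range 0..16383, owned by the proxy "node",
        # so cluster-aware clients send all traffic here.
        return [[0, 16383, [PROXY_HOST, PROXY_PORT, "proxy-node-id"]]]
    if args[0].upper() == "CLUSTER":
        return "-ERR unsupported CLUSTER subcommand"
    return forward_to_backend(args)     # normal data-path handling

def forward_to_backend(args):
    return f"(forwarded {args[0]})"     # placeholder for engine-tree execution

print(handle_command(["CLUSTER", "SLOTS"]))
```

Cluster-aware and non-aware clients then behave identically from the proxy's perspective, which is what makes the one-line endpoint change sufficient in the simplest migrations.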

Stateless + horizontally scalable

FigCache itself holds no authoritative state — all data is in the backing Redis clusters. The fleet scales horizontally with client load; connection pooling to upstream Redis isolates Redis from fleet churn.

Key capabilities the architecture enables

  • Connection multiplexing — orders-of-magnitude more client connections fan into a finite pool of outbound connections; Redis sees stable connection counts independent of client-fleet size (concepts/connection-multiplexing).
  • Centralized traffic routing via the engine tree — one source of truth for cluster-partitioning, key-prefix → cluster mapping, rejection policies.
  • Language-agnostic, command-semantics-aware observability — availability / throughput / latency / payload size / command cardinality / connection-distribution metrics on every Redis command. Routing-layer traffic classification ascribes ownership metadata (tier / durability / consistency) to every inbound command → hundreds of workloads sliced independently.
  • Custom commands and protocol extensions — the schema-driven command parser lets FigCache implement commands that would otherwise be duplicated in every client library. Article cites a language-agnostic multi-cluster distributed locking abstraction over Redlock, and a protocol-native graceful connection draining mechanism for continuous deployments.
  • Operational event absorption — node failovers, cluster scaling, transient connectivity errors become zero-downtime background events. Shard failovers run liberally and frequently across the entire Redis footprint as live resiliency exercises.
  • Forward-looking pluggability — alternative backend storage technologies behind the same protocol (e.g. durable stores), inline data encryption / compression, priority-aware QoS backpressure, multi-upstream traffic mirroring, customizable command usage restrictions — the article frames these as optionality the architecture enables, not shipped features.
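The first capability, connection multiplexing, can be sketched minimally: many logical client sessions borrow from a small fixed pool of outbound connections, so the backend sees `pool_size` connections regardless of client-fleet size (names here are illustrative, not FigCache's implementation):

```python
import queue, threading

class MultiplexedPool:
    def __init__(self, pool_size: int):
        self._conns = queue.Queue()
        for i in range(pool_size):
            self._conns.put(f"outbound-conn-{i}")   # stand-in for a socket
        self.backend_connections = pool_size        # fixed, by construction

    def execute(self, command: str) -> str:
        conn = self._conns.get()        # borrow (blocks if pool exhausted)
        try:
            return f"{command} via {conn}"
        finally:
            self._conns.put(conn)       # return connection to the pool

pool = MultiplexedPool(pool_size=4)

# 100 concurrent "client connections" worth of traffic; backend sees 4 conns.
results = []
threads = [threading.Thread(target=lambda: results.append(pool.execute("GET k")))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(pool.backend_connections)
```

Because the outbound pool is sized independently of the client fleet, scaling clients out (or a thundering herd of reconnects) never changes what Redis sees.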

Rollout strategy

  1. Migrate all services to first-party FigCache client wrappers in Go / Ruby / TypeScript. Wrappers sit over existing OSS Redis clients (no proprietary protocol / client rewrite) → interface-compatible, minimal application lift. Original Redis endpoints retained.
  2. Productionize FigCache — tackle all scalability / reliability / operability / observability at the proxy layer.
  3. Gradual, reversible application migration. Per-service opt-in; for large workloads (main API), traffic shifted incrementally across multiple independent domains (never all-or-nothing); feature flags gate the cutover for emergency runtime rollback without code changes or binary deployments.

Latency risk controls

  • Extensive performance evaluations with OSS + internal Redis benchmark tools.
  • Weekly distributed production stress test at ≥10× Figma's typical organic peak throughput — automated ongoing validation of end-to-end capacity under excessive load.
  • Zonal traffic colocation tuned probabilistically across client → FigCache LB → FigCache instance (the patterns/zone-affinity-routing shape applied to a caching proxy tier; cross-AZ hop "up to a few milliseconds").
  • Per-PR CI CPU + memory profiles on critical hot paths + hermetic synthetic benchmarks vs a golden performance baseline; blocks regressions before merge.
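The zonal colocation control above can be sketched as probabilistic zone-affinity routing: prefer same-AZ FigCache instances with high probability, spilling a small fraction cross-AZ to keep load balanced. The weight and fleet shape here are illustrative assumptions, not Figma's actual tuning:

```python
import random

def pick_instance(client_az: str, instances: dict, same_az_weight: float = 0.9):
    """instances maps AZ name -> list of FigCache instance names."""
    local = instances.get(client_az, [])
    remote = [i for az, xs in instances.items() if az != client_az for i in xs]
    if local and (not remote or random.random() < same_az_weight):
        return random.choice(local)     # same-AZ: avoids the cross-AZ hop
    return random.choice(remote)        # cross-AZ: "up to a few milliseconds"

fleet = {"us-east-1a": ["fc-a1", "fc-a2"], "us-east-1b": ["fc-b1"]}
print(pick_instance("us-east-1a", fleet))
```

Keeping the weight below 1.0 trades a small amount of cross-AZ latency for resilience when a zone's local instances are overloaded or absent.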

Build-vs-buy rationale

FigCache was built in-house despite "classic infrastructure problems, for which solutions exist in open source," because:

  • OSS Redis proxies shipped rudimentary RPC servers unable to extract full annotated arguments from arbitrary RESP commands — blocking generic command-semantics-aware guardrails + custom command interception.
  • Fragmented Figma client ecosystem (mixed cluster-awareness / TLS / connection-param permutations) needed RPC-layer shims that upstream proxies don't offer.
  • Extending OSS proxies with custom business logic required maintaining a source-code fork difficult to keep in sync with upstream.
  • Figma wanted internally composable extensibility — priority-aware QoS, inline encryption / compression, multi-upstream mirroring, customizable restrictions — without forking.

Results

  • Six-nines uptime on the caching layer since H2 2025 rollout for 100% of main-API Redis traffic.
  • Order-of-magnitude reduction in Redis cluster connection counts; volatility sharply reduced despite unchanged diurnal site traffic pattern.
  • Incident-diagnosis time: hours/days → minutes, enabled by end-to-end metrics/logs/traces per command.
  • Cross-team coordination for hardware rotations / cluster topology mods / OS upgrades / security updates downgraded from high-sev incidents to routine zero-downtime background ops.
  • Figma can now formally define a caching-platform SLO and quantify aggregate Redis reliability at Figma.

Team

Built by the Storage Products team in collaboration with many Figma Infrastructure engineers. Named contributors: Justin Palpant, Indy Prentice, Pratik Agarwal, Yichao Zhao (primary); Devang Sampat, Mehant Baid, Alex Sosa, Ankita Shankar, Anna Saplitski, Can Berk Guder, Jim Myers, Lihao He, Manish Jain, Ping-Min Lin.
