FigCache¶
FigCache is Figma's in-house stateless, horizontally scalable, RESP-wire-protocol proxy service that sits between Figma's applications and a fleet of AWS ElastiCache Redis clusters. It acts as a unified Redis data plane: clients see a Redis-compatible endpoint, while FigCache routes commands to the right upstream cluster, multiplexes client connections onto a much smaller pool of outbound Redis connections, and layers observability, guardrails, and cluster-topology handling beneath thick client libraries. It was rolled out in H2 2025 for Figma's main API service; post-rollout, Figma's caching layer hit six-nines uptime and Redis cluster connection counts dropped by an order of magnitude.
Why it exists¶
Before FigCache, Figma's Redis footprint showed several structural failure modes at scale:
- Connection volume approaching Redis hard limits on critical clusters.
- Thundering-herd connection establishment whenever client services scaled out quickly — bottlenecking I/O, degrading availability.
- Fragmented client ecosystem — inconsistent observability across Go / Ruby / TypeScript libraries; inconsistent Redis Cluster awareness / TLS config across apps; no fleet-wide guarantee of client-side state correctness during failovers or topology changes.
- No centralized traffic management — applications could pollute / corrupt data across clusters; cluster-partitioning decisions were duplicated in every application.
Localized remedies (removing Redis dependencies from some API paths; building a bespoke client-side connection pool) isolated outages but didn't close the structural gap. FigCache is the strategic response.
Architecture¶
Frontend / backend split¶
- Frontend layer — client interaction: RESP-based RPC, network I/O, connection management, protocol-aware structured command parsing. Implemented via ResPC (systems/respc).
- Backend layer — command processing and manipulation, connection multiplexing to storage backends, physical command execution.
Engine tree (backend)¶
Commands execute through a dynamically-assembled tree of engine nodes:
- Data engines (leaves) execute commands against Redis.
- Filter engines (intermediate nodes) route / block / modify commands inline before passing execution to child engines.
- Processing a command = evaluating this directed graph from the root, with the structured ResPC command as input and a structured ResPC reply as output.
The entire engine tree is expressed in configuration and assembled at runtime during server init. See patterns/starlark-configuration-dsl for the configuration substrate.
Illustrated primitives (from the blog):
- `Router(rules=[Rule(command|prefix, engine)])` — split execution among child engines by command-schema match or key-pattern match.
- `Redis(...)` — execute against a specific upstream cluster.
- `Static(reply=...)` — return a hard-coded reply (e.g. a rejection).
Composition expresses command-type splitting + key-prefix-based routing / rejection + multi-cluster dispatch in pure config, without server-binary redeploys.
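The shape of such a composition can be sketched in Python-like pseudocode mirroring the Starlark DSL. `Router` / `Rule` / `Redis` / `Static` are the names from the blog; the evaluation semantics, cluster names, and key prefixes below are illustrative assumptions, not Figma's actual config:

```python
from dataclasses import dataclass

# Toy engine-tree sketch. Router/Rule/Redis/Static mirror the primitives
# named in the blog; evaluation semantics here are an illustrative assumption.

@dataclass
class Command:
    name: str   # e.g. "GET", "SET"
    key: str

@dataclass
class Redis:
    cluster: str   # upstream cluster identifier (hypothetical)
    def execute(self, cmd):
        return f"executed {cmd.name} {cmd.key} on {self.cluster}"

@dataclass
class Static:
    reply: str   # hard-coded reply, e.g. a rejection
    def execute(self, cmd):
        return self.reply

@dataclass
class Rule:
    prefix: str    # key-pattern match (command-schema matching omitted)
    engine: object

@dataclass
class Router:
    rules: list
    default: object
    def execute(self, cmd):
        # Evaluate the tree from the root: first matching rule wins.
        for rule in self.rules:
            if cmd.key.startswith(rule.prefix):
                return rule.engine.execute(cmd)
        return self.default.execute(cmd)

# Assembled "engine tree": prefix routing + rejection, expressed purely in config.
tree = Router(
    rules=[
        Rule("session:", Redis(cluster="sessions-cluster")),
        Rule("legacy:", Static(reply="-ERR this keyspace is retired")),
    ],
    default=Redis(cluster="general-cluster"),
)

print(tree.execute(Command("GET", "session:abc")))
print(tree.execute(Command("GET", "legacy:x")))
```

Reassembling this tree from config at server init, rather than at compile time, is what lets routing and rejection policy change without a binary redeploy.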
Fanout filter engine¶
Redis Cluster mode rejects multi-key pipelines and transactions that span different hash slots with a CROSSSLOT error. FigCache's fanout filter engine intercepts eligible read-only multi-shard pipelines and transparently resolves them as a parallelized scatter-gather: it dispatches individual per-shard commands, aggregates the responses, and returns a normal pipeline reply to the client. This hides a first-class Redis Cluster protocol limitation behind the proxy.
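A minimal sketch of the scatter-gather idea for a multi-key read (e.g. `MGET`). The `slot()` function is a stand-in for Redis's CRC16-based hash slots, and the in-memory shard dicts replace real per-shard dispatch:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy scatter-gather for a read-only multi-key pipeline (e.g. MGET).
# slot() stands in for Redis's CRC16-mod-16384 hash slots; real code
# groups keys by actual slot-to-shard ownership.

SHARDS = {
    0: {"c": "3"},
    1: {"a": "1", "d": "4"},
    2: {"b": "2"},
}

def slot(key: str) -> int:
    return sum(key.encode()) % len(SHARDS)   # illustrative, not CRC16

def shard_mget(shard_id, keys):
    data = SHARDS[shard_id]
    return [(k, data.get(k)) for k in keys]

def fanout_mget(keys):
    # Scatter: group keys per owning shard, dispatch shard commands in parallel.
    groups = {}
    for k in keys:
        groups.setdefault(slot(k), []).append(k)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(shard_mget, sid, ks) for sid, ks in groups.items()]
        results = dict(kv for f in futures for kv in f.result())
    # Gather: reassemble replies in the client's original key order.
    return [results[k] for k in keys]

print(fanout_mget(["a", "b", "c", "missing"]))   # ['1', '2', '3', None]
```

The client sees one ordinary pipeline reply; the per-shard split and reassembly never surface at the protocol level.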
Cluster-mode emulation¶
A Redis Cluster emulation shim in the RPC layer exposes FigCache as a fake cluster to cluster-aware clients. It handles the fragmented permutations of cluster-aware and non-aware clients with different TLS and connection-parameter shapes, making drop-in migration trivial (a one-line endpoint change in the simplest case).
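One way such a shim can answer topology queries is to present the proxy itself as a single node owning every hash slot, so cluster-aware clients route all keys to the proxy endpoint. The reply layout follows `CLUSTER SLOTS`'s documented shape; the handler wiring, hostnames, and node id are assumptions:

```python
# Toy cluster-mode emulation: answer CLUSTER SLOTS as if the proxy were
# a one-node cluster owning all 16384 hash slots. Cluster-aware clients
# then send every key to the proxy endpoint.

PROXY_HOST = "figcache.internal"   # hypothetical endpoint
PROXY_PORT = 6379

def forward_to_engine_tree(args):
    # Normal proxy path (engine-tree execution), elided here.
    return f"+forwarded {' '.join(args)}"

def handle_command(args):
    if [a.upper() for a in args[:2]] == ["CLUSTER", "SLOTS"]:
        # One slot range covering 0..16383, served by the proxy itself.
        return [[0, 16383, [PROXY_HOST, PROXY_PORT, "figcache-virtual-node"]]]
    if args[0].upper() == "CLUSTER":
        return "-ERR unsupported CLUSTER subcommand (sketch)"
    return forward_to_engine_tree(args)

print(handle_command(["CLUSTER", "SLOTS"]))
print(handle_command(["GET", "k"]))
```

Because non-cluster-aware clients never send `CLUSTER` commands, the same endpoint serves both client shapes.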
Stateless + horizontally scalable¶
FigCache itself holds no authoritative state — all data is in the backing Redis clusters. The fleet scales horizontally with client load; connection pooling to upstream Redis isolates Redis from fleet churn.
Key capabilities the architecture enables¶
- Connection multiplexing — orders-of-magnitude more client connections fan into a finite pool of outbound connections; Redis sees stable connection counts independent of client-fleet size (concepts/connection-multiplexing).
- Centralized traffic routing via the engine tree — one source of truth for cluster-partitioning, key-prefix → cluster mapping, rejection policies.
- Language-agnostic, command-semantics-aware observability — availability / throughput / latency / payload size / command cardinality / connection-distribution metrics on every Redis command. Routing-layer traffic classification ascribes ownership metadata (tier / durability / consistency) to every inbound command → hundreds of workloads sliced independently.
- Custom commands and protocol extensions — the schema-driven command parser lets FigCache implement commands that would otherwise be duplicated in every client library. Article cites a language-agnostic multi-cluster distributed locking abstraction over Redlock, and a protocol-native graceful connection draining mechanism for continuous deployments.
- Operational event absorption — node failovers, cluster scaling, transient connectivity errors become zero-downtime background events. Shard failovers run liberally and frequently across the entire Redis footprint as live resiliency exercises.
- Forward-looking pluggability — alternative backend storage technologies behind the same protocol (e.g. durable stores), inline data encryption / compression, priority-aware QoS backpressure, multi-upstream traffic mirroring, customizable command usage restrictions — the article frames these as optionality the architecture enables, not shipped features.
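The multiplexing idea in the first bullet can be sketched as many client requests funneling through a fixed-size upstream pool, so the backend's connection count stays flat regardless of client-fleet size. Pool size and the queue-based borrowing scheme are illustrative:

```python
import queue
import threading

# Toy connection multiplexer: any number of client threads share a fixed
# pool of upstream "connections", so the backend sees a stable connection
# count independent of how many clients exist.

class UpstreamPool:
    def __init__(self, size):
        self.conns = queue.Queue()
        for i in range(size):
            self.conns.put(f"upstream-conn-{i}")   # stand-in for a real socket
        self.peak_in_use = 0
        self._in_use = 0
        self._lock = threading.Lock()

    def execute(self, command):
        conn = self.conns.get()   # borrow; blocks if the pool is exhausted
        with self._lock:
            self._in_use += 1
            self.peak_in_use = max(self.peak_in_use, self._in_use)
        try:
            return f"{command} via {conn}"
        finally:
            with self._lock:
                self._in_use -= 1
            self.conns.put(conn)   # return the connection to the pool

pool = UpstreamPool(size=4)
threads = [threading.Thread(target=pool.execute, args=(f"GET k{i}",))
           for i in range(100)]   # 100 "clients" share 4 upstream connections
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak upstream connections in use:", pool.peak_in_use)
```

Scaling the client fleet changes only queue contention at the proxy, not the upstream connection count — which is the property that eliminated the connection-limit and thundering-herd failure modes.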
Rollout strategy¶
- Migrate all services to first-party FigCache client wrappers in Go / Ruby / TypeScript. Wrappers sit over existing OSS Redis clients (no proprietary protocol / client rewrite) → interface-compatible, minimal application lift. Original Redis endpoints retained.
- Productionize FigCache — tackle all scalability / reliability / operability / observability at the proxy layer.
- Gradual, reversible application migration. Per-service opt-in; for large workloads (main API), traffic shifted incrementally across multiple independent domains (never all-or-nothing); feature flags gate the cutover for emergency runtime rollback without code changes or binary deployments.
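The feature-flag-gated cutover in the last bullet amounts to choosing the endpoint at connection time. A minimal sketch; flag names, rollout percentages, and endpoints are hypothetical:

```python
import hashlib

# Toy flag-gated endpoint selection: each traffic domain is shifted to
# FigCache incrementally; flipping the flag reverts to the original Redis
# endpoint at runtime, with no code change or binary deployment.

FLAGS = {"figcache_rollout_pct": {"api-files": 25, "api-comments": 100}}

DIRECT_ENDPOINT = "redis-main.internal:6379"    # original endpoint (hypothetical)
FIGCACHE_ENDPOINT = "figcache.internal:6379"    # proxy endpoint (hypothetical)

def bucket(unit: str) -> int:
    # Stable 0-99 bucket so a given unit always gets the same decision.
    return int(hashlib.sha256(unit.encode()).hexdigest(), 16) % 100

def endpoint_for(domain: str, unit: str) -> str:
    pct = FLAGS["figcache_rollout_pct"].get(domain, 0)
    return FIGCACHE_ENDPOINT if bucket(unit) < pct else DIRECT_ENDPOINT

print(endpoint_for("api-comments", "req-123"))   # 100% -> always FigCache
```

Because the original Redis endpoints stay live behind the proxy, setting a domain's percentage back to 0 is a complete, instantaneous rollback.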
Latency risk controls¶
- Extensive performance evaluations with OSS + internal Redis benchmark tools.
- Weekly distributed production stress test at ≥10× Figma's typical organic peak throughput — automated ongoing validation of end-to-end capacity under excessive load.
- Zonal traffic colocation tuned probabilistically across client → FigCache LB → FigCache instance (the patterns/zone-affinity-routing shape applied to a caching proxy tier; cross-AZ hop "up to a few milliseconds").
- Per-PR CI CPU + memory profiles on critical hot paths + hermetic synthetic benchmarks vs a golden performance baseline; blocks regressions before merge.
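The per-PR regression gate in the last bullet can be sketched as comparing fresh benchmark numbers against a checked-in golden baseline with a tolerance. Metric names, baseline values, and the 5% threshold are hypothetical:

```python
# Toy benchmark regression gate: block a PR if a hot-path metric regresses
# beyond a tolerance versus a golden performance baseline.

GOLDEN = {"p50_latency_us": 120.0, "cpu_per_op_us": 35.0, "rss_mb": 310.0}
TOLERANCE = 0.05   # allow up to 5% regression (illustrative)

def check_regressions(measured, golden=GOLDEN, tolerance=TOLERANCE):
    failures = []
    for metric, baseline in golden.items():
        value = measured[metric]
        if value > baseline * (1 + tolerance):
            failures.append(f"{metric}: {value} vs baseline {baseline}")
    return failures   # empty list means the gate passes

ok_run = {"p50_latency_us": 118.0, "cpu_per_op_us": 36.0, "rss_mb": 300.0}
bad_run = {"p50_latency_us": 140.0, "cpu_per_op_us": 36.0, "rss_mb": 300.0}

print(check_regressions(ok_run))    # [] -> gate passes
print(check_regressions(bad_run))   # flags the p50 latency regression
```

Running this against hermetic synthetic benchmarks (rather than noisy shared CI hosts) is what makes a fixed tolerance usable as a hard merge blocker.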
Build-vs-buy rationale¶
FigCache was built in-house despite "classic infrastructure problems, for which solutions exist in open source," because:
- OSS Redis proxies shipped rudimentary RPC servers unable to extract full annotated arguments from arbitrary RESP commands — blocking generic command-semantics-aware guardrails + custom command interception.
- Fragmented Figma client ecosystem (mixed cluster-awareness / TLS / connection-param permutations) needed RPC-layer shims that upstream proxies don't offer.
- Extending OSS proxies with custom business logic required maintaining a source-code fork difficult to keep in sync with upstream.
- Figma wanted internally composable extensibility — priority-aware QoS, inline encryption / compression, multi-upstream mirroring, customizable restrictions — without forking.
Results¶
- Six-nines uptime on the caching layer since H2 2025 rollout for 100% of main-API Redis traffic.
- Order-of-magnitude reduction in Redis cluster connection counts; volatility sharply reduced despite unchanged diurnal site traffic pattern.
- Incident-diagnosis time: hours/days → minutes, enabled by end-to-end metrics/logs/traces per command.
- Cross-team coordination for hardware rotations / cluster topology mods / OS upgrades / security updates downgraded from high-sev incidents to routine zero-downtime background ops.
- Figma can now formally define a caching-platform SLO and quantify aggregate Redis reliability at Figma.
Team¶
Built by the Storage Products team in collaboration with many Figma Infrastructure engineers. Named contributors: Justin Palpant, Indy Prentice, Pratik Agarwal, Yichao Zhao (primary); Devang Sampat, Mehant Baid, Alex Sosa, Ankita Shankar, Anna Saplitski, Can Berk Guder, Jim Myers, Lihao He, Manish Jain, Ping-Min Lin.
Seen in¶
- sources/2026-04-21-figma-figcache-next-generation-data-caching-platform — primary source; single-post reveal of the full architecture + rollout strategy.
Related¶
- systems/respc — the Go RESP-RPC framework inside FigCache's frontend
- systems/redis, systems/aws-elasticache — the backing stores
- patterns/caching-proxy-tier — the architectural pattern
- patterns/protocol-compatible-drop-in-proxy — what makes migration a one-line change
- patterns/starlark-configuration-dsl — how engine trees are authored
- concepts/connection-multiplexing — the central mechanism
- concepts/control-plane-data-plane-separation — FigCache config = control plane; engine-tree execution = data plane