SYSTEM Cited by 7 sources

Redis

Redis is an in-memory data-structure store: key-value on the outside, but with first-class server-side support for lists, hashes, sorted sets, streams, and pub/sub. Persistence is optional (RDB snapshots and/or an append-only file, AOF). It is typically deployed as a cache, a fast serving tier for precomputed artifacts, a lightweight message broker, or a rate-limiter / counter store. Managed offerings (AWS ElastiCache, Google Memorystore, Redis Cloud) are the common deployment in production.

Properties relevant to system design

  • Single-threaded command execution: each instance runs one command at a time, so single commands are atomic and need no in-process locking.
  • Sub-millisecond in-memory reads when the dataset fits in RAM.
  • Replication + cluster sharding for scale; read replicas for read fan-out.
  • TTL on keys for cache-with-expiry as a first-class primitive.
  • Not a source of truth. Durability is best-effort; treat as a cache / derived read model and keep the authoritative copy elsewhere.
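
The TTL and not-a-source-of-truth properties above combine into the usual cache-aside read path. The following is a minimal sketch, not a reference implementation: `get_or_load`, `InMemoryStandIn`, and the JSON encoding are hypothetical names and choices made for illustration. The stand-in only mimics the `get` / `set(..., ex=...)` subset of a Redis client so the example runs without a server; a redis-py client could be passed in its place.

```python
import json
import time
from typing import Any, Callable, Optional


class InMemoryStandIn:
    """Hypothetical stand-in for the get/set(ex=...) subset of a Redis client."""

    def __init__(self) -> None:
        self._data = {}  # key -> (value, expiry deadline or None)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazily drop the entry once its TTL has passed
            return None
        return value

    def set(self, key: str, value: str, ex: Optional[int] = None) -> bool:
        expires_at = None if ex is None else time.monotonic() + ex
        self._data[key] = (value, expires_at)
        return True


def get_or_load(cache: Any, key: str, load: Callable[[], Any], ttl_seconds: int) -> Any:
    """Cache-aside read: serve from the cache, fall back to the source of truth."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    value = load()  # authoritative read, e.g. from a relational store
    cache.set(key, json.dumps(value), ex=ttl_seconds)  # derived copy with expiry
    return value
```

Because the cached value is only a derived copy with an expiry, losing the cache costs a reload from the authoritative store rather than data.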

Seen in

  • sources/2026-06-02-redpanda-how-omninode-uses-redpanda-to-scale-ai-agent-workflows — 2026-06-02 OmniNode → Redpanda migration disclosure (guest post on Redpanda Blog by founder Jonah Gray). Canonicalises Redis Streams' scale ceiling at the 5 → 12 repos / 100+ event types crossing point: "we outgrew Redis Streams not because of throughput, but because coordination itself became difficult." The trigger was the five Kafka-shaped capabilities Redis Streams doesn't offer at the same fidelity — consumer groups, partition-level parallelism, durable replay semantics, topic introspection, programmatic provisioning. Disclosed migration path: XADD/XREADGROUP behind a transport-layer abstraction → Redpanda. "Apache Kafka was explicitly deferred in the roadmap because the system was still small."

  • sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference — Voyage AI by MongoDB uses Redis as the queue substrate for token-count-based batching on the query side of embedding inference. Each request is enqueued on a Redis list with an attached token_count; model servers run an atomic Lua script that pops items from the list until the total token count reaches the model-and-hardware-specific optimal batch size (~600 tokens for voyage-3 on A100), and sets per-item TTLs in the same single atomic call. Redis's single-threaded script execution guarantees no two model-server workers race on the same items. Canonical wiki instance of patterns/atomic-conditional-batch-claim. Caveat: "the probability of Redis losing data is very low. In the rare case that it does happen, users may receive 503 Service Unavailable errors and can simply retry". Redis was chosen specifically for its atomic-peek-and-claim Lua primitive, trading durability for the batching primitive RabbitMQ / Kafka don't natively offer. Enables Voyage AI's 50% GPU-inference-latency reduction with 3× fewer GPUs on voyage-3-large.

  • sources/2024-12-10-canva-routing-print-orders — Canva Print Routing stores per-destination-region precomputed routing graphs in ElastiCache/Redis. Retrieval takes 6 ms in most regions and 20 ms for the largest; availability is 99.999% (with read replicas). The routing graphs are rebuilt asynchronously from a relational source of truth, so a Redis outage can be recovered from without data loss: the authority lives in the relational store.
  • sources/2026-04-21-figma-figcache-next-generation-data-caching-platform — Figma FigCache fronts a fleet of ElastiCache Redis clusters with an in-house RESP-wire-protocol proxy. Context: at Figma scale, Redis evolved from a non-critical component into a critical-path dependency and its connection limits became load-bearing. Rapid client-fleet scale-ups triggered thundering herds of new connections that bottlenecked Redis I/O and degraded availability. Also: Redis Cluster's CROSSSLOT error on multi-key pipelines across hash slots is an application-visible footgun; FigCache's fanout engine transparently resolves read-only cases as parallel scatter-gather. Post-FigCache rollout, connection counts on Redis clusters dropped by an order of magnitude across the board and became much less volatile despite unchanged diurnal traffic patterns; node failovers / cluster scaling / transient connectivity errors were downgraded from high-sev incidents to zero-downtime background events. Shard failovers now run liberally and frequently across Figma's entire Redis footprint as live resiliency exercises.
  • sources/2023-07-16-highscalability-gossip-protocol-explained — Redis Cluster named as a canonical production gossip-protocol deployment: "Redis cluster uses the gossip protocol to propagate the node metadata." Redis Cluster's cluster bus is the gossip channel: each node periodically exchanges pings/pongs with random peers carrying slot-ownership, epoch, and failure-state information. Third-party explainer-level citation; useful as the definitional pointer to Redis Cluster's distributed-membership layer for readers coming from the gossip-protocol concept page.
  • sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution — Fork-aware reference: Lyft's Feature Store names its write-through LRU cache as ValKey, not Redis. Relevant to the Redis page because (a) the 2024 Redis Ltd. license change drove the industry fork, and (b) production usage patterns carry over one-to-one between Redis and ValKey at the protocol / data-structure level. Lyft is a data point on the Redis-API surface becoming the stable primitive while the underlying implementation bifurcates.
  • sources/2025-07-08-planetscale-caching — Ben Dicken names Redis (alongside Memcached and CloudFront) as the canonical in-memory recent-content cache sitting in front of slow object storage like S3: "These websites store much of their content (email, images, documents, videos, etc) in 'slow' storage (like Amazon S3 or similar), but cache recent content in faster, in-memory stores (like CloudFront, Redis or Memcached)." Canonical wiki positioning of Redis as the application-tier recency-bias cache, complementary to the per-company production deployments covered by the other Seen-in entries (Figma FigCache, Voyage AI batching, Canva print routing, Lyft Feature Store).
  • sources/2026-04-28-expedia-expedias-service-telemetry-analyzer — Expedia STAR uses Redis as both Celery broker and result backend for its async RCA workflow queue: "we moved to Celery with Redis acting as the broker and result backend to store the state and results of tasks." Canonical wiki instance of Redis as the request-response-async queue substrate (as opposed to a streaming platform like Kafka; STAR explicitly rejects Kafka because the traffic shape is request-response). Extends Redis's catalogued roles on the wiki (cache, recency-bias store, atomic-claim batching, gossip-based cluster membership) with the Celery broker + result backend role for LLM-era async task pipelines.
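
The atomic conditional batch claim described in the Voyage AI entry above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not Voyage AI's actual script: the JSON item encoding, the names `CLAIM_BATCH_LUA`, `claim_batch_model`, and `claim_batch`, and the stopping rule (stop before exceeding the budget, but always take at least one item) are all assumptions, and the per-item TTL step from the source is omitted. `claim_batch_model` is a pure-Python mirror of the Lua logic so the behaviour can be checked without a server.

```python
import json
from collections import deque
from typing import Deque, List

# Lua body executed atomically by Redis.
# KEYS[1] = queue list, ARGV[1] = token budget for one batch.
CLAIM_BATCH_LUA = """
local claimed, total = {}, 0
local budget = tonumber(ARGV[1])
while true do
  local head = redis.call('LINDEX', KEYS[1], 0)   -- peek without removing
  if not head then break end                      -- queue drained
  local tokens = cjson.decode(head)['token_count']
  if #claimed > 0 and total + tokens > budget then break end
  redis.call('LPOP', KEYS[1])                     -- claim the peeked item
  claimed[#claimed + 1] = head
  total = total + tokens
end
return claimed
"""


def claim_batch_model(queue: Deque[str], budget: int) -> List[str]:
    """Pure-Python model of the script's peek-then-claim loop."""
    claimed: List[str] = []
    total = 0
    while queue:
        tokens = int(json.loads(queue[0])["token_count"])  # peek at the head
        if claimed and total + tokens > budget:
            break  # the next item would overshoot the budget
        claimed.append(queue.popleft())
        total += tokens
    return claimed


def claim_batch(client, queue_key: str, budget: int) -> List[str]:
    """Run the Lua script via a redis-py style client (needs a live server)."""
    script = client.register_script(CLAIM_BATCH_LUA)
    return script(keys=[queue_key], args=[budget])
```

Because the whole loop runs inside one script invocation, no other worker's commands can interleave between the peek (`LINDEX`) and the claim (`LPOP`), which is the property the source relies on.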

Failure modes at scale (Figma FigCache retrospective)

The FigCache rearchitecture documents Redis failure modes worth naming:

  • Connection-volume saturation. Even before reaching Redis's hard connection limit, growing fleet-wide connection counts degrade I/O throughput and increase tail latency.
  • Thundering-herd on scale-up. Elastic client fleets open many new TCP+TLS connections simultaneously; the handshake burst bottlenecks Redis for existing clients.
  • Client-ecosystem fragmentation. Different client libraries have inconsistent Redis Cluster awareness, retry/timeout behavior, and observability, making fleet-wide guarantees about client-side state correctness during failovers impossible.

The canonical remedy is a stateless proxy tier in front of Redis that performs concepts/connection-multiplexing. See systems/figcache.
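
For the CROSSSLOT footgun noted in the Figma entry, a proxy's scatter-gather step needs to know which hash slot each key maps to. Redis Cluster assigns a key to one of 16384 slots using CRC16 of the key (or of its hash tag, the substring between the first `{` and the next `}` when that substring is non-empty). The sketch below groups keys by slot so that each group could be sent as its own multi-key read; `key_slot` and `group_by_slot` are hypothetical helper names, and this is not FigCache's implementation.

```python
import binascii
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Hash slot for a key: CRC16 (XMODEM variant) of the key or its hash tag."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # non-empty tag between the first '{' and next '}'
            data = data[start + 1:end]
    # binascii.crc_hqx with initial value 0 is the CRC16 variant the cluster uses
    return binascii.crc_hqx(data, 0) % NUM_SLOTS


def group_by_slot(keys: Iterable[str]) -> Dict[int, List[str]]:
    """Partition keys so every group lives in a single hash slot."""
    groups = defaultdict(list)
    for key in keys:
        groups[key_slot(key)].append(key)
    return dict(groups)
```

A fanout layer could issue one read per group in parallel and merge the replies in the caller's original key order. Keys sharing a hash tag, such as `{user1}:profile` and `{user1}:settings`, land in the same group and need no fanout.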
