CONCEPT Cited by 1 source

In-memory tenant state

Definition

In-memory tenant state names the service property where tenant-specific data is loaded into RAM at process startup and served from memory, rather than fetched from a backing store on every request. The tenant's dataset lives in the service's heap; request handling is a read (or update) against in-memory structures.
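A minimal sketch of the property, assuming a hypothetical `TenantStateService` class and a plain dictionary standing in for the backing store (neither comes from the source): the tenant's dataset is copied into the heap once at startup, and request handlers only touch in-memory structures afterwards.

```python
from typing import Dict, Mapping, Optional


class TenantStateService:
    """Holds one tenant's dataset in process memory (hypothetical sketch)."""

    def __init__(self, tenant_id: str, backing_store: Mapping[str, Mapping[str, str]]):
        self.tenant_id = tenant_id
        # Startup: copy the tenant's whole dataset from the backing store into the heap.
        self._state: Dict[str, str] = dict(backing_store[tenant_id])

    def handle_read(self, key: str) -> Optional[str]:
        # Hot path: a dictionary lookup, with no call to the backing store.
        return self._state.get(key)

    def handle_update(self, key: str, value: str) -> None:
        # Updates mutate the in-memory copy; persisting them is out of scope here.
        self._state[key] = value


# Stand-in for a database or remote cache (assumption for illustration).
backing_store = {"tenant-a": {"campaign:1": "active"}}
service = TenantStateService("tenant-a", backing_store)
```

After construction, `service.handle_read("campaign:1")` is answered from the heap; the backing store is not consulted again in this sketch.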

Canonical wiki instance: AWS Architecture Blog's 2026-05-12 ad-serving platform (Source: sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services). Verbatim:

"Our ad-serving platform is a stateful service that loads and maintains data in memory for each tenant rather than fetching it from a database on every request. This in-memory state improves performance but creates the noisy neighbor problem when tenants share infrastructure."

Why services choose this design

Three primary reasons surface across the wiki's stateful-service canon:

  1. Request latency. Database or remote-cache lookups on the hot path typically add on the order of 1–10 ms per request. For services with p99 SLAs in the low-millisecond range (ad auctions, real-time personalisation, low-latency ranking), in-memory lookups are often the only practical way to stay within budget.
  2. Query shape complexity. Some queries (nearest-neighbour search over feature vectors, graph traversal, per-tenant model inference) don't map cleanly to database primitives; the service needs the data in a specific in-memory data structure (KD-tree, adjacency list, model weights, bloom filter).
  3. Throughput. Reads from the local heap can sustain millions of operations per second per core; reads from a remote store pay network round-trip and serialisation costs that cap throughput orders of magnitude lower.

In the ad-serving canonical instance: "Our infrastructure handles millions of requests per second" — a throughput tier where in-memory state is structural, not optional.
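The latency and throughput reasons above are linked by simple arithmetic: a worker that waits for each lookup before issuing the next can complete at most 1 / latency lookups per second. The sketch below uses the 1 ms lower end of the remote-lookup range quoted above; the 100 ns in-memory figure is an assumption chosen for illustration, not a measurement from the source.

```python
def max_sequential_lookups_per_second(latency_seconds: float) -> float:
    """Upper bound on lookups one worker can complete back to back."""
    return 1.0 / latency_seconds


# 1 ms remote round trip (lower end of the 1-10 ms range above).
remote = max_sequential_lookups_per_second(1e-3)
# Assumed 100 ns for an in-memory lookup (illustrative only).
local = max_sequential_lookups_per_second(100e-9)
# How many times more lookups the in-memory path can sustain per worker.
ratio = local / remote
```

Under these assumptions the remote path tops out near a thousand sequential lookups per second per worker, and the in-memory path is roughly four orders of magnitude higher. Concurrency and batching narrow the gap in practice but do not remove the per-lookup round trip.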

The load-bearing consequence: heap becomes the tenant boundary

When a service keeps tenant state in memory, the process heap is the finest-granularity tenant boundary available. Sharing a heap across tenants means sharing:

  • Allocation pressure. One tenant's growth triggers GC that pauses every other tenant's request handlers (Java), or fragments the allocator (C++), or increases the GC mark work that allocating goroutines share through mark assists (Go, whose collector does not compact).
  • OOM exposure. One tenant's dataset exceeding available memory triggers process-level OOM that kills every other tenant's active requests.
  • Cache effectiveness. The OS page cache and the CPU caches are shared; one tenant's working set evicts another tenant's working set.
  • Allocator contention. Contention on allocator and arena locks, and the number of GC roots a concurrent collector must scan, grow with the number of active tenants in a single process.
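The OOM-exposure point can be made concrete with a toy model. The function names, dataset sizes, and limits below are hypothetical; the sketch only compares which tenants lose their state when memory runs out in a shared process versus in one process per tenant.

```python
from typing import Dict, List

GIB = 1024 ** 3


def tenants_lost_shared(dataset_bytes: Dict[str, int], heap_limit_bytes: int) -> List[str]:
    """All tenants share one process: exceeding the limit kills every tenant's state."""
    if sum(dataset_bytes.values()) > heap_limit_bytes:
        return sorted(dataset_bytes)
    return []


def tenants_lost_isolated(dataset_bytes: Dict[str, int], per_tenant_limit_bytes: int) -> List[str]:
    """One process per tenant: only tenants over their own limit are killed."""
    return sorted(t for t, size in dataset_bytes.items() if size > per_tenant_limit_bytes)


# Hypothetical dataset sizes: tenant "c" has grown far beyond the others.
datasets = {"a": 2 * GIB, "b": 2 * GIB, "c": 9 * GIB}

shared_loss = tenants_lost_shared(datasets, heap_limit_bytes=12 * GIB)
isolated_loss = tenants_lost_isolated(datasets, per_tenant_limit_bytes=4 * GIB)
```

Both layouts have 12 GiB in total, yet the shared process loses all three tenants while the isolated layout loses only the tenant that outgrew its own limit.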

The canonical post makes this explicit:

"When two tenants share a cluster, their in-memory data competes for the same heap. A tenant with a large dataset can trigger out-of-memory conditions that affect its neighbors. This made shared-task and shared-cluster approaches challenging for our stateful workloads."

What this forces at the isolation layer

Services with in-memory tenant state cannot safely share a process, a JVM, or an EC2 instance's memory across tenants. This forces the isolation boundary to be at least at the cluster level — each tenant gets its own scheduler-managed cluster running on its own compute nodes. See concepts/cluster-level-tenant-isolation.

Task-per-tenant (multiple tasks on shared nodes) is still unsafe because multiple tenants' processes can be co-located on the same EC2 instance and compete for page cache and OS memory. Only cluster-per-tenant with dedicated node pools (or stronger: account-per-tenant) structurally eliminates the shared-memory blast radius.
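The difference between the two grains can be checked mechanically. The sketch below assumes a hypothetical list of (tenant, node) placements, which a real deployment would obtain from its scheduler; it reports every node that hosts more than one tenant, which is exactly the co-location that task-per-tenant permits and dedicated node pools rule out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def nodes_shared_across_tenants(placements: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each node hosting more than one tenant to the set of tenants on it."""
    tenants_by_node: Dict[str, Set[str]] = defaultdict(set)
    for tenant, node in placements:
        tenants_by_node[node].add(tenant)
    return {node: tenants for node, tenants in tenants_by_node.items() if len(tenants) > 1}


# Task-per-tenant on shared nodes: tenants "a" and "b" land on the same instance.
task_per_tenant = [("a", "node-1"), ("b", "node-1"), ("c", "node-2")]
# Cluster-per-tenant with dedicated node pools: no node is shared.
cluster_per_tenant = [("a", "node-1"), ("a", "node-2"), ("b", "node-3"), ("c", "node-4")]

shared = nodes_shared_across_tenants(task_per_tenant)
dedicated = nodes_shared_across_tenants(cluster_per_tenant)
```

An empty result is the structural guarantee the section describes; any non-empty result names the nodes where one tenant's memory growth can affect another.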

Contrast with stateless services

Stateless services hold no tenant-specific data in memory between requests; every request fetches its working set from a backing store. The heap contains only the data of in-flight requests plus shared caches. For stateless services:

  • Tenants can share a JVM safely; GC pressure is request-local, not tenant-persistent.
  • Heap size is bounded by concurrent request count, not by tenant count or per-tenant dataset size.
  • OOM is a function of traffic shape, not tenant composition.
  • Row-level or JWT-claim-level tenant isolation is sufficient.

The structural gap between stateful and stateless services at the isolation-grain level is that the isolation boundary has to move one level coarser when state is held in memory.
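The heap-sizing contrast in the list above can be written as two rough estimates. Both functions are simplified models with hypothetical parameter names: they ignore runtime overhead and fragmentation and are meant only to show which inputs each estimate depends on.

```python
from typing import Iterable


def stateless_heap_bytes(concurrent_requests: int, bytes_per_request: int,
                         shared_cache_bytes: int = 0) -> int:
    """Heap scales with in-flight requests; tenant count does not appear."""
    return concurrent_requests * bytes_per_request + shared_cache_bytes


def stateful_heap_bytes(tenant_dataset_bytes: Iterable[int], concurrent_requests: int,
                        bytes_per_request: int) -> int:
    """Heap scales with the resident tenant datasets plus in-flight requests."""
    return sum(tenant_dataset_bytes) + concurrent_requests * bytes_per_request


MIB = 1024 ** 2
# Same traffic in both cases: 100 concurrent requests of 1 MiB each (illustrative).
stateless = stateless_heap_bytes(100, 1 * MIB)
two_tenants = stateful_heap_bytes([500 * MIB, 500 * MIB], 100, 1 * MIB)
three_tenants = stateful_heap_bytes([500 * MIB, 500 * MIB, 4000 * MIB], 100, 1 * MIB)
```

Adding a third tenant leaves the stateless estimate unchanged but raises the stateful estimate by that tenant's full dataset size, which is why tenant composition, not traffic, drives OOM risk in the stateful case.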

Known patterns that use the property

  • Meta TAO — graph data cached in memory for a social-graph shard; per-shard fate-sharing rather than per-tenant.
  • systems/spann / systems/hnsw / vector search — per-customer vector indexes held in RAM for low-latency ANN lookup.
  • Per-tenant ML model serving — model weights loaded into GPU VRAM or CPU RAM at process startup; the process becomes fate-shared with the model's tenant.
  • Ad-serving — the canonical wiki instance; per-tenant campaign metadata, targeting predicates, and bidding state held in memory.
  • Rate limiters / per-tenant quotas — per-tenant counters in process memory; eviction triggers re-sync with a backing store.

What the property doesn't require

  • Durability. In-memory tenant state is typically backed by a persistent store (the canonical post mentions a "shared remote cache"); the in-memory copy is a read-through cache or a warm-started snapshot.
  • Strong consistency. It is expensive to maintain across multiple stateful processes; most in-memory-state services accept eventual consistency with a bounded staleness window.
  • All data in memory. Many in-memory-state services keep only the hot path in memory; cold keys fall back to the backing store.
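The three relaxations above often appear together as a read-through cache with a staleness bound. The sketch below is one possible shape, not the design from the canonical post: `ReadThroughCache`, its `fetch` callback, and the injectable `clock` are all hypothetical names. Hot keys are served from memory, misses and expired entries fall back to the backing store, and a cached value is never served once it is older than `max_staleness_seconds`.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """In-memory copy of hot keys with a bounded staleness window (sketch)."""

    def __init__(self, fetch: Callable[[str], str], max_staleness_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch                    # reads from the durable backing store
        self._max_staleness = max_staleness_seconds
        self._clock = clock                    # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self._max_staleness:
            return entry[1]                    # fresh enough: serve from memory
        value = self._fetch(key)               # miss or expired: go to the store
        self._entries[key] = (now, value)
        return value


# Hypothetical backing store and a cache that tolerates 5 seconds of staleness.
store = {"campaign:1": "active"}
cache = ReadThroughCache(fetch=lambda key: store[key], max_staleness_seconds=5.0)
```

A write to `store` becomes visible through `cache.get` only after the cached entry ages past the window, which is the eventual-consistency trade-off the list describes. The sketch never evicts entries; a production cache would also bound its size.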

Anti-patterns

  • Loading all tenants' data into one shared process for performance — works until the first tenant's dataset blows up memory. The canonical post describes this as "challenging" even at 18 tenants; at scale it becomes structurally impossible.
  • Relying on OS memory limits (cgroups) alone for isolation — cgroups bound a process's total RAM, but a process-level OOM still kills the whole process including other tenants' state.
  • Treating in-memory state as a pure optimisation — once the service depends on in-memory latency, it's not easily reversible to a fetch-on-request architecture without SLA renegotiation.
  • Per-tenant heap limits inside a shared JVM — heap limits are per-process, not per-tenant; a per-tenant limit requires a per-tenant process, which is exactly cluster-level isolation.

Seen in

  • sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services — canonical wiki instance. Ad-serving platform's in-memory tenant state is the explicit driver of the shift from account-per-tenant cellular architecture to cluster-per-tenant in shared accounts. The post's "Why we needed dedicated compute" section directly names the property as the reason shared-task and shared-cluster grains were rejected.