
CONCEPT Cited by 1 source

Cache TTL staleness dilemma

What it is

The cache TTL staleness dilemma is the forced either/or that TTL-based caching creates for rapidly changing data: every operator must pick one side of an uncomfortable trade-off, and neither side is safe at multi-tenant scale.

  1. Long TTL — good hit rate, low load on the upstream data source, but stale data is served for up to the TTL duration. For tenant-scoped metadata (feature flags, access boundaries, rate-limits, configuration), stale data can mean wrong tenant context or wrong authorization decision — a correctness bug, not a performance bug.
  2. Short TTL (aggressive invalidation) — fresh data, but load amplifies on the upstream source in proportion to traffic, and cache-miss storms become a fleet-wide liveness risk (concepts/thundering-herd).

At single-tenant scale, operators pick a TTL that hides the problem. At hundreds to thousands of tenants — each with independently changing config — neither side of the trade-off is acceptable: long TTL means many tenants see wrong data, short TTL saturates the metadata service.
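The long-TTL side of the dilemma in miniature, as a minimal sketch — `TTLCache`, the loader, and the flag values are all invented for illustration:

```python
import time

class TTLCache:
    """Minimal TTL cache: an illustrative sketch, not production code."""
    def __init__(self, ttl_seconds, fetch):
        self.ttl = ttl_seconds
        self.fetch = fetch          # upstream loader (hypothetical)
        self.store = {}             # key -> (value, expires_at)
        self.upstream_calls = 0

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(key)
        if entry and now < entry[1]:
            return entry[0]         # may be stale for up to `ttl` seconds
        self.upstream_calls += 1    # every miss loads the upstream source
        value = self.fetch(key)
        self.store[key] = (value, now + self.ttl)
        return value

# Upstream flips a flag; a 5-minute TTL keeps serving the old value.
source = {"tenant-a": "flag=off"}
cache = TTLCache(ttl_seconds=300, fetch=source.get)

cache.get("tenant-a", now=0.0)            # 'flag=off' cached
source["tenant-a"] = "flag=on"            # tenant-admin changes the flag
stale = cache.get("tenant-a", now=120.0)  # still 'flag=off' — wrong for minutes
```

Shortening `ttl_seconds` shrinks the staleness window but multiplies `upstream_calls` — the other horn of the dilemma.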

Named explicitly in the AWS multi-tenant-config post (Source: sources/2026-04-08-aws-build-a-multi-tenant-configuration-system-with-tagged-storage-patterns): "Traditional caching strategies force an uncomfortable trade-off: either accept stale tenant context (risking incorrect data isolation or feature flags), or implement aggressive cache invalidation that sacrifices performance and increases load on your metadata service."

Why it bites multi-tenant configuration specifically

  • Correctness-critical data. Tenant metadata drives authorization decisions — a stale feature flag or tenant boundary means the service misroutes or mis-authorizes. Unlike stale product-catalog data (annoying), stale tenant-isolation data is a security incident.
  • Fat tail of rarely-used tenants. Long-TTL caches evict entries for cold tenants; the occasional request for a cold tenant takes the full upstream-fetch path. Short TTLs hit every tenant.
  • Coordination with writes. The tenant-admin changing a feature flag expects immediate effect. Staleness windows measured in minutes are a visible UX bug.
  • Compound load on the metadata service. There is one upstream tenant-metadata service, and every tenant's short-TTL cache refresh hits it. Metadata-service load grows O(tenants × services × refresh_rate).
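A back-of-envelope for that last point — all numbers below are illustrative assumptions, not figures from the source:

```python
# Metadata-service load grows O(tenants × services × refresh_rate).
tenants = 1_000       # independently configured tenants (assumed)
services = 50         # services each keeping a per-tenant cache (assumed)
ttl_seconds = 30      # short TTL chosen for freshness (assumed)

# Worst case: every (tenant, service) cache entry refreshes once per TTL.
refreshes_per_second = tenants * services / ttl_seconds  # ~1,667 req/s
```

Modest-looking per-cache settings compound into sustained four-figure request rates against a single metadata service.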

Resolutions (escape from the dilemma)

The standard escapes all move off the time-based invalidation axis:

  1. patterns/event-driven-config-refresh — reactive invalidation: data source emits a change event, a compute component receives it and pushes the fresh value to live caches. No staleness window beyond the event-delivery + push-propagation latency (seconds). No load amplification when nothing changes.
  2. patterns/stateless-invalidator (Figma LiveGraph) — same shape at a much finer grain: WAL-tail of the source DB → per-cache-replica invalidation over a pub/sub channel. The cache is correct to within a few seconds of the commit without polling.
  3. concepts/push-based-invalidation — the concept-level framing both of the above realize.
  4. Materialized per-service config embedded at deploy time — for strictly static config, skip the cache entirely; ship the value with the service binary. Trades flexibility for determinism; not viable for tenant data that changes faster than deploy cadence.
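The shape of resolution 1 — the source emits a change event and a component pushes fresh values into live caches — can be sketched with an in-process stand-in for the event channel. `ChangeBus` and `PushedCache` are invented names; a real system would use SNS, Kafka, or similar:

```python
class ChangeBus:
    """Toy in-process pub/sub standing in for the change-event channel."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, key, value):
        for handler in self.subscribers:
            handler(key, value)   # push the fresh value to every live cache

class PushedCache:
    """Cache replica kept fresh by pushed change events — no TTL polling."""
    def __init__(self, bus):
        self.store = {}
        bus.subscribe(self._on_change)

    def _on_change(self, key, value):
        self.store[key] = value   # fresh value arrives within event latency

    def get(self, key):
        return self.store.get(key)

bus = ChangeBus()
replicas = [PushedCache(bus) for _ in range(3)]

bus.publish("tenant-a", "flag=off")  # initial config propagated
bus.publish("tenant-a", "flag=on")   # admin change: all replicas fresh in one hop
```

When nothing changes, nothing is published — no load amplification, which is exactly the property the TTL axis cannot provide.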

Failure modes

  • Cache stampede on TTL expiry (thundering herd) — N cache replicas expire the same entry at the same time, all miss simultaneously, and all stampede the upstream. Mitigated by jittered TTLs, probabilistic early refresh, or single-flight coalescing — none of which eliminate the underlying trade-off.
  • Divergence across service instances — instance A has fresh data, instance B still has stale data, and the caller sees the coin flip. TTL-based caches guarantee this window whenever data changes more often than once per TTL.
  • Load-amplified blackouts — a bursty metadata-write pattern can saturate a short-TTL cache's fresh-read path and collapse the upstream service, cascading to everybody.
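Two of the stampede mitigations named above — jittered TTLs and single-flight coalescing — sketched minimally (`jittered_ttl` and `SingleFlight` are invented names; probabilistic early refresh is omitted for brevity):

```python
import random
import threading

def jittered_ttl(base_ttl, jitter_fraction=0.2, rng=random.random):
    """Spread expiries across ±jitter_fraction so N replicas holding the
    same entry don't all miss at the same instant."""
    return base_ttl * (1.0 + jitter_fraction * (2.0 * rng() - 1.0))

class SingleFlight:
    """Coalesce concurrent misses for one key into a single upstream fetch;
    followers block until the leader's result is available."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> (done_event, result_box)

    def do(self, key, fetch):
        with self._lock:
            if key in self._calls:
                done, box = self._calls[key]      # follower: join in-flight call
                leader = False
            else:
                done, box = threading.Event(), {}
                self._calls[key] = (done, box)
                leader = True
        if leader:
            box["value"] = fetch()                # exactly one upstream hit
            with self._lock:
                del self._calls[key]
            done.set()
        else:
            done.wait()                           # piggyback on the leader
        return box["value"]
```

Both techniques soften the miss storm, but neither changes the freshness/load trade-off itself: the entry still goes stale and still needs refetching.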

Caveats

  • The dilemma is about the TTL-based cache shape specifically. Event-driven / push-invalidated caches dodge it structurally.
  • Application-level fallback TTL is still useful as belt-and-suspenders against silent push-path failures. Pair a long fallback TTL (e.g., 1 hour) with an event-driven refresh that normally updates in seconds.
  • Not every data class is correctness-critical. Product-catalog items happily tolerate minute-scale staleness; the dilemma is most acute for tenant-metadata, authorization-attribute, feature-flag, and rate-limit data.
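The belt-and-suspenders pairing from the caveats, sketched — `PushWithFallback` is an invented name, and the 1-hour constant mirrors the example above:

```python
class PushWithFallback:
    """Push-invalidated cache with a long fallback TTL as a backstop
    against silent push-path failures. Illustrative sketch only."""
    FALLBACK_TTL = 3600.0  # seconds; normal freshness comes from pushes

    def __init__(self, fetch):
        self.fetch = fetch     # upstream loader, used only as the backstop
        self.store = {}        # key -> (value, last_refreshed_at)

    def on_push(self, key, value, now):
        # Normal path: a change event arrives within seconds of the write.
        self.store[key] = (value, now)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry and now - entry[1] < self.FALLBACK_TTL:
            return entry[0]
        # Backstop path: no push seen for an hour — re-fetch defensively.
        value = self.fetch(key)
        self.store[key] = (value, now)
        return value
```

If the push path drops an event, staleness is bounded by the fallback TTL instead of lasting forever; when pushes flow normally, the fallback never fires and adds no upstream load.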

Seen in
