Optimizing cloud economics with linear elastic caching¶
Summary¶
Google Research describes linear elastic caching — a cost-aware caching strategy deployed in Spanner that replaces fixed-size caches with TTL-based admission/eviction guided by the ski-rental problem from online algorithms theory. Instead of evicting based on recency (LRU) or frequency, the system assigns per-page TTLs predicted by a lightweight decision tree that considers page size, miss cost, and operation type. Production deployment across Spanner's infrastructure reduced memory usage by 15.5% with only a 5.5% increase in cache misses and ~5% reduction in total cost of ownership (TCO), with the additional misses concentrated on cheap-to-fetch pages (0.5% I/O cost impact).
Key Takeaways¶
-
Elastic caching reframes the problem: rather than "what to evict from a full cache," the question becomes "how long should this page stay?" — turning cache sizing into a per-page TTL assignment problem.
-
Ski-rental problem as theoretical foundation: the decision "should I keep paying rent (memory cost per time unit) for this cached page, or let it go (pay the miss cost next time)?" maps directly to the classical ski-rental problem, which has a known optimal competitive ratio of 2 (the breakeven strategy).
-
Cost-awareness concentrates misses on cheap pages: because the TTL prediction accounts for miss cost (the expense of re-fetching from storage), the small increase in cache misses is not uniformly distributed — it's concentrated on data that is cheap to fetch, yielding negligible I/O cost increase (0.5%) despite 5.5% more misses.
-
Lightweight ML model for billion-RPS systems: Spanner handles billions of requests per second, so the TTL prediction model must be ultra-lightweight. A shallow decision tree (translatable to a few lines of C++) was chosen over more complex models — interpretability and nanosecond-scale overhead trump marginal accuracy gains.
-
Production numbers at Spanner scale:
- Memory usage: −15.5%
- Cache misses: +5.5%
- TCO: −~5%
-
I/O cost impact: +0.5% (negligible)
-
GDSF as the fixed-cache baseline: the Greedy Dual Size Frequency (GDSF) algorithm — a generalization of LRU that accounts for variable page sizes — serves as the comparison baseline for fixed-cache-size approaches.
-
Four elastic variants tested: breakeven ski-rental, randomized ski-rental, each with and without ML-learned per-page TTLs. The ML-augmented variants train on the first half of traces and predict best TTLs per page for the second half.
-
Public trace validation: results are not Spanner-specific — experiments on publicly available CMU cache traces confirm the approach generalizes across workloads.
Systems & Concepts Extracted¶
- Systems: systems/cloud-spanner (primary deployment target)
- Concepts: concepts/elastic-caching, concepts/ski-rental-problem, concepts/cost-aware-caching, concepts/competitive-ratio, concepts/gdsf-eviction, concepts/ttl-based-cache-sizing
- Patterns: patterns/ml-predicted-ttl, patterns/linear-elastic-cache-sizing
Operational Numbers¶
| Metric | Change | Notes |
|---|---|---|
| Memory usage | −15.5% | vs fixed-size cache |
| Cache misses | +5.5% | concentrated on low-cost pages |
| TCO | −~5% | memory savings > miss penalty |
| I/O cost | +0.5% | cost-awareness shields real I/O |
| Request rate | billions/sec | Spanner production scale |
| Model | shallow decision tree | few lines of C++, interpretable |
Caveats¶
- The article appears to be the "testing & results" section of a larger piece; theoretical derivations are referenced but not reproduced here.
- Public-trace experiments couldn't use application-level features (no decision tree), so the ML-augmented results are only fully validated in the Spanner production context.
- The 5% TCO reduction is "approximately" — exact number may vary by workload mix.
Source¶
- Original: https://research.google/blog/optimizing-cloud-economics-with-linear-elastic-caching/
- Raw markdown:
raw/google/2026-06-25-optimizing-cloud-economics-with-linear-elastic-caching-4230e545.md
Related¶
- concepts/lru-cache-eviction — GDSF generalizes LRU; elastic caching replaces eviction with TTL
- concepts/cache-ttl-staleness-dilemma — elastic caching addresses the "how long?" question with cost-optimal TTLs
- concepts/competitive-ratio — ski-rental provides the 2-competitive bound that underlies breakeven caching
- systems/cloud-spanner — production deployment target