GOOGLE

Optimizing cloud economics with linear elastic caching¶

Summary¶

Google Research describes linear elastic caching — a cost-aware caching strategy deployed in Spanner that replaces fixed-size caches with TTL-based admission/eviction guided by the ski-rental problem from online algorithms theory. Instead of evicting based on recency (LRU) or frequency, the system assigns per-page TTLs predicted by a lightweight decision tree that considers page size, miss cost, and operation type. Production deployment across Spanner's infrastructure reduced memory usage by 15.5% with only a 5.5% increase in cache misses and ~5% reduction in total cost of ownership (TCO), with the additional misses concentrated on cheap-to-fetch pages (0.5% I/O cost impact).

Key Takeaways¶

Elastic caching reframes the problem: rather than "what to evict from a full cache," the question becomes "how long should this page stay?" — turning cache sizing into a per-page TTL assignment problem.
Ski-rental problem as theoretical foundation: the decision "should I keep paying rent (memory cost per time unit) for this cached page, or let it go (pay the miss cost next time)?" maps directly to the classical ski-rental problem, which has a known optimal competitive ratio of 2 (the breakeven strategy).
Cost-awareness concentrates misses on cheap pages: because the TTL prediction accounts for miss cost (the expense of re-fetching from storage), the small increase in cache misses is not uniformly distributed — it's concentrated on data that is cheap to fetch, yielding negligible I/O cost increase (0.5%) despite 5.5% more misses.
Lightweight ML model for billion-RPS systems: Spanner handles billions of requests per second, so the TTL prediction model must be ultra-lightweight. A shallow decision tree (translatable to a few lines of C++) was chosen over more complex models — interpretability and nanosecond-scale overhead trump marginal accuracy gains.
Production numbers at Spanner scale:
Memory usage: −15.5%
Cache misses: +5.5%
TCO: −~5%
I/O cost impact: +0.5% (negligible)
GDSF as the fixed-cache baseline: the Greedy Dual Size Frequency (GDSF) algorithm — a generalization of LRU that accounts for variable page sizes — serves as the comparison baseline for fixed-cache-size approaches.
Four elastic variants tested: breakeven ski-rental, randomized ski-rental, each with and without ML-learned per-page TTLs. The ML-augmented variants train on the first half of traces and predict best TTLs per page for the second half.
Public trace validation: results are not Spanner-specific — experiments on publicly available CMU cache traces confirm the approach generalizes across workloads.

Systems & Concepts Extracted¶

Systems: systems/cloud-spanner (primary deployment target)
Concepts: concepts/elastic-caching, concepts/ski-rental-problem, concepts/cost-aware-caching, concepts/competitive-ratio, concepts/gdsf-eviction, concepts/ttl-based-cache-sizing
Patterns: patterns/ml-predicted-ttl, patterns/linear-elastic-cache-sizing

Operational Numbers¶

Metric	Change	Notes
Memory usage	−15.5%	vs fixed-size cache
Cache misses	+5.5%	concentrated on low-cost pages
TCO	−~5%	memory savings > miss penalty
I/O cost	+0.5%	cost-awareness shields real I/O
Request rate	billions/sec	Spanner production scale
Model	shallow decision tree	few lines of C++, interpretable

Caveats¶

The article appears to be the "testing & results" section of a larger piece; theoretical derivations are referenced but not reproduced here.
Public-trace experiments couldn't use application-level features (no decision tree), so the ML-augmented results are only fully validated in the Spanner production context.
The 5% TCO reduction is "approximately" — exact number may vary by workload mix.

Source¶

concepts/lru-cache-eviction — GDSF generalizes LRU; elastic caching replaces eviction with TTL
concepts/cache-ttl-staleness-dilemma — elastic caching addresses the "how long?" question with cost-optimal TTLs
concepts/competitive-ratio — ski-rental provides the 2-competitive bound that underlies breakeven caching
systems/cloud-spanner — production deployment target