
Pinterest memcached fleet

Overview

Pinterest's production memcached fleet circa 2022: over 5,000 EC2 instances across heterogeneous instance types, serving ~180 million requests/second at ~220 GB/s network throughput over a ~460 TB active dataset partitioned into ~70 distinct clusters — the backbone of Pinterest's read path. Source: Pinterest Engineering — Improving Distributed Caching Performance and Efficiency, summarized in sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022.

Scale snapshot (2022)

Metric               Value
Instances            5,000+ EC2
Requests / second    ~180M
Network throughput   ~220 GB/s
Active dataset       ~460 TB
Clusters             ~70
Workload mix         ~50% compute-bound, ~50% memory/storage-bound
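
Back-of-envelope, from the table's own numbers: ~180M req/s over 5,000+ instances is roughly 36K req/s per instance; ~220 GB/s over 180M req/s implies an average payload on the order of ~1.2 KB per request; and ~460 TB over 5,000 instances works out to roughly 92 GB of active data per instance.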

Key efficiency wins

1. SCHED_FIFO real-time scheduling

Pinterest ran memcached under SCHED_FIFO at high priority, effectively letting memcached monopolize the CPU by preempting any normally scheduled process. Result:

"client-side P99 latency down by anywhere between 10% and 40%, in addition to eliminating spurious spikes in P99 and P999 latency across the board."

Load-bearing detail: the cache server already sets the latency floor of the read path; giving it scheduler priority over housekeeping daemons and noisy neighbors removes tail-latency jitter.
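
A minimal sketch of the mechanism, assuming Linux; the post does not state which priority value Pinterest chose, so the 50 below is an assumption:

```c
/* Sketch: move the calling process onto the SCHED_FIFO real-time
 * policy so it preempts all normally scheduled (CFS) tasks.
 * Requires CAP_SYS_NICE or root. */
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* SCHED_FIFO priorities on Linux run 1-99; 50 is an assumed value. */
    struct sched_param sp = { .sched_priority = 50 };

    /* pid 0 means "this process". */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    /* From here on, this process runs ahead of any CFS task until it
     * blocks or yields; memcached's worker loop would live here. */
    return 0;
}
```

The same policy can also be applied without code changes via util-linux, e.g. `chrt --fifo 50 memcached ...`.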

2. TCP Fast Open (TFO)

TFO eliminates one round trip from the standard 3-way handshake: a client that holds a TFO cookie from a prior connection can send request data in the SYN itself. On a 180M req/s fleet where the majority of connections are short-lived, that is a material per-connection latency win.
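
A minimal sketch of both halves on Linux; the address, port, and queue depth are illustrative assumptions, and the kernel must also permit TFO (e.g. sysctl net.ipv4.tcp_fastopen=3 for client plus server support):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Server: allow up to qlen pending Fast Open connections; data that
 * arrived in the SYN is readable immediately after accept(). */
int tfo_listen(int port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int qlen = 128; /* assumed queue depth */
    setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_addr.s_addr = htonl(INADDR_ANY),
                                .sin_port = htons(port) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 1024);
    return fd;
}

/* Client: MSG_FASTOPEN does an implicit connect and, once a TFO cookie
 * from an earlier connection is cached, carries the request in the SYN
 * itself -- saving one round trip. */
int tfo_request(const struct sockaddr_in *srv, const char *req, size_t n) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sendto(fd, req, n, MSG_FASTOPEN,
           (const struct sockaddr *)srv, sizeof(*srv));
    /* ... read the response as usual ... */
    return fd;
}
```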

3. extstore for NVMe-backed memcached

Memcached's extstore feature extends the backing store from DRAM into local NVMe flash:

"using extstore to expand storage capacity beyond DRAM into a local NVMe flash disk tier increases per-instance storage efficiency by up to several orders of magnitude, and it reduces the associated cluster cost footprint proportionally."

Hot objects remain in DRAM; cold objects live on NVMe; access pattern determines tier. Particularly valuable for workloads with long-tail cache hit shapes where DRAM-only would force eviction of still-useful items.
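
Operationally, extstore is enabled with memcached's `-o ext_path=<file>:<size>` startup option. The sketch below is not memcached's code, just a minimal C illustration of the design under the stated assumptions: keys and item headers always stay in DRAM, and only a cold item's value is demoted to a flash file, leaving a small stub that one pread(2) can resolve.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* One cache entry. The key and this header always stay in DRAM; only
 * the value migrates between tiers. */
struct item {
    char key[64];
    int on_flash;      /* 0: value held in DRAM, 1: value on NVMe */
    char *dram_value;  /* valid while on_flash == 0 */
    off_t flash_off;   /* valid once on_flash == 1 */
    size_t len;
};

static int flash_fd;     /* append-only data file on the NVMe tier */
static off_t flash_tail; /* next free write offset */

/* Demote a cold item: append its value to flash, keep only the stub. */
static void demote(struct item *it) {
    pwrite(flash_fd, it->dram_value, it->len, flash_tail);
    it->flash_off = flash_tail;
    flash_tail += it->len;
    free(it->dram_value); /* DRAM now holds just the small header */
    it->dram_value = NULL;
    it->on_flash = 1;
}

/* GET path: a DRAM hit is a memcpy; a flash hit costs one pread(2). */
static char *get_value(const struct item *it) {
    char *buf = malloc(it->len);
    if (it->on_flash)
        pread(flash_fd, buf, it->len, it->flash_off);
    else
        memcpy(buf, it->dram_value, it->len);
    return buf;
}

int main(void) {
    flash_fd = open("/tmp/extstore.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
    struct item it = { .key = "pin:123", .len = 12 };
    it.dram_value = strdup("hello flash!");
    demote(&it);              /* item went cold: value moves to NVMe */
    char *v = get_value(&it); /* still found via the DRAM key index */
    free(v);
    close(flash_fd);
    return 0;
}
```

The payoff is the asymmetry the section describes: the expensive resource (DRAM) holds only hot values plus small stubs, while cheap NVMe capacity absorbs the long tail.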

Workload diversity drives instance heterogeneity

  • Compute-bound clusters (~50% of fleet): bound purely by request throughput. Optimization focuses on per-core efficiency; scaling out horizontally addresses throughput bottlenecks.
  • Memory-bound clusters: large working-set, DRAM capacity is the binding constraint.
  • Storage-bound clusters: NVMe IOPS and capacity matter most; extstore is a natural fit.

"While memcached can be arbitrarily horizontally scaled in and out to address a particular cluster's bottleneck, vertically scaling individual hardware dimensions allows for greater cost efficiency for specific workloads."

Evaluation methodology

"Pinterest leverages both synthetic load generation and production shadow traffic to evaluate the impact of tuning and optimizations." Shadow-traffic evaluation ensures the optimization hypotheses translate to production workload shapes rather than benchmark-only wins.
