Meta's AI Storage Blueprint at Scale

Summary

Meta's Data Infrastructure team describes the complete re-architecture of its BLOB-storage stack to serve modern AI workloads. The redesign addresses two primary challenges: maximizing GPU utilization, which requires bounded pMax latency to prevent GPU stalls, and maximizing research velocity, which means letting researchers ingest data once and access it anywhere without manual cross-region data movement. The post discloses that Meta operates hundreds of exabyte-scale storage clusters built on the Tectonic block layer. It also explains why the legacy multi-layer metadata architecture, acceptable for HDD-era web workloads with latency budgets of hundreds of milliseconds, was fundamentally incompatible with AI workloads that demand millisecond-level flash access. The resolution was a ground-up rebuild with three parts: collapsing metadata into a single flat schema on ZippyDB for O(1) path resolution, replacing the dataplane proxy with a fat client SDK that streams directly from Tectonic, and deploying regional BLOB-storage stacks colocated with GPUs. On top of this foundation, distributed caching (reusing Owl subsystem peers on GPU hosts and achieving an 80% cache hit rate) and protocol optimizations (hedged reads, dynamic concurrency control) complete the picture. For research velocity, a tiered-cache architecture with on-demand hydration and prefetch APIs eliminated hours of data ingestion time, shifting the data-loading paradigm from "copy then train" to "train anywhere, hydrate transparently."

Key Takeaways

Storage bottlenecks are a primary contributor to GPU stalls. During synchronized training, a single GPU slowed by a storage fetch delays the entire step across hundreds of thousands of GPUs, so bounded pMax latency (not just p50) is the critical SLO for AI storage (Source: this article, "Why Latency Matters" section).
The legacy BLOB-storage architecture's organic layering was a showstopper. Multiple stateful metadata layers (namelayer, volumeslayer, containerlayer) required several lookups per request, some crossing regions, adding up to hundreds of milliseconds. A single slow lookup was enough to stall the request (Source: "Legacy BLOB-Storage Architecture Wasn't AI-Ready" section).
Unified metadata schema on ZippyDB achieves O(1) lookup — collapsing the metadata spread across different layers into one flat schema backed by ZippyDB enables path-to-storage-address resolution in a single lookup, a step-function improvement over the prior cascading lookups (Source: "Rebuilding the Foundation" section).
Fat client SDK eliminates the data proxy — the new architecture embeds a Tectonic BlockClient directly in the SDK, allowing clients to stream bytes directly from storage servers. This helps with both latency/throughput (no proxy hop) and power efficiency (Source: "Rebuilding the Foundation" section).
The ReadPlan pattern separates metadata from data. A getReadPlan() call resolves the path to (blockId, offset, size) tuples via the API server, and the SDK then fetches the data directly from Tectonic, so the data path adds zero overhead on top of Tectonic (Source: Figure 3 and "Rebuilding the Foundation" section).
Distributed data cache on GPU host spare memory achieves 80% hit rate — leveraging Owl subsystem peers integrated directly into the BLOB-storage client SDK absorbs traffic spikes, solves metadata hot-shard problems, and improves p50/p99 latencies (Source: "Dealing With Spikes and Hot Spots" section).
ReadPlan metadata cache provides 1–2 ms metadata access — frequently accessed BLOB ReadPlans are cached in a distributed-memory store similar to memcache (Source: "Dealing With Spikes and Hot Spots" section).
Hedged reads mitigate tail latency from slow storage nodes. The client SDK applies this well-understood technique to work around laggard nodes (Source: "Protocol Optimizations" section).
Dynamic concurrency control in the client SDK manages egress spikes. During checkpoint events, sharp egress spikes cause congestion, timeouts, and retries; the SDK automatically tunes its parallelism based on application-level congestion signals (Source: "Protocol Optimizations" section).
Tiered cache architecture with on-demand hydration eliminates hours of ingestion — the data-loading paradigm shifts from explicit cross-region snapshot copies to transparent tiered caching: L1 (GPU host memory), L2 (GPU host flash), L3 (regional BLOB-storage flash), with global HDD BLOB-storage as the source of truth. Researchers ingest once and access anywhere (Source: "Maximizing Research Velocity" section, Figure 4).
Deep prefetch API enables background hydration — the SDK exposes an explicit prefetch() API that triggers data hydration from remote storage onto the local region's L3 cache and prewarms metadata cache, hiding cross-region latency (Source: "Maximizing Research Velocity" section).
Automatic data lifecycle with TTL and LRU eviction — data in the L3 regional flash tier is held for a configured period for reuse across training epochs, with capacity/quota-aware eviction policies (Source: "Maximizing Research Velocity" section).
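
The first takeaway, that the tail rather than the median governs synchronized training, can be illustrated with a rough model. This is an assumption made for illustration, not a figure from the article: each GPU independently hits a slow storage fetch with probability `p_slow` per step, and the step stalls if any single GPU does. The function name and the sample probability are hypothetical.

```python
def stall_probability(p_slow: float, num_gpus: int) -> float:
    """Probability that at least one of num_gpus independent fetches is slow.

    Assumes independence between GPUs, which is a simplification.
    """
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be in [0, 1]")
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    # P(no GPU is slow) = (1 - p_slow) ** num_gpus; the step stalls otherwise.
    return 1.0 - (1.0 - p_slow) ** num_gpus


if __name__ == "__main__":
    # Hypothetical per-fetch slow probability of 1 in 10,000.
    for n in (1, 1_000, 100_000):
        print(n, stall_probability(0.0001, n))
```

Under this model, a slow-fetch rate that looks negligible for one GPU makes a stall close to certain at fleet scale, which is why a p50 target alone says little about step time.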

Systems Extracted

Tectonic — Meta's regional, multi-tenant, horizontally scalable block-storage layer. Provides high durability/availability via erasure coding, supports media tiering (HDD + flash), and manages hot/cold/warm data placement. The foundational layer upon which BLOB-storage APIs are built.
ZippyDB — Meta's distributed key-value store used as the backing store for the new unified BLOB-storage metadata schema, enabling O(1) path-to-address resolution.
Owl — Meta's internal content-distribution system whose peer subsystem is reused within the BLOB-storage client SDK to build the distributed data cache on GPU hosts.
Meta BLOB Storage — Meta's global, infinitely scalable object-storage service built on top of Tectonic. Exposes policies for durability/availability trade-offs. This post documents its complete re-architecture for AI workloads.

Concepts Extracted

concepts/gpu-stall-from-storage — the failure mode where storage fetch latency exceeds the GPU's data-processing time, stalling all GPUs in a synchronized training step.
concepts/unified-metadata-schema — the architectural principle of collapsing a multi-layer metadata hierarchy into a single flat schema for O(1) lookups.
concepts/fat-client-sdk — a client library architecture that embeds storage-layer capabilities (block fetching, caching, peer coordination) directly, eliminating intermediate proxy services.
concepts/tiered-cache-as-planetary-memory — modeling distributed storage as analogous to a CPU memory hierarchy (L1/L2/L3/disk) at planet scale, with transparent hydration between layers.
concepts/on-demand-data-hydration — the practice of lazily fetching data into local caches only when accessed, rather than eagerly copying entire datasets before use.
concepts/dynamic-concurrency-control — automatically tuning client parallelism based on real-time congestion signals to prevent egress spikes from causing cascading failures.
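
The article does not disclose how the SDK tunes parallelism. As a minimal sketch of the dynamic-concurrency-control concept, the class below uses additive-increase/multiplicative-decrease (AIMD), a well-known congestion-control rule swapped in here for illustration. The class name, method names, and default parameters are all hypothetical.

```python
class AimdConcurrencyLimit:
    """Hypothetical client-side cap on in-flight requests, tuned by AIMD."""

    def __init__(self, initial: int = 8, minimum: int = 1,
                 maximum: int = 256, backoff: float = 0.5) -> None:
        if not 1 <= minimum <= initial <= maximum:
            raise ValueError("need 1 <= minimum <= initial <= maximum")
        if not 0.0 < backoff < 1.0:
            raise ValueError("backoff must be in (0, 1)")
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.backoff = backoff

    def on_success(self) -> int:
        """Additive increase: allow one more in-flight request."""
        self.limit = min(self.maximum, self.limit + 1)
        return self.limit

    def on_congestion(self) -> int:
        """Multiplicative decrease on a congestion signal (e.g. a timeout)."""
        self.limit = max(self.minimum, int(self.limit * self.backoff))
        return self.limit


if __name__ == "__main__":
    limit = AimdConcurrencyLimit(initial=8)
    print(limit.on_success())     # limit grows by one
    print(limit.on_congestion())  # limit is cut by the backoff factor
```

A caller would check `limit` before issuing each request and report timeouts or retries through `on_congestion()`. Cutting quickly and growing slowly is what lets a burst of clients back off together instead of retrying into the same congestion.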

Patterns Extracted

patterns/fat-client-direct-io — eliminate the dataplane proxy; embed the block-client in the SDK so data streams directly from storage servers to the application.
patterns/readplan-metadata-cache — separate metadata resolution (path → storage address) from data fetch; cache the ReadPlan in distributed memory for 1–2 ms access.
patterns/distributed-data-cache-on-gpu-hosts — leverage spare memory on GPU hosts as a distributed peer-to-peer cache for hot data, absorbing spikes and reducing I/O load on backend storage.
patterns/deep-prefetch-for-on-demand-hydration — expose an explicit prefetch API that triggers background hydration of soon-to-be-needed data from remote into local cache tiers.
patterns/hedged-reads-for-tail-latency — issue redundant read requests to multiple storage nodes; use the first response, mitigating single-node latency outliers.
patterns/dynamic-concurrency-control-for-egress — automatically throttle client-side parallelism based on application-level congestion signals during burst events (e.g., checkpoint loads).

Operational Numbers

Hundreds of exabyte-scale storage clusters serving all Meta products
80% average cache hit rate on the distributed data cache
1–2 ms ReadPlan metadata cache access latency
O(1) lookup per chunk in the new metadata store (down from multiple cross-region lookups adding up to hundreds of ms)
Data ingestion time reduced from hours to minutes after tiered-cache rollout
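
To put the 80% hit rate in context, the arithmetic below shows the share of reads that still reach backend storage and a blended mean latency. Only the 80% figure comes from the article; the hit and miss latencies passed in the example are placeholders, since the article does not disclose them, and the function names are hypothetical.

```python
def backend_read_fraction(hit_rate: float) -> float:
    """Fraction of reads that miss the cache and reach backend storage."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return 1.0 - hit_rate


def mean_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Blended mean latency: hits served at hit_ms, misses at miss_ms."""
    return hit_rate * hit_ms + backend_read_fraction(hit_rate) * miss_ms


if __name__ == "__main__":
    # 0.8 is the reported average hit rate; 1.0 and 10.0 ms are placeholders.
    print(backend_read_fraction(0.8))
    print(mean_latency_ms(0.8, hit_ms=1.0, miss_ms=10.0))
```

At an 80% hit rate the backend sees roughly one read in five, which is the sense in which the cache absorbs spikes.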

Architecture Highlights

Legacy → New Request Flow

Legacy (Figure 2): Client → API Server → namelayer lookup → volumeslayer lookup → containerlayer lookup → resolve (blockId, offset, size) → API Server proxies data from Tectonic → Client. Multiple metadata lookups, some cross-region, hundreds of ms total.

New (Figure 3): Client SDK → getReadPlan() to API Server → O(1) metadata lookup in ZippyDB → returns ReadPlanResult → SDK uses embedded BlockClient to stream directly from Tectonic. Zero dataplane overhead on top of Tectonic.
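
The new request flow can be sketched with in-memory stand-ins. This is a minimal illustration, not the real SDK: `MetadataStore` plays the role of the flat path-to-ReadPlan schema (backed by ZippyDB in the article), `BlockStore` plays the role of Tectonic, and every class, method, and field name below is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BlockRange:
    """One (blockId, offset, size) tuple from a ReadPlan."""
    block_id: str
    offset: int
    size: int


class MetadataStore:
    """Stand-in for the flat schema: one key lookup per path."""

    def __init__(self, plans: Dict[str, List[BlockRange]]) -> None:
        self._plans = plans

    def get_read_plan(self, path: str) -> List[BlockRange]:
        return self._plans[path]  # single lookup, no cascading layers


class BlockStore:
    """Stand-in for the block layer that holds the bytes."""

    def __init__(self, blocks: Dict[str, bytes]) -> None:
        self._blocks = blocks

    def read(self, block_id: str, offset: int, size: int) -> bytes:
        return self._blocks[block_id][offset:offset + size]


class FatClient:
    """Resolves the plan, then reads blocks directly with no data proxy."""

    def __init__(self, metadata: MetadataStore, blocks: BlockStore) -> None:
        self._metadata = metadata
        self._blocks = blocks

    def read(self, path: str) -> bytes:
        plan = self._metadata.get_read_plan(path)
        return b"".join(
            self._blocks.read(r.block_id, r.offset, r.size) for r in plan
        )


if __name__ == "__main__":
    blocks = BlockStore({"b1": b"hello world", "b2": b"!!!"})
    metadata = MetadataStore(
        {"/ds/a": [BlockRange("b1", 0, 5), BlockRange("b2", 0, 1)]}
    )
    print(FatClient(metadata, blocks).read("/ds/a"))
```

The point of the split is that the API server only ever returns small ReadPlan records, while the bulk bytes flow straight from the block layer to the client.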

Tiered Cache Memory Hierarchy (Figure 4)

L1: GPU host memory (distributed peer cache via Owl)
L2: GPU host flash (local SSD)
L3: Regional BLOB-storage fabric (flash-backed)
Source of truth: Global BLOB-storage fabric (HDD-backed)

Three mechanisms move data through these tiers: dataloader prefetch (next batch → L1), the deep prefetch API (data needed in the next few minutes → L3), and automatic lifecycle management (TTL/LRU eviction from L3).
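
A read-through hierarchy with on-demand hydration and a deep prefetch can be sketched as follows. This is a simplified illustration under several assumptions: capacities are counted in entries rather than bytes, eviction is LRU only (TTL and quota handling are omitted), and all class and method names are hypothetical rather than taken from Meta's SDK.

```python
from collections import OrderedDict
from typing import Callable, List, Optional


class LruTier:
    """One cache tier with least-recently-used eviction."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def __contains__(self, key: str) -> bool:
        return key in self._items

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class TieredReader:
    """Reads through tiers ordered fastest first, hydrating on a miss."""

    def __init__(self, tiers: List[LruTier],
                 source_of_truth: Callable[[str], bytes]) -> None:
        self.tiers = tiers
        self.source = source_of_truth

    def read(self, key: str) -> bytes:
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                self._fill(key, value, upto=i)  # promote into faster tiers
                return value
        value = self.source(key)  # miss everywhere: fetch from source of truth
        self._fill(key, value, upto=len(self.tiers))
        return value

    def prefetch(self, key: str) -> None:
        """Deep prefetch: hydrate only the last (regional) tier."""
        last = self.tiers[-1]
        if key not in last:
            last.put(key, self.source(key))

    def _fill(self, key: str, value: bytes, upto: int) -> None:
        for tier in self.tiers[:upto]:
            tier.put(key, value)
```

Calling `prefetch()` ahead of time means the later `read()` finds the data in the regional tier and never pays the remote fetch on the training path.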

Design Tradeoffs Explicitly Named

Legacy assumption	AI-era reality
Modest latency needs	Predictable, bounded pMax all the way to tail
Global-by-default replication	Regional-first; AI workloads need high availability but not global metadata
Optimized for cost per byte (HDD)	IOPS demands necessitate flash; storage cost negligible vs GPU cost
Space-constrained datacenters	Power-constrained datacenters (every kW on storage = kW not on GPUs)

Caveats

No absolute numbers disclosed for total storage capacity, GPU fleet size, or exact latency improvements
ZippyDB schema details not specified
Owl-based cache integration specifics (consistency model, invalidation protocol) not disclosed
No discussion of failure modes in the new architecture (what happens when ZippyDB or the ReadPlan cache is unavailable)
Checkpoint write path not addressed (only read path optimizations discussed)
Future work mentions scaling to network limits and checkpoint-without-stall but provides no architecture

Source

sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — the GPU training substrate this storage serves
sources/2025-03-04-meta-a-case-for-qlc-ssds-in-the-data-center — the flash-media tiering strategy for Meta's storage (QLC as middle tier)
sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta — fleet maintenance for the GPU clusters this storage feeds
sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale — the RoCE network fabric connecting GPUs to this storage