Netflix — Introducing Netflix's Key-Value Data Abstraction Layer
Summary
Netflix introduces the Key-Value (KV) Data Abstraction Layer
(DAL) — the most mature of several abstraction services built on
top of Netflix's Data Gateway
Platform. KV sits as a gRPC service in front of multiple storage
engines (Cassandra,
EVCache, DynamoDB,
RocksDB) and exposes a two-level map data
model — HashMap<String, SortedMap<Bytes, Bytes>> — with four CRUD
APIs (PutItems, GetItems, DeleteItems, plus MutateItems /
ScanItems for complex cases). Developers describe their workload
via a namespace — the unit of logical and physical
configuration — and the platform routes it to the most suitable
backend(s), combining e.g. a Cassandra primary with an EVCache cache
tier. The post catalogs five load-bearing design choices the KV team
made to hit Netflix-scale reliability: (1) client-generated
monotonic idempotency tokens
(generation timestamp + 128-bit nonce) making
hedged and retried writes safe on
last-write-wins stores like Cassandra; clock skew on EC2 Nitro
instances was measured at under 1 ms, and the server rejects writes
with large drift to prevent silent discards and doomstones; (2)
transparent chunking
of items > 1 MiB — only id, key, metadata go in the main
table; the blob is split into chunks in a separately-partitioned
chunk store, all tied atomically by one idempotency token; (3)
client-side payload compression — 75% payload reduction observed
in a Netflix Search deployment by doing it client-side rather than
server-side (saves server CPU, network, disk I/O); (4) byte-size
(not item-count) pagination that enables predictable single-digit-
millisecond SLO on a 2 MiB page read, backed by adaptive
pagination that uses observed item size (from a prior page token +
a per-namespace running estimate) to tune the underlying-store limit
— with an SLO-aware early
response fallback that returns a partial page + token when the
server projects the end-to-end deadline will be missed; (5)
in-band signaling — a periodic handshake where server pushes
target/max latency SLOs so the client can dynamically adjust timeouts
and hedging policy, and each request carries client-capability
signals (compression, chunking). The deletion path is equally
Cassandra-aware: record-level and range deletes produce one
tombstone; item-level deletes use TTL-based deletes with jitter (mark metadata expired,
randomised TTL) so compaction catches up without load spikes, at the
cost of not being an immediate physical delete. Named production
consumers: streaming metadata, user profiles, messaging push
registry (Pushy),
and large-scale impression persistence for real-time analytics
(Bulldozer).
Key takeaways
- KV is the mature exemplar of Netflix's Data Gateway strategy: a DAL service between microservices and backing stores, not a library. "This approach led to the creation of several foundational abstraction services, the most mature of which is our Key-Value (KV) Data Abstraction Layer (DAL)." Developers "just provide their data problem rather than a database solution." The DAL pattern decouples microservices from evolving and sometimes-backward-incompatible native database APIs — a recurring tax the post names explicitly: "the tight coupling with multiple native database APIs — APIs that continually evolve and sometimes introduce backward-incompatible changes — resulted in org-wide engineering efforts to maintain and optimize our microservice's data access." See patterns/data-abstraction-layer and systems/netflix-kv-dal. (Source: sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer)
- Core data model: `HashMap<String, SortedMap<Bytes, Bytes>>`. The first level is a hashed string id (the partition key); the second level is a sorted map of bytes→bytes (the clustering column on Cassandra). This one structure covers HashMaps (id → {"" → value}), named Sets (id → {key → ""}), structured Records, and time-ordered Events. Direct Cassandra DDL given: `CREATE TABLE … (id text, key blob, value blob, value_metadata blob, PRIMARY KEY (id, key)) WITH CLUSTERING ORDER BY (key <ASC|DESC>)`. See concepts/two-level-map-kv-model.
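A minimal in-memory sketch of the two-level map (illustrative only; the real service is a gRPC server, and the inner sort order comes from Cassandra's clustering key rather than a Python sort):

```python
# Illustrative model of KV's two-level map:
# HashMap<String, SortedMap<Bytes, Bytes>>.
# The outer key is the partition id; the inner map is returned in
# key order, mirroring Cassandra's clustering column.
Record = dict[bytes, bytes]           # inner SortedMap<Bytes, Bytes>
store: dict[str, Record] = {}

def put_items(id_: str, items: Record) -> None:
    store.setdefault(id_, {}).update(items)

def get_items(id_: str) -> list[tuple[bytes, bytes]]:
    # Return items in clustering-key order (ASC), as Cassandra would.
    return sorted(store.get(id_, {}).items())

# One structure, several shapes:
put_items("profile:42", {b"": b"plain-value"})        # HashMap: id -> {"" -> value}
put_items("tags:42", {b"scifi": b"", b"drama": b""})  # Set: id -> {key -> ""}
```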
- A namespace is the unit of logical AND physical config — multiple backends per namespace. "A namespace defines where and how data is stored, providing logical and physical separation while abstracting the underlying storage systems. It also serves as central configuration of access patterns such as consistency or latency targets." Example `ngsegment` config: Cassandra cluster (`cassandra_kv_ngsegment`) as `PRIMARY_STORAGE` with `consistency_scope: LOCAL` + `consistency_target: READ_YOUR_WRITES`, plus an EVCache cache tier with a `default_cache_ttl` of `180s` — durable persistence + low-latency point reads in one logical namespace. See patterns/namespace-backed-storage-routing and concepts/database-agnostic-abstraction.
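The `ngsegment` example can be approximated as a config sketch; the field names follow the post's example, but the actual schema is Netflix-internal:

```python
# Hypothetical rendering of the post's `ngsegment` namespace config:
# a durable Cassandra primary plus an EVCache cache tier behind one
# logical namespace.
ngsegment = {
    "namespace": "ngsegment",
    "backends": [
        {
            "type": "PRIMARY_STORAGE",
            "cluster": "cassandra_kv_ngsegment",
            "consistency_scope": "LOCAL",
            "consistency_target": "READ_YOUR_WRITES",
        },
        {
            "type": "CACHE",
            "cluster": "evcache_kv_ngsegment",
            "default_cache_ttl": "180s",
        },
    ],
}

# The platform routes each request to the suitable backend(s):
primary = next(b for b in ngsegment["backends"] if b["type"] == "PRIMARY_STORAGE")
```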
- Four CRUD APIs, with a rich `Predicate`/`Selection` shape for reads. `PutItems` is upsert + idempotency-token-protected; chunked writes are staged then committed with metadata. `GetItems` takes a `Predicate` (`match_all`/`match_keys`/`match_range`) + a `Selection` (`page_size_bytes`, `item_limit`, `include`/`exclude` to trim large values from responses) and returns `(items, next_page_token)`. `DeleteItems` reuses the same `Predicate` — record-level, range, or item-level. Plus `MutateItems` and `ScanItems` for multi-item/multi-record ops (architecture detail deferred to a future post).
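The read shapes can be sketched as request objects; the field names come from the post, while the types and defaults are assumptions (the actual protobuf definitions are not public):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Predicate:
    # Exactly one of these is set in practice.
    match_all: bool = False
    match_keys: Optional[list[bytes]] = None
    match_range: Optional[tuple[bytes, bytes]] = None  # [start, end)

@dataclass
class Selection:
    page_size_bytes: int = 2 * 1024 * 1024  # byte-budgeted paging
    item_limit: Optional[int] = None
    # e.g. exclude "value" to fetch metadata without large payloads
    exclude: list[str] = field(default_factory=list)

# A point read of two keys, trimming values from the response:
req = (Predicate(match_keys=[b"k1", b"k2"]), Selection(exclude=["value"]))
```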
- Client-generated monotonic idempotency tokens make hedged and retried writes safe on last-write-wins stores. "In the Key-Value abstraction, idempotency tokens contain a generation timestamp and random nonce token. Either or both may be required by backing storage engines to de-duplicate mutations." Netflix prefers client-generated tokens over server-generated because of network-delay effects on server-side token ordering. Safety relies on small clock skew: "our tests on EC2 Nitro instances show drift is minimal (under 1 millisecond)." KV rejects writes whose token has large drift — preventing both silent-discard-in-the-past and immutable-doomstone-in-the-future failure modes on stores that are sensitive to future timestamps. When stronger ordering is needed, Zookeeper-issued regional tokens or globally-unique transaction IDs substitute. See concepts/idempotency-token. (Clock-skew doc referenced.)
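A sketch of the token scheme under stated assumptions: the (timestamp, 128-bit nonce) shape is from the post, but the drift-rejection threshold is not disclosed, so the 5 s below is a placeholder:

```python
import secrets
import time

def new_token() -> tuple[int, int]:
    # Client-generated idempotency token: a generation timestamp plus a
    # 128-bit random nonce. Retried and hedged writes reuse the SAME
    # token, so a last-write-wins store deduplicates them safely.
    return (int(time.time() * 1000), secrets.randbits(128))

def server_accepts(token: tuple[int, int], now_ms: int,
                   max_drift_ms: int = 5_000) -> bool:
    # The post says KV rejects writes whose generation timestamp drifts
    # too far from server time (guarding against silent past discards
    # and future "doomstones"); the threshold here is an assumption.
    gen_ms, _nonce = token
    return abs(now_ms - gen_ms) <= max_drift_ms
```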
- Transparent chunking: items > 1 MiB split; only id/key/metadata stay in the main table; chunk store has a different partitioning scheme optimised for large values; one idempotency token binds the writes atomically. "For items smaller than 1 MiB, data is stored directly in the main backing storage … for larger items, only the id, key, and metadata are stored in the primary storage, while the actual data is split into smaller chunks and stored separately in chunk storage." "By splitting large items into chunks, we ensure that latency scales linearly with the size of the data, making the system both predictable and efficient." A detailed architecture post is promised separately. See patterns/transparent-chunking-large-values.
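The chunking path can be sketched as follows; the 1 MiB threshold is from the post, while the 64 KiB chunk size and the dict-based table layouts are assumptions for illustration:

```python
CHUNK_THRESHOLD = 1 * 1024 * 1024   # 1 MiB, per the post
CHUNK_SIZE = 64 * 1024              # chunk size is an assumption; not disclosed

def write_item(main_table: dict, chunk_store: dict,
               id_: str, key: bytes, value: bytes, token) -> None:
    # Items <= 1 MiB are stored directly in the main backing table.
    # Larger items keep only id/key/metadata in the main table; the body
    # is split into chunks in a separately partitioned chunk store, and
    # one idempotency token ties all the writes together atomically.
    if len(value) <= CHUNK_THRESHOLD:
        main_table[(id_, key)] = {"value": value, "token": token}
        return
    chunks = [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)]
    for n, chunk in enumerate(chunks):
        # Chunk store partitions differently, e.g. by (id, key, chunk_no).
        chunk_store[(id_, key, n)] = {"data": chunk, "token": token}
    main_table[(id_, key)] = {
        "metadata": {"num_chunks": len(chunks), "size_bytes": len(value)},
        "token": token,
    }
```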
- Pagination is byte-size-budgeted, with adaptive tuning and SLO-aware early response. Byte-budget pagination gives predictable SLOs: "we can provide a single-digit millisecond SLO on a 2 MiB page read." But Cassandra/DynamoDB paginate by row count, so KV queries with a static limit, processes, and either discards excess (big items) or issues more queries (small items) — read amplification in both directions. Adaptive pagination caches observed item-size in the page token + a per-namespace rolling estimate, feeding back to the initial-page static limit. See concepts/byte-size-pagination and concepts/adaptive-pagination.
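A sketch of byte-budget pagination over a row-count-paginated store, including the adaptive row-limit feedback; the estimator details are assumptions, since the post leaves the tuning loop unspecified:

```python
def plan_row_limit(page_size_bytes: int, est_item_size: int) -> int:
    # The backing stores (Cassandra/DynamoDB) paginate by row count, so
    # KV derives a static row limit from the byte budget and an item-size
    # estimate (seeded from the prior page token / a per-namespace
    # running estimate).
    return max(1, page_size_bytes // max(1, est_item_size))

def read_page(fetch, page_size_bytes: int, est_item_size: int):
    # `fetch(limit)` stands in for a store query returning up to `limit`
    # rows; a fuller implementation would re-query on underflow (small
    # items) rather than return a short page.
    page, used = [], 0
    rows = fetch(plan_row_limit(page_size_bytes, est_item_size))
    for row in rows:
        if used + len(row) > page_size_bytes and page:
            break  # big items: discard the excess rather than blow the budget
        page.append(row)
        used += len(row)
    # Observed size feeds the next page's limit.
    new_estimate = used // len(page) if page else est_item_size
    return page, new_estimate
```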
- SLO-aware early response: if the server projects it will miss the request's latency SLO while filling the page, it stops and returns what it has + a page token. "If it determines that continuing to retrieve more data might breach the SLO, the server will stop processing further results and return a response with a pagination token." Combined with gRPC deadlines on the client side, the client "is smart enough not to issue further requests, reducing useless work." See patterns/slo-aware-early-response.
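The early-response fallback can be sketched as a deadline projection while filling a page; the per-row cost model is an assumption:

```python
import time

def fill_page(rows, deadline_s: float, est_cost_per_row_s: float):
    # While filling a page, project whether processing the next row would
    # breach the request's latency deadline; if so, stop and return the
    # partial page plus a pagination token so the client can resume.
    page = []
    start = time.monotonic()
    for cursor, row in enumerate(rows):
        elapsed = time.monotonic() - start
        if elapsed + est_cost_per_row_s > deadline_s:
            return page, {"resume_at": cursor}   # partial page + token
        page.append(row)
    return page, None                            # complete page, no token
```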
- Client-side compression cut payloads 75% on a Netflix Search deployment. "In one of our deployments, which helps power Netflix's search, enabling client-side compression reduced payload sizes by 75%, significantly improving cost efficiency." Client-side rather than server-side is deliberate: "reduces expensive server CPU usage, network bandwidth, and disk I/O." See concepts/client-side-compression.
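A client-side compression sketch using zlib; the skip-if-incompressible guard is an assumption beyond what the post describes:

```python
import zlib

def compress_value(value: bytes, min_gain: float = 0.1) -> tuple[bytes, bool]:
    # Compress before PutItems: spend client CPU to save server CPU,
    # network bandwidth, and disk I/O. Keep the original only when
    # compression gains less than `min_gain` (an assumed guard).
    packed = zlib.compress(value)
    if len(packed) <= len(value) * (1 - min_gain):
        return packed, True
    return value, False

# Repetitive text payloads (like much search metadata) compress well:
payload = b'{"title": "stranger things", "rank": 1} ' * 2_000
packed, was_compressed = compress_value(payload)
```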
- In-band signaling (periodic handshake) replaces static config with live server/client capability exchange. Server signals: target/max latency SLOs (so clients tune timeouts + hedging policy). Client signals (per-request): capability flags — compression, chunking, etc. "Without signaling, the client would need static configuration — requiring a redeployment for each change — or, with dynamic configuration, would require coordination with the client team." See concepts/in-band-signaling-handshake.
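Hypothetical message shapes for the handshake; the real message catalog, schemas, and cadence are not disclosed, so every name below is illustrative:

```python
from dataclasses import dataclass

@dataclass
class ServerSignals:
    # Pushed periodically by the server during the handshake.
    target_latency_ms: int
    max_latency_ms: int

@dataclass
class ClientCapabilities:
    # Carried on each request so the server knows what the client supports.
    supports_compression: bool = True
    supports_chunking: bool = True

def tune_client(signals: ServerSignals) -> tuple[int, int]:
    # The client derives its timeout and hedging policy from pushed SLOs
    # instead of redeploy-time static config: time out at the max SLO,
    # hedge a duplicate request once the target SLO is exceeded.
    return signals.max_latency_ms, signals.target_latency_ms
```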
- Cassandra-specific tombstone discipline: record + range deletes emit one tombstone; item-level deletes use TTL + jitter. "Key-Value optimizes both record and range deletes to generate a single tombstone for the operation." For item-level deletes (which Cassandra struggles with at high volume due to tombstone compaction overhead), KV writes TTL-based deletes with jitter — the metadata is flagged expired with a randomised TTL so compaction work staggers rather than spikes. This "reduces load spikes and helps maintain consistent performance while compaction catches up," at the cost of delaying true physical deletion. See concepts/ttl-based-deletion-with-jitter and concepts/tombstone.
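A sketch of the TTL-with-jitter delete; the base TTL and jitter window are assumptions, as the post gives no concrete numbers:

```python
import random

BASE_TTL_S = 60   # base TTL is an assumption; the post gives no number
JITTER_S = 30     # jitter window is likewise an assumption

def delete_item(main_table: dict, id_: str, key: bytes) -> None:
    # Item-level delete: instead of writing one Cassandra tombstone per
    # item, mark the item's metadata expired with a randomised TTL so
    # physical deletes (and the compaction work they cause) stagger
    # instead of spiking. Trade-off: no immediate physical delete.
    row = main_table[(id_, key)]
    row["expired"] = True                               # reads filter this out now
    row["ttl_s"] = BASE_TTL_S + random.uniform(0, JITTER_S)

table = {("u1", b"k"): {"value": b"v"}}
delete_item(table, "u1", b"k")
```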
- Named production consumers span Netflix's hottest paths. Streaming metadata (high-throughput, low-latency); User profiles (preferences + history); Messaging / push registry (persists Pushy's push-delivery state); Real-time analytics impressions (backs the Bulldozer warehouse→online movement path).
- The DAL cleanly encodes storage-engine complexity on Netflix's behalf so app engineers focus on data problems, not consistency models. "It abstracts the complexity of the underlying databases from our developers, which enables our application engineers to focus on solving business problems instead of becoming experts in every storage engine and their distributed consistency models." Future-work list names four extensions: lifecycle management, record summarisation (fewer backing rows for records with many items), new storage engine integrations, and dictionary compression.
Architecture at a glance
┌──────────────────── Netflix microservice clients ───────────────────┐
│ Capability signals per-request: compression? chunking? ... │
│ Periodic handshake: server pushes target/max latency SLO ─────────┐│
└──────────────────────────────┬─────────────────────────────────────┘│
│ gRPC (PutItems / GetItems / │
│ DeleteItems / MutateItems / │
│ ScanItems) │
▼ │
┌───────────────── KV Data Abstraction Layer (DAL) ───────────────────┤
│ Namespace config (logical + physical) ─────────────────────────────┤
│ e.g. `ngsegment`: │
│ PRIMARY_STORAGE = Cassandra cluster cassandra_kv_ngsegment │
│ consistency_scope = LOCAL │
│ consistency_target = READ_YOUR_WRITES │
│ CACHE = EVCache evcache_kv_ngsegment │
│ default_cache_ttl = 180s │
│ │
│ Four CRUD APIs over HashMap<String, SortedMap<Bytes, Bytes>> │
│ Put: idempotency-token-protected upsert; chunks > 1 MiB │
│ Get: Predicate (match_all|keys|range) + Selection │
│ (page_size_bytes, item_limit, include/exclude) │
│ → adaptive byte-budget pagination │
│ → SLO-aware early response with next_page_token │
│ Del: record / range = 1 tombstone; │
│ item = TTL-with-jitter mark-expired │
│ │
│ Idempotency token = (generation_time, 128-bit random token); │
│ client-generated monotonic; large-drift rejected │
└──────────┬──────────────┬─────────────┬──────────────┬──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Cassandra │ │ EVCache │ │ DynamoDB │ │ RocksDB │
│ primary+ │ │ cache │ │ │ │ │
│ chunk │ │ tier │ │ │ │ │
│ stores │ │ │ │ │ │ │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
Chunk store uses a different partitioning scheme than the main table,
optimised for large-value access; one idempotency token binds all
chunk writes + main-table commit into one atomic logical write.
Operational numbers
| Metric | Value | Context |
|---|---|---|
| Large-item chunking threshold | 1 MiB | Below: stored in main backing storage. Above: id/key/metadata in main table, body split into chunks in chunk store. |
| Predictable SLO anchor | single-digit ms for a 2 MiB page read | Byte-budget pagination makes this the explicit SLO target KV engineers ship against. |
| Clock skew on EC2 Nitro | under 1 ms | Measured on the Cassandra fleet; the empirical basis for safe client-generated monotonic idempotency tokens. |
| EVCache cache TTL (example) | 180 s | ngsegment namespace default in the post's config example. |
| Client-side compression payload reduction | 75% | One Netflix Search deployment. |
| Idempotency-token random component | 128-bit random nonce | Paired with the generation_time timestamp. |
| Named production consumers | ≥ 4 | Streaming metadata, user profiles, messaging push registry (Pushy), impression persistence (Bulldozer). |
Caveats
- Architecture-overview voice. No cluster counts, fleet-wide QPS, write-path p99, chunk-store CPU numbers, or EVCache hit-ratio breakdowns. The post is a high-level tour of the KV service, not a production-telemetry report. Future posts are promised on chunking architecture internals and on `MutateItems`/`ScanItems`.
- Only one concrete namespace config shown (`ngsegment`: Cassandra + EVCache). No examples of DynamoDB- or RocksDB-backed namespaces, nor cross-region replication configs. The mechanism is described; the deployment-topology variety is not.
- The 75% compression number is anchored to one deployment (Netflix Search). The post does not claim it generalises. Different payload shapes (binary media, ML tensors) will see different ratios.
- Clock-skew-safety of client-generated tokens is asserted for Netflix's EC2 Nitro fleet. Under-1ms drift is a cloud-VM property — a team outside EC2 Nitro (or on a VM class with worse clocks) would need to re-measure before adopting this pattern. Netflix acknowledges the substitute path (Zookeeper-issued regional tokens, globally-unique transaction IDs).
- The TTL-jitter approach is honest about being a mitigation, not a fix. "While this doesn't completely solve the problem it reduces load spikes and helps maintain consistent performance while compaction catches up." Item-level deletes on Cassandra still generate tombstones the compactor must eventually reconcile.
- `MutateItems`/`ScanItems` internals deferred. "These complex APIs require careful consideration to ensure predictable linear low-latency and we will share details on their implementation in a future post."
- No mention of cross-region writes or conflict-resolution semantics. The `consistency_target: READ_YOUR_WRITES` implies LOCAL-scope Cassandra; the global-deployment implications (e.g. active-active, CRDT-style conflict resolution, or primary-region routing) are not covered. The future-work list mentions lifecycle management but not replication topology.
- Adaptive pagination's "recent query patterns or other factors" tuning loop is not specified — no exponential-average time constant, no cold-start fallback, no per-tenant isolation of the namespace rolling estimate. The mechanism is named; the dial is not.
- No performance comparison vs direct-Cassandra-driver access. The gRPC DAL hop adds serialization + network latency vs a library-client talking directly to the driver. The post argues the operational + evolution-cost wins pay for it, but does not quantify the per-RPC overhead.
- Signaling describes the mechanism but not the message catalog. Which specific signals exist today, their schemas, and the cadence/TTL of handshake metadata are not disclosed.
Source
- Original: https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30
- Raw markdown: raw/netflix/2024-09-19-netflixs-key-value-data-abstraction-layer-01c37ffa.md
- HN discussion: news.ycombinator.com/item?id=41587461 (93 points)
- Related prior Netflix posts (linked from the original): Data Gateway Platform overview · How Netflix Ensures Highly-Reliable Online Stateful Systems (InfoQ) · Pushy WebSocket proxy · Bulldozer: batch → online KV
Related
- companies/netflix
- systems/netflix-kv-dal · systems/netflix-data-gateway · systems/evcache
- systems/apache-cassandra · systems/dynamodb · systems/rocksdb
- concepts/two-level-map-kv-model · concepts/database-agnostic-abstraction · concepts/idempotency-token · concepts/byte-size-pagination · concepts/adaptive-pagination · concepts/client-side-compression · concepts/in-band-signaling-handshake · concepts/ttl-based-deletion-with-jitter · concepts/wide-partition-problem · concepts/tunable-consistency · concepts/tombstone · concepts/tail-latency-at-scale
- patterns/data-abstraction-layer · patterns/transparent-chunking-large-values · patterns/namespace-backed-storage-routing · patterns/slo-aware-early-response