Netflix KV Data Abstraction Layer (KV DAL)¶
Netflix KV DAL is the mature, production-exemplar Data Abstraction Layer built on Netflix's Data Gateway Platform. It is a gRPC service that sits between Netflix microservices and multiple storage engines (Cassandra, EVCache, DynamoDB, RocksDB) and exposes a uniform two-level-map data model over four CRUD APIs. It powers streaming metadata, user profiles, Netflix's push-messaging registry (Pushy), and large-scale impression persistence (Bulldozer). (Source: sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer)
Problem it solves¶
Netflix ran many microservices against many online key-value and wide-column stores (Cassandra, EVCache, DynamoDB, RocksDB) and kept paying three recurring taxes:
- Developers mis-reasoning about consistency/durability/perf across stores deployed globally.
- Same data-access pitfalls re-learned per-team — tail latency, idempotency, wide partitions, fat columns, slow pagination.
- Tight coupling with evolving native DB APIs — backward-incompatible client-library changes dragged every microservice with them.
KV DAL centralizes the solutions in one service: developers describe a data problem via a namespace; the service routes to the right backend(s), handles idempotency, pagination, compression, large values, and tombstones.
Data model¶
HashMap<String, SortedMap<Bytes, Bytes>>:
- Level 1: hashed string id (partition key on Cassandra).
- Level 2: sorted key → value map of bytes (clustering column on Cassandra).
One structure covers HashMaps (id → {"" → value}), Sets (id → {key → ""}), Records, and Events. See concepts/two-level-map-kv-model.
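The two-level-map shapes above can be sketched in plain Python (names like put_items/get_items are illustrative, not the actual gRPC API; KV DAL exposes this model over gRPC, not in-process):

```python
# Two-level map: HashMap<String, SortedMap<Bytes, Bytes>>.
# A plain dict models the hashed outer level; inner keys are sorted on read.
store: dict[str, dict[bytes, bytes]] = {}

def put_items(id_: str, items: dict[bytes, bytes]) -> None:
    # Level 1: hashed string id; Level 2: key -> value map of bytes.
    store.setdefault(id_, {}).update(items)

def get_items(id_: str) -> list[tuple[bytes, bytes]]:
    # Inner map is returned in key order (the Cassandra clustering order).
    return sorted(store.get(id_, {}).items())

# One structure covers several shapes:
put_items("profile:42", {b"": b"whole-record-blob"})  # HashMap: id -> {"" -> value}
put_items("watchlist:42", {b"title:tt0111161": b""})  # Set: id -> {key -> ""}
put_items("events:42", {b"\x00\x02": b"e2", b"\x00\x01": b"e1"})  # Events: ordered keys
```

The sorted inner level is what makes range predicates and ordered pagination cheap on the backing store.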
Canonical Cassandra DDL:
CREATE TABLE IF NOT EXISTS <ns>.<table> (
id text,
key blob,
value blob,
value_metadata blob,
PRIMARY KEY (id, key))
WITH CLUSTERING ORDER BY (key <ASC|DESC>)
Each item carries an explicit value_metadata field alongside its value, holding chunk and other per-item metadata.
Namespace: one unit of logical + physical config¶
A namespace bundles:
- Where data lives (which clusters / caches / regions).
- Access-pattern config — consistency_scope, consistency_target, cache TTLs, SLO targets.
- Which backends serve which tier (primary, cache, chunk).
Example from the post (ngsegment):
persistence_configuration:
- id: PRIMARY_STORAGE
physical_storage:
type: CASSANDRA
cluster: cassandra_kv_ngsegment
dataset: ngsegment
table: ngsegment
regions: [us-east-1]
config:
consistency_scope: LOCAL
consistency_target: READ_YOUR_WRITES
- id: CACHE
physical_storage:
type: CACHE
cluster: evcache_kv_ngsegment
config:
default_cache_ttl: 180s
This gives a durable Cassandra primary plus an EVCache cache tier in one namespace. Developers pick the namespace; the routing + combination logic lives in KV DAL. See patterns/namespace-backed-storage-routing.
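A minimal sketch of namespace-backed routing under this config shape (the dict layout and route function are hypothetical simplifications of the YAML above, not KV DAL's internal representation):

```python
# Namespace config maps a logical name to physical tiers, mirroring the
# ngsegment example: Cassandra primary + EVCache cache tier.
NAMESPACES = {
    "ngsegment": {
        "primary": {"type": "CASSANDRA", "cluster": "cassandra_kv_ngsegment"},
        "cache": {"type": "CACHE", "cluster": "evcache_kv_ngsegment", "ttl_s": 180},
    },
}

def read_tiers(namespace: str) -> list[str]:
    # The caller names only the namespace; the DAL resolves the tier order:
    # reads try the cache first, then fall through to the primary store.
    cfg = NAMESPACES[namespace]
    tiers = []
    if "cache" in cfg:
        tiers.append(cfg["cache"]["cluster"])
    tiers.append(cfg["primary"]["cluster"])
    return tiers
```

The point of the indirection is that tiers can be added, removed, or re-homed by editing the namespace config, with no client code change.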
Four CRUD APIs¶
PutItems — upsert with idempotency¶
message PutItemRequest (
IdempotencyToken idempotency_token,
string namespace,
string id,
List<Item> items
)
Idempotency token = (generation_time, 128-bit token). Chunked
payloads are staged then committed with metadata (number of chunks)
— all tied atomically by one token. See
concepts/idempotency-token and
patterns/transparent-chunking-large-values.
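The token-protected upsert can be sketched as follows (make_idempotency_token and put_once are hypothetical names; the dedupe map stands in for the staged-then-committed chunk protocol):

```python
import time
import uuid

def make_idempotency_token() -> tuple[int, int]:
    # (generation_time, 128-bit token): the client clock gives coarse monotonic
    # ordering; the random 128-bit value disambiguates same-millisecond writes.
    return (time.time_ns() // 1_000_000, uuid.uuid4().int)

applied: dict[tuple[int, int], bytes] = {}

def put_once(token: tuple[int, int], value: bytes) -> bytes:
    # A retry reuses the same token, so the replay is a no-op: staged chunk
    # writes and the main-table commit all hang off this one token.
    if token not in applied:
        applied[token] = value
    return applied[token]
```

Because the token, not the request attempt, identifies the logical write, retries and hedged duplicates converge on one stored value.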
GetItems — Predicate + Selection¶
message GetItemsRequest (
string namespace,
string id,
Predicate predicate,
Selection selection,
Map<String, Struct> signals
)
- Predicate: match_all / match_keys / match_range.
- Selection: page_size_bytes, item_limit, include/exclude (trim large values from the response).
- Signals: per-request capability flags (compression, chunking).
Response: (items, next_page_token) with
byte-budgeted + adaptive
pagination, and SLO-aware
early response when the server projects it will miss the request
deadline.
DeleteItems — same Predicate shape as reads¶
message DeleteItemsRequest (
IdempotencyToken idempotency_token,
string namespace,
string id,
Predicate predicate
)
- Record-level (match_all): one tombstone, constant latency regardless of item count.
- Range (match_range): one tombstone for the range.
- Item-level (match_keys): would normally create many tombstones; KV DAL instead marks metadata expired with a randomised TTL — see concepts/ttl-based-deletion-with-jitter — so compaction work staggers rather than spikes.
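The TTL-jitter trick for item-level deletes can be sketched as below (the base retention and jitter window are invented for illustration; the post does not give concrete values):

```python
import random

BASE_TTL_S = 24 * 3600      # assumed base retention before compaction reclaims the row
JITTER_SPREAD_S = 4 * 3600  # assumed spread window for staggering compaction work

def deletion_ttl_s() -> int:
    # Instead of writing one tombstone per deleted item, mark its metadata
    # expired with a randomised TTL: rows age out over the spread window,
    # so compaction load staggers rather than spiking at one instant.
    return BASE_TTL_S + random.randint(0, JITTER_SPREAD_S)
```

Each deleted item gets an independent draw, so a bulk delete of N items produces N expiries smeared across the window instead of N simultaneous tombstones.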
MutateItems / ScanItems — deferred¶
Complex multi-item / multi-record ops. Architecture detail promised in a future post.
Reliability primitives¶
Idempotency-token-protected writes¶
Tokens are client-generated and monotonic: generation_time (client
clock) plus a 128-bit UUID. Rationale: "client-generated monotonic
tokens are preferred due to their reliability, especially in
environments where network delays could impact server-side token
generation." Safety relies on measured sub-1 ms clock skew on
Netflix's EC2 Nitro fleet. KV rejects writes whose tokens are far
in the past (silent-discard prevention) or far in the future
(doomstone prevention). Stronger-ordering use cases substitute
Zookeeper-issued regional tokens or globally-unique transaction IDs.
This is what makes retries / hedging / parallel speculative requests safe on last-write-wins stores like Cassandra — see concepts/tail-latency-at-scale.
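The past/future rejection window can be sketched as follows (the one-minute bounds are invented for illustration; the post states the policy but not the thresholds):

```python
MAX_PAST_MS = 60_000    # hypothetical bound: far-past tokens risk silent discard
MAX_FUTURE_MS = 60_000  # hypothetical bound: far-future tokens act as "doomstones"

def accept_token(generation_time_ms: int, now_ms: int) -> bool:
    # Reject writes whose tokens sit too far in the past (the write would lose
    # last-write-wins resolution silently) or too far in the future (the token
    # would shadow legitimate later writes until its timestamp passes).
    skew_ms = now_ms - generation_time_ms
    return -MAX_FUTURE_MS <= skew_ms <= MAX_PAST_MS
```

This check is cheap because it needs only the server clock and the token; sub-millisecond measured skew on the fleet is what makes tight bounds safe.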
Transparent chunking (> 1 MiB threshold)¶
- Items ≤ 1 MiB → main table.
- Items > 1 MiB → id/key/metadata in main table; body split into chunks in a separately-partitioned chunk store (which may itself be Cassandra, but with a partitioning scheme tuned for large values).
- One idempotency token binds chunk writes + main-table commit into one atomic logical write.
Result: "latency scales linearly with the size of the data." See patterns/transparent-chunking-large-values.
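Transparent chunking can be sketched like this (the two dicts stand in for the main table and the separately-partitioned chunk store; the staged-then-committed token protocol is elided):

```python
CHUNK_THRESHOLD = 1 << 20  # 1 MiB

def write_item(main: dict, chunkstore: dict, id_: str, key: bytes, value: bytes) -> None:
    if len(value) <= CHUNK_THRESHOLD:
        main[(id_, key)] = {"value": value, "chunks": 0}
        return
    # Large value: stage fixed-size chunks in the chunk store, then commit
    # metadata (the chunk count) to the main table as the last step.
    chunks = [value[i:i + CHUNK_THRESHOLD] for i in range(0, len(value), CHUNK_THRESHOLD)]
    for n, chunk in enumerate(chunks):
        chunkstore[(id_, key, n)] = chunk
    main[(id_, key)] = {"value": None, "chunks": len(chunks)}

def read_item(main: dict, chunkstore: dict, id_: str, key: bytes) -> bytes:
    meta = main[(id_, key)]
    if meta["chunks"] == 0:
        return meta["value"]
    return b"".join(chunkstore[(id_, key, n)] for n in range(meta["chunks"]))
```

Committing the main-table row last means a reader never sees metadata pointing at chunks that have not all landed.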
Client-side compression¶
Moved from server to client: saves server CPU + network + disk I/O. One deployment (Netflix Search) reported 75% payload reduction. See concepts/client-side-compression.
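A minimal client-side compression sketch (zlib chosen for illustration; the post does not name the codec, and the compressed flag would travel in value_metadata):

```python
import zlib

def compress_value(value: bytes) -> tuple[bytes, bool]:
    # Compress on the client, saving server CPU plus network and disk I/O.
    # Keep the compressed form only if it actually shrinks the payload; the
    # returned flag tells readers whether to decompress.
    packed = zlib.compress(value)
    if len(packed) < len(value):
        return packed, True
    return value, False
```

The only-if-smaller rule matters for already-compressed media blobs, where a codec pass would otherwise inflate the value.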
Byte-size + adaptive pagination¶
- Page limit is bytes not items → predictable SLO ("single-digit millisecond SLO on a 2 MiB page read").
- Backing stores (Cassandra, DynamoDB) limit by row count — KV queries a static number of rows, processes, discards excess (big items) or issues more queries (small items).
- Adaptive: page-token carries observed item-size; server also keeps a per-namespace rolling estimate for initial-page tuning.
See concepts/byte-size-pagination and concepts/adaptive-pagination.
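The row-count estimate behind adaptive pagination can be sketched as below (the slack factor and function shape are assumptions; KV DAL's actual estimator is not published):

```python
def rows_to_fetch(page_size_bytes: int, avg_item_bytes: float,
                  slack: float = 1.25) -> int:
    # Backing stores (Cassandra, DynamoDB) limit queries by row count, so the
    # byte budget must be converted to rows. Use the rolling per-namespace
    # average item size, with slack so small items rarely force a second query;
    # oversized results are simply discarded after processing.
    return max(1, int(page_size_bytes / max(avg_item_bytes, 1.0) * slack))
```

Subsequent pages can refine the estimate with the item sizes actually observed, carried in the page token.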
SLO-aware early response¶
If the server projects retrieving enough data to fill the page will miss the request's end-to-end latency SLO, it stops early and returns what it has + a continuation token. Combined with client-side gRPC deadlines this prevents useless downstream work. See patterns/slo-aware-early-response.
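The early-response loop can be sketched as follows (fetch_next and the injectable clock are hypothetical scaffolding; the real server works against storage iterators and gRPC deadlines):

```python
import time

def get_page(fetch_next, budget_bytes: int, deadline_s: float, now=time.monotonic):
    # Fill the byte-budgeted page, but stop early the moment the deadline is
    # projected to be missed, returning the partial page + continuation token
    # instead of burning downstream work the client will never use.
    start = now()
    page, used, cursor = [], 0, 0
    while used < budget_bytes:
        if now() - start >= deadline_s:
            return page, cursor          # early response: partial page + token
        item = fetch_next(cursor)
        if item is None:
            return page, None            # data exhausted: no continuation
        page.append(item)
        used += len(item)
        cursor += 1
    return page, cursor
```

The continuation token lets the client resume exactly where the server stopped, so an early response costs a round trip but never loses progress.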
In-band signaling (periodic handshake)¶
- Server-side signals (on handshake + periodic refresh): target / max latency SLOs → client dynamically tunes timeouts + hedging.
- Client-side signals (per-request): capability flags — does this client support compression, chunking, etc.
Eliminates the static-config + coordinated-redeploy cost of evolving client/server behaviour independently. See concepts/in-band-signaling-handshake.
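The handshake exchange can be sketched like this (the SLO numbers and field names are illustrative, not Netflix's actual values or wire format):

```python
def server_handshake_signals() -> dict:
    # Sent on handshake and refreshed periodically: the server advertises its
    # current latency SLOs so clients need no static timeout config.
    return {"target_latency_ms": 10, "max_latency_ms": 50}

def tune_client(signals: dict) -> dict:
    # The client derives its hedging trigger and hard timeout from the
    # advertised SLOs; when the server's SLOs change, clients retune on the
    # next refresh with no coordinated redeploy.
    return {
        "hedge_after_ms": signals["target_latency_ms"],
        "timeout_ms": signals["max_latency_ms"],
    }
```

Client-side capability flags flow the other way, per request, so a server can emit compressed or chunked responses only to clients that declared support.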
Named production consumers¶
- Streaming metadata — high-throughput, low-latency for real-time personalised delivery.
- User profiles — preferences + history across devices.
- Messaging push registry — backs Netflix's Pushy WebSocket proxy.
- Real-time analytics impressions — integrates with the Bulldozer batch→online pipeline.
Future work (named in the post)¶
- Lifecycle management — fine-grained retention / deletion.
- Summarisation — collapse records with many items into fewer backing rows.
- New storage engines — integration to support more use cases.
- Dictionary compression — further payload reduction while preserving performance.
Documented downstream consumer use cases¶
Beyond the platform-launch post, the first wiki-documented KVDAL consumer case study is:
- Netflix Druid interval-aware cache (2026-04-06) — KVDAL two-level-map as the store for per-granularity-aligned-bucket cached query fragments in front of Apache Druid. Exercises two KVDAL features as load-bearing architecture:
- Independent TTLs on each inner key-value pair — used for the age-based exponential TTL ladder (5 s for buckets <2 min old, doubling per minute, capped at 1 hour); eliminates manual eviction logic.
- Efficient range queries over inner keys — "give me all cached buckets between A and B for query hash X"; directly answers the cache's contiguous-prefix lookup. Production: 82% of dashboard queries get ≥partial hit, 84% of result data from cache, P90 ~5.5 ms, ~33% drop in queries to Druid, ~66% P90 improvement (Source: sources/2026-04-06-netflix-stop-answering-the-same-question-twice-interval-aware-caching-for-druid).
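The age-based TTL ladder can be sketched as below (this encodes my reading of the stated rule — 5 s under 2 minutes, doubling per additional minute, capped at 1 hour — the exact step boundaries are not spelled out in the post):

```python
def bucket_ttl_s(age_s: float) -> int:
    # Fresh buckets churn, so they get a short TTL; older buckets are stable,
    # so their TTL doubles per minute of age up to a 1-hour cap. Each inner
    # key-value pair carries its own TTL, so no manual eviction pass is needed.
    if age_s < 120:
        return 5
    minutes_past_floor = int(age_s // 60) - 1  # whole minutes beyond the 2-min floor
    return min(3600, 5 * 2 ** minutes_past_floor)
```

Pairing this with range scans over inner keys is what lets one KVDAL read answer "all cached buckets between A and B for query hash X" while stale entries evict themselves.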
Seen in¶
- sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer — introducing KV DAL; data model; namespace config; CRUD APIs with idempotency + chunking + pagination + SLO-aware early response + signaling; tombstone + TTL-jitter deletion discipline; named production consumers; future-work list.
- sources/2026-04-06-netflix-stop-answering-the-same-question-twice-interval-aware-caching-for-druid — first wiki-documented KVDAL downstream consumer use case; exercises the independent-per-inner-key TTL feature and range-scan over inner keys as the two load-bearing KVDAL capabilities for Netflix's interval-aware Druid cache.
Related¶
- systems/netflix-data-gateway — the parent platform on which KV DAL is the most mature abstraction service.
- systems/apache-cassandra · systems/evcache · systems/dynamodb · systems/rocksdb — backing engines KV can route to.
- patterns/data-abstraction-layer · patterns/namespace-backed-storage-routing · patterns/transparent-chunking-large-values · patterns/slo-aware-early-response
- concepts/two-level-map-kv-model · concepts/database-agnostic-abstraction · concepts/idempotency-token · concepts/byte-size-pagination · concepts/adaptive-pagination · concepts/client-side-compression · concepts/in-band-signaling-handshake · concepts/ttl-based-deletion-with-jitter
- companies/netflix