PATTERN
Transparent chunking for large values¶
Transparent chunking is the pattern of splitting a single logical value into multiple physical chunks inside a KV / document store, keyed so that the caller never has to reason about the split — reads recombine chunks, writes are atomic across chunks. The pattern lets a KV store handle blob-scale values without either (a) exposing the chunking to callers or (b) imposing a fat-column cost on the main table.
Canonical instance: Netflix's KV Data Abstraction Layer, which transparently chunks any item larger than 1 MiB.
The problem¶
Many KV / wide-column stores have practical size limits per cell or per partition:
- Cassandra: partitions > tens of MB get slow; very large blobs blow memory during reads / compaction / repair.
- DynamoDB: hard 400 KB per-item limit.
- Most KV engines: big values on the hot path starve cache and interact poorly with pagination.
Storing a 100 MiB profile-state blob in one row produces the fat-column form of the wide-partition problem and is operationally ruinous.
The pattern¶
Two-tier split for any item over a threshold:
- Main table: `id`, `key`, and `metadata` stay where callers expect them — durable, indexed, participating in range scans. The `value` field is replaced with a reference to the chunk set (typically a chunk count + a chunk-group identifier).
- Chunk store: same engine (or a different one) with a partitioning scheme optimized for large values; holds ordered chunks keyed by the main-table reference + chunk index.
Writes stage the chunks first, then commit with a metadata write that names them; reads follow the reference and stream chunks back in order.
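A minimal sketch of the two-tier split, with in-memory dicts standing in for the main table and the chunk store. The names (`put_item`, `get_item`), the chunk size, and the dict layout are all illustrative assumptions, not the DAL's actual API; only the shape — inline below the threshold, staged chunks plus a naming metadata write above it — follows the source.

```python
import uuid

CHUNK_THRESHOLD = 1 * 1024 * 1024  # 1 MiB split point, per the source
CHUNK_SIZE = 256 * 1024            # hypothetical chunk size

# In-memory stand-ins for the two tables.
main_table = {}   # key -> {"metadata": ..., "value" or "chunk_ref"}
chunk_store = {}  # (chunk_group_id, index) -> bytes

def put_item(key, value, metadata=None):
    """Store small values inline; chunk large ones transparently."""
    if len(value) <= CHUNK_THRESHOLD:
        main_table[key] = {"metadata": metadata, "value": value}
        return
    # Stage the chunks first, under a fresh chunk-group identifier...
    group_id = uuid.uuid4().hex
    chunks = [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)]
    for idx, chunk in enumerate(chunks):
        chunk_store[(group_id, idx)] = chunk
    # ...then commit with a metadata write that names them.
    main_table[key] = {
        "metadata": metadata,
        "chunk_ref": {"group_id": group_id, "chunk_count": len(chunks)},
    }

def get_item(key):
    """Reads follow the reference and recombine chunks in order."""
    row = main_table[key]
    if "value" in row:
        return row["value"]
    ref = row["chunk_ref"]
    return b"".join(
        chunk_store[(ref["group_id"], i)] for i in range(ref["chunk_count"])
    )
```

Note the caller-facing symmetry: `put_item` and `get_item` look identical for a 5 KB value and a 100 MiB one — which side of the threshold an item falls on is invisible.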
What makes it atomic: one idempotency token¶
If chunks are written independently, a mid-write failure could leave the main-table commit pointing at a half-written chunk set. The KV DAL binds the chunk writes and the main-table commit with one idempotency token — either the entire logical write lands, or a retry carrying the same token is deduplicated against the partial write, so a half-written set never becomes visible.
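The retry semantics can be sketched like this — an assumed token registry dedupes repeats, and staged chunks stay invisible until the main-table commit names them. Reusing the token as the chunk-group id is a sketch choice, not something the source specifies:

```python
import uuid

# A retry with the same token is deduplicated rather than reapplied.
applied_tokens = set()
chunk_store = {}
main_table = {}

def atomic_chunked_write(key, chunks, metadata, token=None):
    """One token binds the chunk writes and the main-table commit."""
    token = token or uuid.uuid4().hex
    if token in applied_tokens:
        return token  # retry of an already-committed write: no-op
    group_id = token  # sketch choice: token doubles as chunk-group id
    for idx, chunk in enumerate(chunks):
        chunk_store[(group_id, idx)] = chunk  # staged; not yet visible
    # Commit point: only this write makes the chunk set reachable. A
    # reader following the old row never sees the half-written set.
    main_table[key] = {
        "metadata": metadata,
        "chunk_ref": {"group_id": group_id, "chunk_count": len(chunks)},
    }
    applied_tokens.add(token)
    return token
```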
"For items smaller than 1 MiB, data is stored directly in the main backing storage (e.g. Cassandra), ensuring fast and efficient access. However, for larger items, only the id, key, and metadata are stored in the primary storage, while the actual data is split into smaller chunks and stored separately in chunk storage. This chunk storage can also be Cassandra but with a different partitioning scheme optimized for handling large values. The idempotency token ties all these writes together into one atomic operation." (Source: sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer)
Why transparent (not exposed to callers)¶
- Consumers keep a simple API. `PutItems` takes an item; the DAL decides whether it's chunked based on size. No caller cares; no caller's code path depends on which side of the 1 MiB line the item fell.
- Threshold tuning stays platform-side. Netflix could raise the threshold to 2 MiB tomorrow without any microservice change.
- Latency becomes linear in size. "By splitting large items into chunks, we ensure that latency scales linearly with the size of the data, making the system both predictable and efficient." No latency cliff at the engine's fat-column threshold.
Contrast with patterns/metadata-plus-chunk-storage-stack¶
Both patterns split metadata and bytes — but at different layers:
- Transparent chunking for large values (this pattern) — intra-row split inside a KV store. Callers see one "item"; the DAL splits big values into chunks. Main and chunk stores are two tables in (usually) the same engine.
- patterns/metadata-plus-chunk-storage-stack — filesystem / volume / object-store architecture. A metadata tier (FoundationDB, SQLite, Redis) separate from a chunk tier (S3, local NVMe) at the storage-system level. Different scale, different substrate diversity, same architectural instinct.
Trade-offs¶
- Reads of large items are multi-chunk round trips — latency scales linearly; it's not zero overhead.
- Chunk-store partitioning is a separate operational concern. The chunk-store partition scheme (often differently keyed) becomes an extra bit of infrastructure to monitor.
- The threshold is operational — too low and every write pays the chunking overhead; too high and the fat-column risk remains.
- Eviction / GC of orphaned chunks — if a metadata write commits but then the row is deleted, chunks must be GC'd; not doing this cleanly causes storage bloat.
- Chunked reads interact with byte-size pagination — a paginator pulling large chunked items may exceed the byte budget mid-item, requiring the early-response path.
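The orphaned-chunk trade-off can be made concrete with a mark-and-sweep sketch: collect the chunk-group ids the main table still references, delete everything else. The function name and signatures are hypothetical; a production GC would also need to avoid racing with in-flight staged writes (e.g. a grace period on chunk age), which this sketch omits:

```python
def gc_orphaned_chunks(main_table, chunk_store):
    """Delete chunk groups that no main-table row references.

    main_table: key -> row dict, chunked rows carrying a "chunk_ref".
    chunk_store: (group_id, index) -> bytes.
    Returns the number of chunks removed.
    """
    # Mark: every chunk-group id still named by a committed row is live.
    live = {row["chunk_ref"]["group_id"]
            for row in main_table.values() if "chunk_ref" in row}
    # Sweep: drop chunks whose group is unreferenced.
    orphans = [k for k in chunk_store if k[0] not in live]
    for k in orphans:
        del chunk_store[k]
    return len(orphans)
```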
Seen in¶
- sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer — canonical. Netflix KV DAL chunks items > 1 MiB; main table holds id/key/metadata; chunk store uses a different partitioning scheme; one idempotency token binds the atomic write.
Related¶
- systems/netflix-kv-dal — canonical instance.
- concepts/idempotency-token — the atomicity substrate.
- concepts/wide-partition-problem — the fat-column failure this pattern prevents.
- patterns/metadata-plus-chunk-storage-stack — the outer-layer, filesystem-scale sibling of the same split.
- systems/apache-cassandra — Netflix's documented backing engine for both main and chunk stores.
- patterns/data-abstraction-layer — the DAL is where this chunking discipline lives.