Document storage compression¶
Definition¶
Document storage compression is the per-block compression a
database's storage engine applies before writing pages to
disk; pages are decompressed again when loaded back into the cache.
In MongoDB's WiredTiger storage engine the algorithm is set
instance-wide for new collections via
storage.wiredTiger.collectionConfig.blockCompressor, and can be
overridden per collection at creation time.
Available algorithms:
- snappy — WiredTiger's default. Google's fast compression library; moderate ratio, very low CPU overhead. Designed for "fast over small."
- zstd — higher ratio than snappy at higher CPU cost; added as a WiredTiger option to let workloads dominated by disk I/O (not CPU) trade CPU cycles for bytes on disk.
- zlib — legacy; high ratio but significantly more CPU per page, and zstd typically matches or beats its ratio at lower cost; rarely chosen in modern deployments.
- none — no compression. Every read / write is literal BSON on disk.
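A minimal mongosh sketch of the per-collection override, assuming a hypothetical collection named "events"; the createCollection storageEngine option and the wiredTiger.creationString stats field are real MongoDB surface area:

```javascript
// mongosh sketch — "events" is a hypothetical collection name.
// Per-collection override of the instance-wide blockCompressor default:
db.createCollection("events", {
  storageEngine: {
    wiredTiger: { configString: "block_compressor=zstd" }
  }
});

// Confirm what the storage engine actually applied; the creationString
// should contain "block_compressor=zstd".
db.events.stats().wiredTiger.creationString;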
Why it matters for schema iteration¶
A schema change's on-disk footprint is not the same as its
in-memory footprint, because compression operates between them.
MongoDB's collStats command surfaces all three relevant sizes:
- Data size — total uncompressed BSON bytes (in-memory form).
- Storage size — total compressed bytes on disk.
- Index size — uncompressed index B-tree bytes (a separate
budget; indexes are also compressed, via WiredTiger's
prefix_compression by default).
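The data-to-storage ratio falls directly out of those two collStats fields. A small sketch, where the field names (size, storageSize) match collStats output but the helper itself is a hypothetical convenience, not a MongoDB API:

```javascript
// Derive the compression ratio from collStats-style output.
// stats.size is uncompressed BSON bytes; stats.storageSize is on-disk bytes.
function compressionRatio(stats) {
  return stats.size / stats.storageSize; // uncompressed ÷ compressed
}

// appV5R3 from the case study: 11.96 GB of BSON stored in 3.24 GB.
const GB = 1024 ** 3;
const appV5R3 = { size: 11.96 * GB, storageSize: 3.24 * GB };
console.log(compressionRatio(appV5R3).toFixed(1)); // "3.7"
```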
Schemas with lots of repeated structure compress better. The
MongoDB Cost of Not Knowing Part 3 dynamic-schema approach —
sub-documents with tiny integer counters and numeric field
names — compresses very well under snappy because the repeating
{a: N, n: N, p: N, r: N} shape is dictionary-friendly.
Measured from the case study (appV5R3 baseline, then appV6R0 and appV6R1):
| Revision | Data (uncompressed) | Storage (compressed) | Ratio |
|---|---|---|---|
| appV5R3 | 11.96 GB | 3.24 GB | ~3.7× |
| appV6R0 | 11.10 GB | 3.33 GB | ~3.3× |
| appV6R1 | 8.19 GB | 2.34 GB | ~3.5× |
Trade-offs¶
- Lever 1: ratio vs CPU. snappy is the cheapest and compresses least; zstd spends more CPU for a noticeably higher ratio; zlib costs more CPU still without beating zstd's ratio. On a disk-bound workload (what Part-3's intro paragraph names as the remaining bottleneck from Part 2's appV5R4), swapping snappy for zstd trades CPU cycles for bytes read from disk; on a CPU-bound workload the same swap regresses.
- Lever 2: schema vs compressor choice. A dynamic schema like the one in MongoDB's case study acts as schema-level compression — it removes redundancy before WiredTiger ever sees the bytes. The remaining on-disk data is harder to compress further, because the schema-level pass has already taken the easy repeats.
- Read-path decompression cost is always paid. Pages in cache are uncompressed; compression savings apply to the disk I/O path and storage cost, not the WiredTiger cache footprint.
- Index compression is separate. WiredTiger prefix-compresses index keys by default, but the cache accounting is in uncompressed index bytes — shrinking data compression doesn't help an oversized index.
Seen in¶
- sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4 — article's intro paragraph explicitly names "modifying the storage compression algorithm" as a Part-3 lever alongside the dynamic schema. The raw we captured covers only the dynamic-schema side (appV6R0 + appV6R1 fully; appV6R2 stats only); one or more of the missing appV6R2 / R3 / R4 revisions presumably swaps snappy → zstd or disables compression. The numeric data vs storage ratios across appV5R3 / appV6R0 / appV6R1 in the case-study tables presume WiredTiger's default snappy on all three (not explicitly named in the raw).