
CONCEPT

Document storage compression

Definition

Document storage compression is the per-block compression applied by a database's storage engine before writing pages to disk; pages are decompressed when loaded back into the cache. In MongoDB's WiredTiger storage engine this is configurable per collection via storage.wiredTiger.collectionConfig.blockCompressor.

Available algorithms:

  • snappy — WiredTiger's default. Google's fast compression library; moderate ratio, very low CPU overhead. Designed for "fast over small."
  • zstd — higher ratio than snappy at higher CPU cost; added as a WiredTiger option to let workloads dominated by disk I/O (not CPU) trade CPU cycles for bytes on disk.
  • zlib — legacy option; higher ratio than snappy but significantly more CPU per page, and per MongoDB's own docs zstd typically matches or beats its ratio at lower CPU cost — rarely chosen in modern deployments.
  • none — no compression; pages are written to disk as literal uncompressed BSON.
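
A minimal sketch of where this knob lives, assuming MongoDB's standard mongod.conf YAML layout (the zstd choice here is illustrative, not a recommendation):

```yaml
# mongod.conf — default block compressor for newly created collections
storage:
  wiredTiger:
    collectionConfig:
      blockCompressor: zstd   # snappy (default) | zstd | zlib | none
```

Per-collection overrides go through db.createCollection with a storageEngine option (configString: "block_compressor=zstd"); existing collections keep the compressor they were created with.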

Why it matters for schema iteration

A schema change's on-disk footprint is not the same as its in-memory footprint, because compression sits between them. MongoDB's collStats surfaces each side, plus the separate index budget:

  • Data size — total uncompressed BSON bytes (in-memory form).
  • Storage size — total compressed bytes on disk.
  • Index size — uncompressed index B-tree bytes (separate budget; indexes also get compressed via WiredTiger's prefix_compression by default).
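
The two data budgets can be pulled apart with a small helper. The collStats field names below (size, storageSize, totalIndexSize) are the standard ones, but the numbers are plugged in from the appV5R3 row of the case study rather than read from a live server:

```python
def compression_ratio(stats: dict) -> float:
    """Uncompressed BSON bytes divided by compressed bytes on disk."""
    return stats["size"] / stats["storageSize"]

# Stand-in for db.command("collStats", ...) output; a ratio is unit-free,
# so the case-study GB figures are used directly. totalIndexSize is an
# illustrative placeholder, not a measured number.
stats = {
    "size": 11.96,          # data size: uncompressed BSON
    "storageSize": 3.24,    # storage size: compressed bytes on disk
    "totalIndexSize": 1.0,  # index size: separate budget (illustrative)
}

print(f"ratio ~{compression_ratio(stats):.1f}x")  # prints "ratio ~3.7x"
```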

Schemas with lots of repeated structure compress better. The MongoDB Cost of Not Knowing Part 3 dynamic-schema approach — sub-documents with tiny integer counters and numeric field names — compresses very well under snappy because the repeating {a: N, n: N, p: N, r: N} shape is dictionary-friendly.
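
The dictionary-friendliness claim is easy to sanity-check. A sketch using Python's stdlib zlib as a stand-in for snappy (both are LZ-family compressors, so the qualitative effect carries over) and JSON standing in for BSON:

```python
import json
import os
import zlib

# Many documents sharing the repeating {a, n, p, r} shape from the case study.
docs = [{"a": i % 10, "n": i % 7, "p": i % 5, "r": i % 3} for i in range(5000)]
payload = json.dumps(docs).encode()

repeated = zlib.compress(payload)
# Incompressible baseline: random bytes of the same length.
random_out = zlib.compress(os.urandom(len(payload)))

print(f"repetitive shape: {len(payload)} -> {len(repeated)} bytes")
print(f"random bytes:     {len(payload)} -> {len(random_out)} bytes")
```

The repeating shape shrinks by an order of magnitude, while the random baseline stays essentially the same size — the structural redundancy, not the values, is what the block compressor feeds on.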

Measured from the case study (appV5R3 vs appV6R1):

Revision   Data (uncompressed)   Storage (compressed)   Ratio
appV5R3    11.96 GB              3.24 GB                ~3.7×
appV6R1    8.19 GB               2.34 GB                ~3.5×
appV6R0    11.10 GB              3.33 GB                ~3.3×

Trade-offs

  • Lever 1: ratio vs CPU. Moving from snappy to zstd buys a higher ratio for more CPU per page (zlib costs still more CPU without a ratio advantage over zstd). On a disk-bound workload (what Part 3's intro paragraph names as the remaining bottleneck from Part 2's appV5R4), swapping snappy for zstd trades CPU cycles for bytes read from disk; on a CPU-bound workload the same swap regresses throughput.
  • Lever 2: schema vs compressor choice. A dynamic schema like the one in MongoDB's case study acts as schema-level compression — it removes redundancy before WiredTiger ever sees the bytes. The remaining on-disk data is harder to compress further because the schema-level pass has already removed the easy repeats.
  • Read-path decompression cost is always paid. Pages in cache are uncompressed; compression savings apply to the disk I/O path and storage cost, not the WiredTiger cache footprint.
  • Index compression is separate. WiredTiger prefix-compresses index keys by default, but cache accounting uses uncompressed index bytes — improving data compression doesn't help an oversized index.
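
Lever 1 can be felt even inside a single library. A sketch using zlib's compression levels as a stand-in for the snappy → zstd → zlib progression (lower level ≈ fast and loose, higher level ≈ more CPU for fewer bytes); absolute timings will vary by machine:

```python
import time
import zlib

# Compressible payload mimicking repeated sub-document shapes.
payload = b'{"a": 1, "n": 2, "p": 3, "r": 4}' * 50_000

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(out):>8} bytes in {elapsed_ms:.1f} ms")
```

Higher levels spend more CPU time to emit fewer bytes — the same trade a disk-bound workload makes when it swaps snappy for zstd, and the same trade that backfires when CPU is already the bottleneck.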

Seen in

  • sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4 — the article's intro paragraph explicitly names "modifying the storage compression algorithm" as a Part-3 lever alongside the dynamic schema. The raw we captured covers only the dynamic-schema side (appV6R0 + appV6R1 fully; appV6R2 stats only); one or more of the missing appV6R2 / R3 / R4 revisions presumably swaps snappy → zstd or disables compression. The data-vs-storage ratios across appV5R3 / appV6R0 / appV6R1 in the case-study tables presume WiredTiger's default snappy on all three (not explicitly named in the raw).