
BSON document overhead

Definition

BSON document overhead is the per-document and per-field bytes that MongoDB's on-the-wire / on-disk BSON binary encoding pays regardless of user data: a 4-byte document length prefix, a terminating null byte, and a per-field header (a one-byte type tag, the field name with its null terminator, and any value-length prefix the type requires).

These bytes are invisible in a JSON.stringify() view of the same document but are real on disk and real in the WiredTiger cache. They shape the economics of every schema-design trade-off.

Per-field overhead breakdown

For a scalar field "status": 1 in a BSON document:

  • 1 byte: type tag (0x10 = int32)
  • N bytes: field name "status" = 6 bytes
  • 1 byte: field-name null terminator
  • 4 bytes: int32 value

12 bytes total for a 4-byte value. The field name alone costs more than the value it carries; the type tag + null bytes add another 2.

Field names are stored once per occurrence, not interned. Storing 1,000 array elements shaped like {status: 1}, {status: 2}, ... pays the "status" name tax 1,000 times.
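The per-field layout is small enough to verify by hand. A minimal stdlib sketch of the int32 element encoding (not a full BSON library):

```python
import struct

def int32_field(name: str, value: int) -> bytes:
    """One BSON int32 element: 0x10 type tag, cstring name, 4-byte LE value."""
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)

field = int32_field("status", 1)
print(len(field))  # 12 bytes: 1 tag + 6 name + 1 terminator + 4 value
```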

Per-document overhead

Each BSON document pays:

  • 4 bytes document length header
  • 1 byte trailing null terminator
  • Plus, as a stored document, a default _id field holding a 12-byte ObjectId if not overridden; encoded, that element costs 17 bytes (1-byte type tag + 4 bytes for "_id" and its null terminator + the 12-byte value).

~22 bytes minimum per document regardless of content.

For a million-document collection with tiny payloads, per-document overhead dominates; for a collection of few large documents, it's negligible.
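Framing an empty document, and one holding only a default _id (with 12 placeholder bytes standing in for a real ObjectId), reproduces that floor — a stdlib sketch:

```python
import struct

def bson_doc(fields: bytes) -> bytes:
    """Wrap encoded elements in the 4-byte length prefix and trailing null."""
    body = fields + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

empty = bson_doc(b"")
print(len(empty))  # 5 bytes: the bare framing

# Default _id element: 0x07 type tag + "_id\0" + 12-byte ObjectId value = 17
id_field = b"\x07_id\x00" + bytes(12)  # placeholder ObjectId bytes
minimal = bson_doc(id_field)
print(len(minimal))  # 22 bytes: the practical per-document floor
```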

Why it shapes schema design

Three direct consequences:

  1. Favor fewer larger documents over many small ones. The concepts/bucket-pattern leans on this: collapse 100 per-event documents into one 100-event bucket and save roughly 99 × 22 ≈ 2.2 KB of per-document overhead alone (the surviving bucket still pays its own).
  2. Short field names at scale. "approved": 10 costs 1 + 8 + 1 + 4 = 14 bytes; "a": 10 costs 1 + 1 + 1 + 4 = 7 bytes — half. At 500 M events this compounds into gigabytes. MongoDB Cost of Not Knowing Part 1 (not yet ingested) walks through this as its first optimization.
  3. Dynamic schemas move a value into the field-name position. Instead of every element paying for a literal "date" key alongside a "0605" value, the field name is the date itself, so the per-element "date" tax vanishes. Measured result in the case study: appV5R3 → appV6R0 document size dropped from 385 B to 125 B (a 67.5 % reduction).
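Point 2 is directly measurable with the same element layout (int32 values assumed):

```python
import struct

def int32_field(name: str, value: int) -> bytes:
    """One BSON int32 element: 0x10 type tag, cstring name, 4-byte LE value."""
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)

long_name = len(int32_field("approved", 10))
short_name = len(int32_field("a", 10))
print(long_name, short_name)  # 14 vs 7: renaming halves the per-field cost
```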

Compression interaction

WiredTiger's default snappy compression operates on whole pages (~32 KB blocks by default); repeated field names compress extremely well. So:

  • Uncompressed BSON overhead — what in-cache documents cost (WiredTiger cache holds uncompressed pages).
  • Compressed on-disk storage — typically 3–4× smaller than the uncompressed form for repetitive schemas. MongoDB's collStats reports both as size (uncompressed data) and storageSize (compressed).

The cache budget is about uncompressed overhead; the disk I/O + storage-cost budgets are about compressed bytes. A schema that wins on one axis can still lose on the other. See concepts/document-storage-compression.
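The repetition effect is easy to reproduce. In this sketch, zlib stands in for snappy (both exploit repeated byte sequences) on a synthetic "page" of 1,000 identically shaped documents:

```python
import struct
import zlib

def int32_field(name: str, value: int) -> bytes:
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)

def bson_doc(fields: bytes) -> bytes:
    body = fields + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

# A synthetic page: 1,000 tiny documents, all repeating the "status" name.
page = b"".join(bson_doc(int32_field("status", i)) for i in range(1000))
packed = zlib.compress(page)
print(len(page), len(packed))  # the repeated field names all but disappear
```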

Why document size didn't scale linearly with bucketing range

In MongoDB's case study, widening the bucketing window from month (appV6R0) to quarter (appV6R1) stored 3× the data per bucket but documents grew only ~2× in size (125 B → 264 B). The missing factor is BSON overhead amortization:

  • Per-document overhead (length prefix, _id, trailing null) is paid once per bucket regardless of how many inner elements it carries.
  • Per-field overhead inside items is paid once per encoded day — but the outer items field name itself is paid once per document.
  • Denser documents pay these fixed costs less often.

The author's initial arithmetic-based prediction (3× data → 3× document) was therefore pessimistic; measurement showed a better ratio because overhead stayed flat as density grew.
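A back-of-the-envelope check: assuming document size grows linearly in months covered, fitting to the two measured points from the case study (1 month → 125 B, 3 months → 264 B) separates the flat overhead from the per-month payload:

```python
# size(n) = fixed + n * per_month; solve using the two measurements.
per_month = (264 - 125) / (3 - 1)  # 69.5 B of payload per month covered
fixed = 125 - per_month            # 55.5 B paid once per bucket
print(fixed, per_month)
print(264 / 125)  # tripling the payload grows the document only ~2.1x
```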
