
BSON document overhead

Definition

BSON document overhead is the per-document and per-field bytes that MongoDB's on-the-wire / on-disk BSON binary encoding pays regardless of user data: a 4-byte document length prefix, a terminating null byte, and a per-field header (a one-byte type tag, the field name with its null terminator, and any value-length prefix the type requires).

These bytes are invisible in a JSON.stringify() view of the same document but are real on disk and real in the WiredTiger cache. They shape the economics of every schema-design trade-off.

Per-field overhead breakdown

For a scalar field "status": 1 in a BSON document:

  • 1 byte: type tag (0x10 = int32)
  • N bytes: field name "status" = 6 bytes
  • 1 byte: field-name null terminator
  • 4 bytes: int32 value

12 bytes total for a 4-byte value. The field name alone costs more than the value it carries; the type tag + null bytes add another 2.

Field names are stored once per occurrence, not interned. Storing 1,000 array elements shaped like {status: 1}, {status: 2}, ... pays the "status" name tax 1,000 times.
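The per-field layout is small enough to verify by hand. A minimal stdlib sketch of the int32 element encoding (not a full BSON library):

```python
import struct

def int32_field(name: str, value: int) -> bytes:
    """One BSON int32 element: 0x10 type tag, cstring name, 4-byte LE value."""
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)

field = int32_field("status", 1)
print(len(field))  # 12 bytes: 1 tag + 6 name + 1 terminator + 4 value
```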

Per-document overhead

Each BSON document pays:

  • 4 bytes document length header
  • 1 byte trailing null terminator
  • Plus, as a stored document, a default _id field holding a 12-byte ObjectId if not overridden; encoded, that element costs 17 bytes (1-byte type tag + 4 bytes for "_id" and its null terminator + the 12-byte value).

~22 bytes minimum per document regardless of content.

For a million-document collection with tiny payloads, per-document overhead dominates; for a collection of few large documents, it's negligible.
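Framing an empty document, and one holding only a default _id (with 12 placeholder bytes standing in for a real ObjectId), reproduces that floor — a stdlib sketch:

```python
import struct

def bson_doc(fields: bytes) -> bytes:
    """Wrap encoded elements in the 4-byte length prefix and trailing null."""
    body = fields + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

empty = bson_doc(b"")
print(len(empty))  # 5 bytes: the bare framing

# Default _id element: 0x07 type tag + "_id\0" + 12-byte ObjectId value = 17
id_field = b"\x07_id\x00" + bytes(12)  # placeholder ObjectId bytes
minimal = bson_doc(id_field)
print(len(minimal))  # 22 bytes: the practical per-document floor
```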

Why it shapes schema design

Three direct consequences:

  1. Favor fewer larger documents over many small ones. The concepts/bucket-pattern leans on this: collapse 100 per-event documents into one 100-event bucket and save roughly 99 × 22 ≈ 2.2 KB of per-document overhead alone (the surviving bucket still pays its own).
  2. Short field names at scale. "approved": 10 costs 1 + 8 + 1 + 4 = 14 bytes; "a": 10 costs 1 + 1 + 1 + 4 = 7 bytes — half. At 500 M events this compounds into gigabytes. MongoDB Cost of Not Knowing Part 1 (not yet ingested) walks through this as its first optimization.
  3. Dynamic schemas move a value into the field-name position. Instead of every element paying for a literal "date" key alongside a "0605" value, the field name is the date itself, so the per-element "date" tax vanishes. Measured result in the case study: appV5R3 → appV6R0 document size dropped from 385 B to 125 B (a 67.5 % reduction).
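Point 2 is directly measurable with the same element layout (int32 values assumed):

```python
import struct

def int32_field(name: str, value: int) -> bytes:
    """One BSON int32 element: 0x10 type tag, cstring name, 4-byte LE value."""
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)

long_name = len(int32_field("approved", 10))
short_name = len(int32_field("a", 10))
print(long_name, short_name)  # 14 vs 7: renaming halves the per-field cost
```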

Compression interaction

WiredTiger's default snappy compression operates on whole pages (~32 KB blocks by default); repeated field names compress extremely well. So:

  • Uncompressed BSON overhead — what in-cache documents cost (WiredTiger cache holds uncompressed pages).
  • Compressed on-disk storage — typically 3–4× smaller than the uncompressed form for repetitive schemas. MongoDB's collStats reports both as size (uncompressed data) and storageSize (compressed).

The cache budget is about uncompressed overhead; the disk I/O + storage-cost budgets are about compressed bytes. A schema that wins on one axis can still lose on the other. See concepts/document-storage-compression.
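The repetition effect is easy to reproduce. In this sketch, zlib stands in for snappy (both exploit repeated byte sequences) on a synthetic "page" of 1,000 identically shaped documents:

```python
import struct
import zlib

def int32_field(name: str, value: int) -> bytes:
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)

def bson_doc(fields: bytes) -> bytes:
    body = fields + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

# A synthetic page: 1,000 tiny documents, all repeating the "status" name.
page = b"".join(bson_doc(int32_field("status", i)) for i in range(1000))
packed = zlib.compress(page)
print(len(page), len(packed))  # the repeated field names all but disappear
```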

Why document size didn't scale linearly with bucketing range

In MongoDB's case study, widening the bucketing window from month (appV6R0) to quarter (appV6R1) stored 3× the data per bucket but documents grew only ~2× in size (125 B → 264 B). The missing factor is BSON overhead amortization:

  • Per-document overhead (length prefix, _id, trailing null) is paid once per bucket regardless of how many inner elements it carries.
  • Per-field overhead inside items is paid once per encoded day — but the outer items field name itself is paid once per document.
  • Denser documents pay these fixed costs less often.

The author's initial arithmetic-based prediction (3× data → 3× document) was therefore pessimistic; measurement showed a better ratio because overhead stayed flat as density grew.
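A back-of-the-envelope check: assuming document size grows linearly in months covered, fitting to the two measured points from the case study (1 month → 125 B, 3 months → 264 B) separates the flat overhead from the per-month payload:

```python
# size(n) = fixed + n * per_month; solve using the two measurements.
per_month = (264 - 125) / (3 - 1)  # 69.5 B of payload per month covered
fixed = 125 - per_month            # 55.5 B paid once per bucket
print(fixed, per_month)
print(264 / 125)  # tripling the payload grows the document only ~2.1x
```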
