
Bucket Pattern

Definition

Bucket Pattern — MongoDB's named schema-design pattern where many fine-grained events (one event = one would-be document) are grouped into a single bucket document by a shared key + a time window (day / month / quarter / year). Instead of 500 M event documents, the collection holds ~one bucket document per key per window, each bucket carrying an internal array (or, in the dynamic-schema variation, sub-document) of the events that fell inside its window.

The trade-off rebalances storage, indexing, and write amplification:

  • Fewer documents ⇒ smaller index. A per-event collection with 500 M docs needs 500 M _id entries; a bucketed collection with 33 M quarter-buckets needs ~15× fewer. systems/mongodb-server indexes every document in the _id B-tree; shrinking the index is a direct lever on WiredTiger cache pressure.
  • Denser documents ⇒ better BSON overhead amortization. Per-document overhead (field name headers, length prefixes, _id) is ~dozens of bytes; a 100-event bucket amortizes that over 100 events where a per-event collection pays it 100 times.
  • Each write becomes an upsert + $inc / $push. Write-amplification trade-off: an updateOne with upsert: true against a time-bucket _id is one network round-trip, but a growing bucket makes each write costlier — the legacy MMAPv1 engine used power-of-two allocation and moved documents that outgrew their slot, while WiredTiger rewrites the document copy-on-write on every update, so a large bucket amplifies even a small $push.
  • Queries filter the bucket then project the slice. Reads need both a $match on the bucket _id range and an in-document filter on the inner array/sub-document; an aggregation pipeline with $filter / $reduce / $objectToArray is the standard shape.
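The write and read shapes above can be sketched as plain update/aggregation documents. A minimal pymongo-style sketch — the field names (items, count, status) and the key:YYYYQn bucket-_id scheme are illustrative assumptions, not the article's exact schema:

```python
from datetime import datetime, timezone

def quarter_bucket_id(key: str, ts: datetime) -> str:
    # Bucket key = grouping key + time window (here: a calendar quarter).
    quarter = (ts.month - 1) // 3 + 1
    return f"{key}:{ts.year}Q{quarter}"

ts = datetime(2025, 5, 17, tzinfo=timezone.utc)
bucket_id = quarter_bucket_id("device-42", ts)  # "device-42:2025Q2"

# Write side: one upsert per event -- $push the event into the bucket's
# inner array and $inc a running count, all in one round-trip.
write_filter = {"_id": bucket_id}
write_update = {
    "$push": {"items": {"ts": ts, "status": "a"}},
    "$inc": {"count": 1},
}
# collection.update_one(write_filter, write_update, upsert=True)

# Read side: $match on the bucket _id range, then $filter the inner
# array down to the slice the query actually wants.
read_pipeline = [
    {"$match": {"_id": {"$gte": "device-42:2025Q1",
                        "$lte": "device-42:2025Q4"}}},
    {"$project": {
        "items": {"$filter": {
            "input": "$items",
            "as": "e",
            "cond": {"$eq": ["$$e.status", "a"]},
        }},
    }},
]
```

Note that the $match prunes whole buckets via the _id index, while the $filter runs per matched document in memory — the two stages are not interchangeable.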

When to use

  • Time-bucketed counters / metrics / events. Per-key status counts binned by day / month / quarter — the MongoDB "Cost of Not Knowing" event-counter running example fits exactly.
  • IoT sensor streams. A single device writes samples at regular intervals; one document per device per hour is a common starting point.
  • Log aggregation where queries are "last hour per source" style, not "this one specific event."
  • Pre-aggregation surfaces. Combined with the concepts/computed-pattern — bucket by window, pre-aggregate per-sub-bucket status totals at write time.
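For the IoT case, "one document per device per hour" reduces to choosing an hour-granularity bucket key. A minimal sketch — the _id format and sample-document shape are assumptions for illustration:

```python
from datetime import datetime, timezone

def hour_bucket_id(device_id: str, ts: datetime) -> str:
    # One bucket document per device per hour -- a common starting
    # granularity for regular-interval sensor streams.
    return f"{device_id}:{ts.strftime('%Y-%m-%dT%H')}"

# What a bucket might look like mid-hour (shape is illustrative):
bucket_doc = {
    "_id": hour_bucket_id("sensor-7",
                          datetime(2025, 5, 17, 14, 30, tzinfo=timezone.utc)),
    "count": 2,
    "samples": [
        {"ts": "2025-05-17T14:00:12Z", "temp_c": 21.4},
        {"ts": "2025-05-17T14:01:12Z", "temp_c": 21.5},
    ],
}
```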

When not to use

  • Point-lookup workloads on individual events. If the primary query is "fetch event 12345," bucketing adds an extraction step with no corresponding read savings.
  • Unbounded bucket cardinality. MongoDB's per-document limit is 16 MB; buckets that grow without bound hit the ceiling and require either finer time-bucketing or a dynamic re-bucketing step.
  • Workloads without a natural bucketing dimension. Events without a time axis or without a high-cardinality grouping key don't benefit; the bucket becomes a meaningless wrapper.
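One common guard against the 16 MB ceiling is a size-capped bucket: put a count predicate in the upsert filter so a full bucket stops matching and the upsert opens a fresh one. A sketch of that technique (field names and the 200-event cap are assumptions; note the filter keys on a plain field, not _id, so a new bucket can be created):

```python
# Cap buckets at MAX_EVENTS: once count reaches the cap, the filter no
# longer matches the full bucket, so upsert=True creates a fresh bucket
# (MongoDB copies the equality field "key" into the new document)
# instead of growing the old one toward the 16 MB document limit.
MAX_EVENTS = 200

capped_filter = {"key": "device-42", "count": {"$lt": MAX_EVENTS}}
capped_update = {
    "$push": {"items": {"ts": "2025-05-17T14:00:12Z", "status": "a"}},
    "$inc": {"count": 1},
}
# collection.update_one(capped_filter, capped_update, upsert=True)
```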

Relationship to other MongoDB schema patterns

  • concepts/computed-pattern — often applied on top of Bucket: each bucket stores pre-aggregated status counters (e.g. {a: 10, n: 3, p: 0, r: 1}) rather than raw events.
  • patterns/dynamic-schema-field-name-encoding — further shrinks the bucket's inner array to a sub-document whose field names encode data (day-of-month, day-of-quarter). MongoDB's Cost of Not Knowing Part 3 treats this as the natural next step after Bucket + Computed.
  • Attribute Pattern — cousin: denormalize variable attributes into an array of {k, v} objects. Different axis — attribute heterogeneity, not temporal grouping — but shares the denormalize-into-nested-structure move.
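The Bucket + Computed + dynamic-field-name stack collapses a whole event into a single counter increment. A minimal sketch, assuming the day-of-month is encoded as a dNN field name and statuses as one-letter keys (both illustrative, in the spirit of the {a: 10, n: 3, …} example above):

```python
from datetime import datetime, timezone

def computed_inc(ts: datetime, status: str) -> dict:
    # Computed pattern on top of Bucket: instead of pushing the raw
    # event, increment a pre-aggregated per-day status counter. The
    # day-of-month becomes a field name (dynamic-schema field-name
    # encoding), so the whole write is one $inc on a dotted path.
    day_field = f"d{ts.day:02d}"
    return {"$inc": {f"items.{day_field}.{status}": 1}}

update = computed_inc(datetime(2025, 5, 5, tzinfo=timezone.utc), "a")
# → {"$inc": {"items.d05.a": 1}}
```

The raw event is no longer recoverable from the bucket — the trade the Computed pattern always makes — but each write touches a few bytes instead of appending an array element.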

Seen in

  • sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4 — baseline of the whole appV5RX / appV6RX family. Part 2 (not yet ingested) introduced the Bucket + Computed combination; Part 3 builds the dynamic-schema variation on top. appV5R0 bucketed by year-in-_id + items array; appV5R1 → appV5R4 varied the bucketing granularity (year / quarter / month) and the per-element aggregation (raw event vs computed totals). Quarter-bucketing with per-day computed totals (appV5R3) was the Part-2 winner — 33 M documents, 11.96 GB data, 1.11 GB index, 385 B avg document from 500 M events.