PATTERN

Dynamic-schema field-name encoding

Intent

Reduce document size + storage footprint of MongoDB collections that use the concepts/bucket-pattern + concepts/computed-pattern combo by promoting a bounded-cardinality discriminator from a value position to a field-name position inside a sub-document, eliminating per-element BSON overhead for the discriminator.

Context

The starting point is a schema already using Bucket + Computed: events are grouped into time-bucket documents whose items field is an array of {date, <pre-aggregated status fields>} elements. Example (appV5R3 from MongoDB Cost of Not Knowing Part 2):

{
  _id: <key + year + quarter>,  // coarse-grained filter
  items: [
    { date: 2022-06-05, a: 10, n: 3 },
    { date: 2022-06-16, p: 1, r: 1 },
    { date: 2022-06-27, a: 5, r: 1 },
    ...
  ]
}

Observed problem: document size is still dominated by the items array, specifically by the repeated "date" field name and its per-element BSON overhead. Load tests show that disk throughput (how many bytes per second the server can read compressed pages off disk) is the remaining bottleneck.

Solution

Promote the discriminator (here, the date) from a value in each element to the field name of a sub-document:

{
  _id: <key + year + quarter>,
  items: {
    "0605": { a: 10, n: 3 },   // "0605" = June 5 (MMDD)
    "0616": { p: 1, r: 1 },
    "0627": { a: 5, r: 1 },
    ...
  }
}

Because the year and quarter already live in _id, only the within-quarter, day-granular discriminator needs encoding. The "date" field-name tax (the bytes the literal string "date" costs in every element) is eliminated: each element's field name now is the discriminator itself.
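A minimal sketch of the encoding in plain JavaScript (the getMMDD helper appears in the write snippet below this page; its body here is an assumed implementation):

```javascript
// Sketch: derive the within-quarter "MMDD" field name from an event date.
// Zero-padding keeps keys a fixed 4 bytes and lexically sortable.
function getMMDD(date) {
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(date.getUTCDate()).padStart(2, "0");
  return mm + dd;
}

// Encode two events into the dynamic-schema sub-document shape.
const items = {};
for (const ev of [
  { date: new Date(Date.UTC(2022, 5, 5)), a: 10, n: 3 },
  { date: new Date(Date.UTC(2022, 5, 16)), p: 1, r: 1 },
]) {
  const { date, ...counters } = ev;
  items[getMMDD(date)] = counters;
}
// items → { "0605": { a: 10, n: 3 }, "0616": { p: 1, r: 1 } }
```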

Writes remain updateOne + $inc:

const MMDD = getMMDD(event.date);               // e.g. "0605"
await collection.updateOne(
  { _id: buildId(event.key, event.date) },      // <key + year + quarter>
  { $inc: {
      [`items.${MMDD}.a`]: event.approved,
      [`items.${MMDD}.n`]: event.noFunds,
      [`items.${MMDD}.p`]: event.pending,
      [`items.${MMDD}.r`]: event.rejected,
  } },
  { upsert: true },
);

Reads pay the cost in the aggregation pipeline: $objectToArray + $reduce, plus per-element date reconstruction from the _id-derived year and the field-name-derived month-day.
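Server-side that reconstruction runs inside the pipeline via $objectToArray + $reduce; a pure-JavaScript equivalent of the same step (assuming the year has already been parsed out of the bucket's _id) looks like:

```javascript
// Client-side equivalent of the read pipeline's reconstruction step:
// turn the dynamic "MMDD" keys back into real dates plus their counters.
// The year is assumed to be already extracted from the bucket's _id.
function decodeBucket(year, items) {
  return Object.entries(items).map(([mmdd, counters]) => ({
    date: new Date(Date.UTC(year, Number(mmdd.slice(0, 2)) - 1, Number(mmdd.slice(2)))),
    ...counters,
  }));
}

const rows = decodeBucket(2022, {
  "0605": { a: 10, n: 3 },
  "0616": { p: 1, r: 1 },
});
// rows[0].date → 2022-06-05T00:00:00.000Z
```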

Measured impact (MongoDB case study)

Revision   Schema shape                         Doc size   Docs     Data       Index
appV5R3    Bucket + Computed, static items[]    385 B      33.4 M   11.96 GB   1.11 GB
appV6R0    + Dynamic schema, monthly bucket     125 B      95.3 M   11.1 GB    3.13 GB
appV6R1    + Dynamic schema, quarterly bucket   264 B      33.4 M   8.19 GB    1.22 GB

appV6R1 is the pattern's canonical win: 31.4 % smaller documents and 28.1 % smaller per-event total footprint vs. appV5R3 at equivalent index pressure.

appV6R0 demonstrates the pattern's headline caveat: shrinking documents without considering index cardinality can shift the bottleneck from disk-throughput to index-in-cache, producing no net throughput gain.

Consequences

Benefits

  • Substantial document-size shrink (67.5 % at the single-revision level, 31.4 % after accounting for bucketing-width re-tuning).
  • Better per-event storage density. 17.6 B data / event vs 25.7 B baseline (31.5 % reduction).
  • Amortization of BSON per-document overhead. Wider buckets mean each document's fixed BSON overhead (_id, length prefix, trailing null) amortizes over more events. Document size doesn't scale linearly with bucketing range: in the case study, a 3×-wider bucket only roughly doubled document size (125 B → 264 B).
  • Works with $inc upsert semantics natively. No special initialization path for new bucket days; MongoDB treats missing fields as zero for $inc.
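That last point can be sanity-checked without a server. A small simulation of $inc's create-on-missing semantics in plain JavaScript (not the driver API, just the behavior):

```javascript
// Simulates MongoDB's $inc on a nested dotted path: missing intermediate
// objects are created, and a missing leaf counter starts from 0.
function applyInc(doc, path, amount) {
  const keys = path.split(".");
  let node = doc;
  for (const k of keys.slice(0, -1)) {
    node = node[k] ??= {};            // create sub-document if absent
  }
  const leaf = keys[keys.length - 1];
  node[leaf] = (node[leaf] ?? 0) + amount;
  return doc;
}

const bucket = { _id: "key|2022|Q2", items: {} };
applyInc(bucket, "items.0605.a", 10); // first event of the day: no init path
applyInc(bucket, "items.0605.a", 5);  // subsequent events just accumulate
// bucket.items → { "0605": { a: 15 } }
```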

Costs

  • Query-side complexity. Range-filtering on the encoded discriminator requires $objectToArray + $reduce + date reconstruction from pieces of _id + field name. More CPU per matched document; harder to author; harder to debug.
  • No index on the encoded dimension. The dynamic sub-document is opaque to the index system. All filtering within a bucket happens in-memory inside the aggregation pipeline, after $match on _id.
  • Harder schema validation. Validating the dynamic field requires JSON Schema patternProperties or bypassing validation on that field entirely.
  • Opaque to Compass / generic tooling. UI shows "0605" as a literal field name with no semantic hint.
  • Requires bounded discriminator cardinality. Works for day-of-quarter (max 92 values). Breaks for unbounded-cardinality discriminators like full timestamps.
  • Bottleneck migration is guaranteed. The pattern is so successful at shrinking documents that the next-slowest resource (typically index-in-cache) becomes the new dominant cost. Must be applied in a measure-change-re-measure loop, not open-loop.

When to use

  • Time-bucketed counter / metric workloads with a single coarse index predicate (_id range-filter) and a dense inner aggregation-pipeline step.
  • After lighter schema-shrink techniques (field-name shortening, data-type tightening, Bucket + Computed) have already been applied and disk throughput is the measured remaining bottleneck.
  • On workloads where reads tolerate extra aggregation compute in exchange for storage efficiency, and write rates are high enough that storage-cost savings actually matter.

When not to use

  • Workloads without a bounded, low-cardinality within-bucket discriminator.
  • Workloads dominated by index-in-cache pressure rather than disk throughput — the pattern can make the wrong resource bigger (see appV6R0).
  • Applications that need per-event indexing for ad-hoc queries.
  • Hot-path reads where aggregation-pipeline CPU is already saturated.

Seen in

  • sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4 — appV6R0 (monthly bucket, "DD" field names) and appV6R1 (quarterly bucket, "MMDD" field names). appV6R1 is the canonical wiki-facing instance of this pattern successfully applied. Author notes the pattern "isn't very common to see" and attributes its construction to a senior MongoDB developer who built it on top of all previous appV5RX revisions.