PATTERN

Dynamic-schema field-name encoding

Intent

Reduce document size + storage footprint of MongoDB collections that use the concepts/bucket-pattern + concepts/computed-pattern combo by promoting a bounded-cardinality discriminator from a value position to a field-name position inside a sub-document, eliminating per-element BSON overhead for the discriminator.

Context

The starting point is a schema already using Bucket + Computed: events are grouped into time-bucket documents whose items field is an array of {date, <pre-aggregated status fields>} elements. Example (appV5R3 from MongoDB Cost of Not Knowing Part 2):

{
  _id: <key + year + quarter>,  // coarse-grained filter
  items: [
    { date: 2022-06-05, a: 10, n: 3 },
    { date: 2022-06-16, p: 1, r: 1 },
    { date: 2022-06-27, a: 5, r: 1 },
    ...
  ]
}

Observed problem: document size is still dominated by the items array, specifically by the repeated "date" field name and its per-element BSON overhead. Load tests show that disk throughput (how many bytes per second the server can read compressed pages off disk) is the remaining bottleneck.

Solution

Promote the discriminator (here, the date) from a value in each element to the field name of a sub-document:

{
  _id: <key + year + quarter>,
  items: {
    "0605": { a: 10, n: 3 },   // "0605" = June 5 (MMDD)
    "0616": { p: 1, r: 1 },
    "0627": { a: 5, r: 1 },
    ...
  }
}

Because the year and quarter already live in _id, only the within-quarter, day-granular discriminator needs encoding. The "date" field-name tax (the bytes the literal string "date" costs in every element) is eliminated: each element's field name now is the discriminator itself.
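A minimal sketch of the encoding in plain JavaScript (the getMMDD helper appears in the write snippet below this page; its body here is an assumed implementation):

```javascript
// Sketch: derive the within-quarter "MMDD" field name from an event date.
// Zero-padding keeps keys a fixed 4 bytes and lexically sortable.
function getMMDD(date) {
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(date.getUTCDate()).padStart(2, "0");
  return mm + dd;
}

// Encode two events into the dynamic-schema sub-document shape.
const items = {};
for (const ev of [
  { date: new Date(Date.UTC(2022, 5, 5)), a: 10, n: 3 },
  { date: new Date(Date.UTC(2022, 5, 16)), p: 1, r: 1 },
]) {
  const { date, ...counters } = ev;
  items[getMMDD(date)] = counters;
}
// items → { "0605": { a: 10, n: 3 }, "0616": { p: 1, r: 1 } }
```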

Writes remain updateOne + $inc:

const MMDD = getMMDD(event.date);               // e.g. "0605"
await collection.updateOne(
  { _id: buildId(event.key, event.date) },      // <key + year + quarter>
  { $inc: {
      [`items.${MMDD}.a`]: event.approved,
      [`items.${MMDD}.n`]: event.noFunds,
      [`items.${MMDD}.p`]: event.pending,
      [`items.${MMDD}.r`]: event.rejected,
  } },
  { upsert: true },
);

Reads pay the cost in the aggregation pipeline: $objectToArray + $reduce, plus per-element date reconstruction from the _id-derived year and the field-name-derived month-day.
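Server-side that reconstruction runs inside the pipeline via $objectToArray + $reduce; a pure-JavaScript equivalent of the same step (assuming the year has already been parsed out of the bucket's _id) looks like:

```javascript
// Client-side equivalent of the read pipeline's reconstruction step:
// turn the dynamic "MMDD" keys back into real dates plus their counters.
// The year is assumed to be already extracted from the bucket's _id.
function decodeBucket(year, items) {
  return Object.entries(items).map(([mmdd, counters]) => ({
    date: new Date(Date.UTC(year, Number(mmdd.slice(0, 2)) - 1, Number(mmdd.slice(2)))),
    ...counters,
  }));
}

const rows = decodeBucket(2022, {
  "0605": { a: 10, n: 3 },
  "0616": { p: 1, r: 1 },
});
// rows[0].date → 2022-06-05T00:00:00.000Z
```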

Measured impact (MongoDB case study)

Revision   Schema shape                         Doc size   Docs     Data       Index
appV5R3    Bucket + Computed, static items[]    385 B      33.4 M   11.96 GB   1.11 GB
appV6R0    + Dynamic schema, monthly bucket     125 B      95.3 M   11.1 GB    3.13 GB
appV6R1    + Dynamic schema, quarterly bucket   264 B      33.4 M   8.19 GB    1.22 GB

appV6R1 is the pattern's canonical win: 31.4 % smaller documents and 28.1 % smaller per-event total footprint vs. appV5R3 at equivalent index pressure.

appV6R0 demonstrates the pattern's headline caveat: shrinking documents without considering index cardinality can shift the bottleneck from disk-throughput to index-in-cache, producing no net throughput gain.

Consequences

Benefits

  • Substantial document-size shrink (67.5 % at the single-revision level, 31.4 % after accounting for bucketing-width re-tuning).
  • Better per-event storage density. 17.6 B data / event vs 25.7 B baseline (31.5 % reduction).
  • Amortization of BSON per-document overhead. Wider buckets mean each document's fixed BSON overhead (_id, length prefix, trailing null) amortizes over more events. Document size doesn't scale linearly with bucketing range: in the case study, a 3×-wider bucket only roughly doubled document size (125 B → 264 B).
  • Works with $inc upsert semantics natively. No special initialization path for new bucket days; MongoDB treats missing fields as zero for $inc.
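That last point can be sanity-checked without a server. A small simulation of $inc's create-on-missing semantics in plain JavaScript (not the driver API, just the behavior):

```javascript
// Simulates MongoDB's $inc on a nested dotted path: missing intermediate
// objects are created, and a missing leaf counter starts from 0.
function applyInc(doc, path, amount) {
  const keys = path.split(".");
  let node = doc;
  for (const k of keys.slice(0, -1)) {
    node = node[k] ??= {};            // create sub-document if absent
  }
  const leaf = keys[keys.length - 1];
  node[leaf] = (node[leaf] ?? 0) + amount;
  return doc;
}

const bucket = { _id: "key|2022|Q2", items: {} };
applyInc(bucket, "items.0605.a", 10); // first event of the day: no init path
applyInc(bucket, "items.0605.a", 5);  // subsequent events just accumulate
// bucket.items → { "0605": { a: 15 } }
```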

Costs

  • Query-side complexity. Range-filtering on the encoded discriminator requires $objectToArray + $reduce + date reconstruction from pieces of _id + field name. More CPU per matched document; harder to author; harder to debug.
  • No index on the encoded dimension. The dynamic sub-document is opaque to the index system. All filtering within a bucket happens in-memory inside the aggregation pipeline, after $match on _id.
  • Harder schema validation. Validating the dynamic field requires JSON Schema patternProperties or bypassing validation on that field entirely.
  • Opaque to Compass / generic tooling. UI shows "0605" as a literal field name with no semantic hint.
  • Requires bounded discriminator cardinality. Works for day-of-quarter (max 92 values). Breaks for unbounded-cardinality discriminators like full timestamps.
  • Bottleneck migration is guaranteed. The pattern is so successful at shrinking documents that the next-slowest resource (typically index-in-cache) becomes the new dominant cost. Must be applied in a measure-change-re-measure loop, not open-loop.

When to use

  • Time-bucketed counter / metric workloads with a single coarse index predicate (_id range-filter) and a dense inner aggregation-pipeline step.
  • After lighter schema-shrink techniques (field-name shortening, data-type tightening, Bucket + Computed) have already been applied and disk throughput is the measured remaining bottleneck.
  • On workloads where reads tolerate extra aggregation compute in exchange for storage efficiency, and write rates are high enough that storage-cost savings actually matter.

When not to use

  • Workloads without a bounded, low-cardinality within-bucket discriminator.
  • Workloads dominated by index-in-cache pressure rather than disk throughput — the pattern can make the wrong resource bigger (see appV6R0).
  • Applications that need per-event indexing for ad-hoc queries.
  • Hot-path reads where aggregation-pipeline CPU is already saturated.

Seen in

  • sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4 — appV6R0 (monthly bucket, "DD" field names) and appV6R1 (quarterly bucket, "MMDD" field names). appV6R1 is the canonical wiki-facing instance of this pattern successfully applied. Author notes the pattern "isn't very common to see" and attributes its construction to a senior MongoDB developer who built it on top of all previous appV5RX revisions.