PATTERN

# Dynamic-schema field-name encoding
## Intent

Reduce the document size and storage footprint of MongoDB collections that combine the concepts/bucket-pattern and concepts/computed-pattern by promoting a bounded-cardinality discriminator from a value position to a field-name position inside a sub-document, eliminating the per-element BSON overhead of storing the discriminator as a named field.
## Context

The starting point is a schema that already uses Bucket + Computed: events are grouped into time-bucket documents whose `items` field is an array of `{date, <pre-aggregated status fields>}` elements. Example (appV5R3 from MongoDB's Cost of Not Knowing, Part 2):
```js
{
  _id: <key + year + quarter>,   // coarse-grained filter
  items: [
    { date: 2022-06-05, a: 10, n: 3 },
    { date: 2022-06-16, p: 1, r: 1 },
    { date: 2022-06-27, a: 5, r: 1 },
    ...
  ]
}
```
Observed problem: document size is still dominated by the `items` array, specifically by the repeated `date` field name and its per-element BSON overhead. Load tests show that disk throughput (the rate at which the server can read compressed pages off disk) is the remaining bottleneck.
## Solution

Promote the discriminator (here, the date) from a value inside each element to the field name of a sub-document:
```js
{
  _id: <key + year + quarter>,
  items: {
    "0605": { a: 10, n: 3 },   // "0605" = June 5 (MMDD)
    "0616": { p: 1, r: 1 },
    "0627": { a: 5, r: 1 },
    ...
  }
}
```
Because the year and quarter already live in `_id`, only the within-quarter, day-granular discriminator needs encoding. The per-element `date` field-name tax is eliminated; instead, each element's single field name *is* the encoded discriminator.
Writes remain a single `updateOne` with `$inc`:

```js
const MMDD = getMMDD(event.date);   // e.g. "0605"

await collection.updateOne(
  { _id: buildId(event.key, event.date) },   // filter
  { $inc: {
      [`items.${MMDD}.a`]: event.approved,
      [`items.${MMDD}.n`]: event.noFunds,
      [`items.${MMDD}.p`]: event.pending,
      [`items.${MMDD}.r`]: event.rejected,
  } },
  { upsert: true },   // create the bucket document on first write
);
```
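`getMMDD` and `buildId` are left undefined in the write path above. A minimal sketch, assuming a `key_YYYYQn` layout for `_id` (the source only says `_id` concatenates key + year + quarter, so the exact format and function names here are mine):

```js
// Zero-padded month + day, e.g. 2022-06-05 -> "0605".
function getMMDD(date) {
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(date.getUTCDate()).padStart(2, "0");
  return mm + dd;
}

// Quarter 1-4 from a zero-based month 0-11.
function getQuarter(date) {
  return Math.floor(date.getUTCMonth() / 3) + 1;
}

// Assumed _id layout: "key_YYYYQn", e.g. "user123_2022Q2".
function buildId(key, date) {
  return `${key}_${date.getUTCFullYear()}Q${getQuarter(date)}`;
}
```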
Reads pay in the aggregation pipeline: `$objectToArray` + `$reduce` + per-element date reconstruction from the `_id`-derived year and the field-name-derived month and day.
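The read-side unpacking can be sketched client-side in plain JS, with `Object.entries` standing in for the pipeline's `$objectToArray` (the `key_YYYYQn` `_id` layout is my assumption, not spelled out by the source):

```js
// Client-side equivalent of the read path: turn the dynamic items
// sub-document back into dated elements. The year comes from _id,
// the month/day from each field name.
function unpackBucket(doc) {
  const year = Number(doc._id.match(/_(\d{4})Q\d$/)[1]); // assumed "key_YYYYQn" layout
  return Object.entries(doc.items).map(([mmdd, counters]) => ({
    date: new Date(Date.UTC(year, Number(mmdd.slice(0, 2)) - 1, Number(mmdd.slice(2)))),
    ...counters,
  }));
}

const bucket = {
  _id: "user123_2022Q2",
  items: { "0605": { a: 10, n: 3 }, "0616": { p: 1, r: 1 } },
};
const rows = unpackBucket(bucket);
// rows[0] -> { date: 2022-06-05T00:00:00.000Z, a: 10, n: 3 }
```

In the real pipeline this work runs server-side per matched document, which is exactly the extra CPU cost listed under Consequences.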
## Measured impact (MongoDB case study)
| Revision | Schema shape | Doc size | Docs | Data | Index |
|---|---|---|---|---|---|
| appV5R3 | Bucket + Computed, static items[] | 385 B | 33.4 M | 11.96 GB | 1.11 GB |
| appV6R0 | + Dynamic schema, monthly bucket | 125 B | 95.3 M | 11.1 GB | 3.13 GB |
| appV6R1 | + Dynamic schema, quarterly bucket | 264 B | 33.4 M | 8.19 GB | 1.22 GB |
appV6R1 is the pattern's canonical win: 31.4 % smaller documents and 28.1 % smaller per-event total footprint vs. appV5R3 at equivalent index pressure.
appV6R0 demonstrates the pattern's headline caveat: shrinking documents without considering index cardinality can shift the bottleneck from disk-throughput to index-in-cache, producing no net throughput gain.
## Consequences

### Benefits

- Substantial document-size shrink (67.5 % at the single-revision level, 31.4 % after accounting for bucket-width re-tuning).
- Better per-event storage density: 17.6 B of data per event vs. the 25.7 B baseline (a 31.5 % reduction).
- Amortization of per-document BSON overhead: wider buckets mean each document's fixed overhead (`_id`, length prefix, trailing null) amortizes over more events. Document size doesn't scale linearly with bucket range; the case study's 3×-wider bucket only doubled the document.
- Works natively with `$inc` upsert semantics: no special initialization path for new bucket days, because MongoDB treats missing fields as zero for `$inc`.
### Costs

- Query-side complexity: range-filtering on the encoded discriminator requires `$objectToArray` + `$reduce` + date reconstruction from pieces of `_id` plus the field name. More CPU per matched document; harder to author; harder to debug.
- No index on the encoded dimension: the dynamic sub-document is opaque to the index system. All filtering within a bucket happens in-memory inside the aggregation pipeline, after the `$match` on `_id`.
- Harder schema validation: JSON Schema `patternProperties`, or bypassing validation on the dynamic field.
- Opaque to Compass and generic tooling: the UI shows `"0605"` as a literal field name with no semantic hint.
- Requires bounded discriminator cardinality: works for day-of-quarter (max 92 values); breaks for unbounded-cardinality discriminators like full timestamps.
- Bottleneck migration is guaranteed: the pattern is so successful at shrinking documents that the next-slowest resource (typically index-in-cache) becomes the new dominant cost. It must be applied in a measure-change-re-measure loop, not open-loop.
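The validation cost can be softened with `patternProperties`, which MongoDB's `$jsonSchema` supports. A sketch of a validator for the quarterly MMDD layout (the regex, counter names, and collection name are my assumptions, not from the source):

```js
// Sketch: a $jsonSchema validator that constrains the dynamic items keys.
// MMDD regex and counter field names (a/n/p/r) are assumptions for illustration.
const validator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["_id", "items"],
    properties: {
      items: {
        bsonType: "object",
        // Keys must look like MMDD: month 01-12, day 01-31.
        patternProperties: {
          "^(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])$": {
            bsonType: "object",
            properties: {
              a: { bsonType: "int" },
              n: { bsonType: "int" },
              p: { bsonType: "int" },
              r: { bsonType: "int" },
            },
            additionalProperties: false,
          },
        },
        additionalProperties: false, // reject keys that don't match MMDD
      },
    },
  },
};

// Applied at collection creation, e.g.:
// db.createCollection("events", { validator });
```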
## When to use

- Time-bucketed counter / metric workloads with a single coarse index predicate (an `_id` range-filter) and a dense inner aggregation-pipeline step.
- After lighter schema-shrink techniques (field-name shortening, data-type tightening, Bucket + Computed) have already been applied and disk throughput is the measured remaining bottleneck.
- On workloads where reads tolerate extra aggregation compute in exchange for storage efficiency, and write rates are high enough that storage-cost savings actually matter.
## When not to use
- Workloads without a bounded, low-cardinality within-bucket discriminator.
- Workloads dominated by index-in-cache pressure rather than disk throughput — the pattern can make the wrong resource bigger (see appV6R0).
- Applications that need per-event indexing for ad-hoc queries.
- Hot-path reads where aggregation-pipeline CPU is already saturated.
## Seen in

- sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4: appV6R0 (monthly bucket, `"DD"` field names) and appV6R1 (quarterly bucket, `"MMDD"` field names). appV6R1 is the canonical wiki-facing instance of this pattern successfully applied. The author notes the pattern "isn't very common to see" and attributes its construction to a senior MongoDB developer who built it on top of all previous appV5RX revisions.