Aggregation pipeline
Definition
Aggregation pipeline is MongoDB's
declarative server-side query framework: an ordered sequence of
stages ($match, $project, $group, $addFields, $lookup,
$sort, $limit, $unwind, and many more) where each stage takes
a stream of documents and emits a transformed stream to the next.
The pipeline runs inside the server; only the final stage's output
crosses the wire to the client.
It plays the role that SQL's SELECT ... FROM ... WHERE ... GROUP BY ...
HAVING ... ORDER BY plays in the relational world, but with:
- Explicit stage ordering — you author the plan, not the query planner. Putting an early `$match` + `$limit` before an expensive `$lookup` is the user's choice.
- Document-native operators — `$unwind` (array → documents), `$objectToArray` (document → array), `$arrayToObject` (inverse), `$reduce` (folding), `$map` (per-element transform), `$filter` (subset) for working with nested structures without moving them across the wire.
- Composable re-use via `$facet` — run multiple pipelines on the same input and return all results in one response.
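The "you author the plan" point can be sketched with a hypothetical orders/customers join (collection and field names invented here, not from the case study): the filter and cap are placed before the join, so `$lookup` only ever sees a small, pre-filtered stream.

```javascript
// Sketch: explicit stage ordering on hypothetical "orders"/"customers"
// collections -- filter and cap the stream before the expensive $lookup.
const earlyFilterPipeline = [
  { $match: { status: "shipped" } },   // indexed predicate runs first
  { $limit: 100 },                     // cap documents entering the join
  {
    $lookup: {                         // join now runs on at most 100 docs
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer",
    },
  },
  { $unwind: "$customer" },            // array -> one document per element
];
```

Reversing the order (join first, filter later) would be semantically equivalent here but would pay the `$lookup` cost for every document in the collection.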
Pipeline shape (canonical event-counter case study)
From MongoDB's "The Cost of Not Knowing MongoDB", Part 3:
const pipeline = [
{ $match: docsFromKeyBetweenDate }, // bucket-level range filter on _id
{ $addFields: buildTotalsField }, // per-document: items → totals
{ $group: groupSumTotals }, // cross-document sum
{ $project: { _id: 0 } }, // shape the output
];
Each stage's role:
- `$match` — server-side filter; can use indexes. The only stage that benefits from index scans; anything after `$match` processes a stream of in-memory documents.
- `$addFields` — enrich each document in-flight. In the Part 3 dynamic-schema case, this is where `$objectToArray` + `$reduce` convert the dynamic `items` sub-document into per-day totals within the report's date range.
- `$group` — reduce many documents to fewer. `$sum`, `$avg`, `$push`, `$first`, `$last` are the common accumulators.
- `$project` — reshape the output: keep, drop, or compute fields.
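The four stage roles can be simulated in plain JavaScript on a few invented bucket documents. This is an illustration of the semantics only — the real stages execute inside mongod, and the document shapes below are hypothetical, not the case study's actual schema:

```javascript
// Plain-JS simulation of the four stage roles on hypothetical bucket docs.
const docs = [
  { _id: "u1|2023-01", items: { d1: 3, d2: 5 } },
  { _id: "u1|2023-02", items: { d1: 2 } },
  { _id: "u2|2023-01", items: { d3: 7 } },
];

// $match: keep only u1's buckets (server-side, this can use an index on _id)
const matched = docs.filter((d) => d._id.startsWith("u1|"));

// $addFields: per-document enrichment, items -> totals
// ($objectToArray turns items into [{k, v}, ...]; $reduce folds v into a sum)
const enriched = matched.map((d) => ({
  ...d,
  totals: Object.entries(d.items).reduce((sum, [, v]) => sum + v, 0),
}));

// $group: cross-document sum of the per-document totals
const grouped = enriched.reduce(
  (acc, d) => ({ ...acc, total: acc.total + d.totals }),
  { _id: null, total: 0 }
);

// $project: { _id: 0 } -- drop _id, keep the computed total
const { _id, ...projected } = grouped;
// projected is { total: 10 }
```

Note how only the first step could ever consult an index; everything after it operates on the already-materialized stream, which is the point the trade-offs section below makes.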
Trade-offs / lessons from the Part 3 case study
- Schema complexity shifts compute to the pipeline. The dynamic schema (patterns/dynamic-schema-field-name-encoding) pushes per-day information into field names; the aggregation pipeline pays for that at read time with `$objectToArray` + `$reduce` + string-to-Date reconstruction per matched document. Net effect: a compute-for-storage trade-off — bytes saved on disk and in cache are partly paid back as CPU cycles on reads.
- Index usage ends at `$match`. Once the pipeline leaves the matched document set and starts evaluating expressions, no index helps: `$addFields` / `$group` / `$project` are stream processing over the in-memory document flow. Designing for an early `$match` stage with an indexed predicate is the dominant first-pass performance heuristic.
- `$objectToArray` cost is linear in document density. A bucket with 90 items (the worst case cited in MongoDB's case study) pays 90 conversions plus 90 reduce iterations per matched document. High-density buckets amortize fixed costs better, but the ceiling on per-document work scales with density.
- Sort / group memory limit. `$sort` and `$group` stages are subject to MongoDB's 100 MB per-stage memory limit; exceeding it requires `allowDiskUse: true`, which spills to disk and typically collapses pipeline throughput.
- `$facet` for multi-range queries. MongoDB's case study runs "five aggregation pipelines, one for each date interval" — each an independent client call. A `$facet` stage could run all five date-range variants in one pipeline, trading per-query parallelism for fewer client round-trips. Not exercised in the article, but a natural follow-up.
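The `$facet` follow-up could be sketched as below. The interval bounds and field names are invented for illustration; the point is that one stage object carries all five sub-pipelines:

```javascript
// Hypothetical sketch: collapse five per-interval client calls into one
// $facet stage. Interval bounds and field names are invented, not from
// the case study.
const intervals = [
  ["2022-01-01", "2022-04-01"],
  ["2022-04-01", "2022-07-01"],
  ["2022-07-01", "2022-10-01"],
  ["2022-10-01", "2023-01-01"],
  ["2023-01-01", "2023-04-01"],
];

// One sub-pipeline per interval; $facet runs them all against the same
// input stream and returns { range0: [...], ..., range4: [...] } in a
// single response.
const facetStage = {
  $facet: Object.fromEntries(
    intervals.map(([from, to], i) => [
      `range${i}`,
      [
        { $match: { date: { $gte: new Date(from), $lt: new Date(to) } } },
        { $group: { _id: null, total: { $sum: "$total" } } },
      ],
    ])
  ),
};
```

One caveat worth knowing: `$match` stages inside `$facet` sub-pipelines cannot use indexes, so a shared indexed `$match` covering the union of all five ranges should still precede the `$facet` stage.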
Seen in
- sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4 — every Get Reports read path in the `appV4+` / `appV5X` / `appV6X` family is a 4-stage aggregation pipeline. Part 3's novelty is the `$addFields` stage doing `$objectToArray` + `$reduce` over the dynamic-schema sub-document, reconstructing dates from `_id`-derived year/month plus field-name-derived day-within-window. The aggregation cost is the hidden price of the dynamic schema's storage wins.