
CONCEPT

Aggregation pipeline

Definition

Aggregation pipeline is MongoDB's declarative server-side query framework: an ordered sequence of stages ($match, $project, $group, $addFields, $lookup, $sort, $limit, $unwind, and many more) where each stage takes a stream of documents and emits a transformed stream to the next. The pipeline runs inside the server; only the final stage's output crosses the wire to the client.

It plays the role SQL's SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY plays in the relational world, but with:

  • Explicit stage ordering — you author the execution plan rather than delegating it to a query optimizer. Placing an early $match + $limit before an expensive $lookup is the user's choice, not the planner's.
  • Document-native operators — $unwind (array → documents), $objectToArray (document → array), $arrayToObject (the inverse), $reduce (folding), $map (per-element transform), $filter (subset) for working with nested structures without moving them across the wire.
  • Composable re-use via $facet — run multiple pipelines on the same input, return all results in one response.
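To make the document-native operators concrete, here is a minimal illustrative pipeline fragment. The field names (tags, attrs) and the threshold are hypothetical, not from the case study; the point is the array/document conversions chaining together.

```javascript
// Illustrative only: tags, attrs, and the > 0 threshold are hypothetical.
const explodeAndPivot = [
  // array → one document per element
  { $unwind: "$tags" },
  // sub-document → [{ k, v }, …] so its keys become data
  { $addFields: { attrsKV: { $objectToArray: "$attrs" } } },
  // keep only the entries whose value is positive
  { $addFields: { bigAttrs: {
      $filter: { input: "$attrsKV", as: "kv", cond: { $gt: ["$$kv.v", 0] } },
  } } },
  // fold the surviving values into a single sum
  { $addFields: { attrTotal: {
      $reduce: { input: "$bigAttrs", initialValue: 0,
                 in: { $add: ["$$value", "$$this.v"] } },
  } } },
];
```

All four stages run server-side; only the enriched documents (or whatever a later $project keeps) cross the wire.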

Pipeline shape (canonical event-counter case study)

From the MongoDB Cost of Not Knowing Part 3:

const pipeline = [
  { $match: docsFromKeyBetweenDate },  // bucket-level range filter on _id
  { $addFields: buildTotalsField },     // per-document: items → totals
  { $group: groupSumTotals },           // cross-document sum
  { $project: { _id: 0 } },             // shape the output
];

Each stage's role:

  1. $match — server-side filter. A leading $match is the main stage that can use index scans (a leading $sort can too); every stage after it processes a stream of in-memory documents.
  2. $addFields — enrich each document in-flight. In the Part 3 dynamic-schema case, this is where $objectToArray + $reduce convert the dynamic items sub-document into per-day totals within the report's date range.
  3. $group — reduce multiple documents to fewer documents. $sum, $avg, $push, $first, $last are the common accumulators.
  4. $project — reshape output. Keep / drop / compute fields.
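The article's buildTotalsField expression is not reproduced above. As a sketch of what stage 2 plausibly looks like, the block below assumes items is a sub-document keyed by two-digit day ("01"…"31") holding { a, n } counters, and it omits the date-window filtering the article performs; all names here are illustrative.

```javascript
// Sketch, not the article's exact code. Assumes items looks like
// { "05": { a: 3, n: 7 }, "12": { a: 1, n: 2 }, … } per bucket document.
const buildTotalsField = {
  totals: {
    $reduce: {
      // { "05": {a,n}, … } → [{ k: "05", v: {a,n} }, …]
      input: { $objectToArray: "$items" },
      initialValue: { a: 0, n: 0 },
      in: {
        // accumulate both counters; $ifNull guards sparse days
        a: { $add: ["$$value.a", { $ifNull: ["$$this.v.a", 0] }] },
        n: { $add: ["$$value.n", { $ifNull: ["$$this.v.n", 0] }] },
      },
    },
  },
};

// Dropped into the canonical pipeline as:
//   { $addFields: buildTotalsField }
```

Per matched bucket this does one key-value conversion plus one reduce iteration per day present, which is exactly the linear-in-density cost discussed below.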

Trade-offs / lessons from Part 3 case study

  • Schema complexity shifts compute to the pipeline. The dynamic schema (patterns/dynamic-schema-field-name-encoding) pushes per-day information into field names; the aggregation pipeline pays for that at read time with $objectToArray + $reduce + string-to-Date reconstruction per matched document. Net: compute-for-storage trade-off. Bytes saved on disk / cache are partly paid back as CPU cycles on reads.
  • Index usage ends at $match. Once the pipeline leaves the matched document set and starts running expressions, no index helps. $addFields / $group / $project are stream-processing against the in-memory document flow. Designing for an early $match stage with an indexed predicate is the dominant first-pass perf heuristic.
  • $objectToArray cost is linear in document density. A bucket with 90 items (the worst case cited in MongoDB's case study) pays 90 conversions plus 90 reduce iterations per matched document. High-density buckets amortize fixed per-document costs better, but worst-case per-document work scales with density.
  • Sort / group memory limit. A $sort or $group stage is limited to 100 MB of memory by default; exceeding it aborts the query unless allowDiskUse: true is set, which spills to disk and typically collapses pipeline throughput.
  • $facet for multi-range queries. MongoDB's case study runs "five aggregation pipelines, one for each date interval" — each an independent client call. A $facet stage could run all five date-range variants in one pipeline, trading server-side parallelism across queries for fewer client round trips. Not exercised in the article, but a natural follow-up.
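The $facet follow-up could be sketched like this. Every identifier below (coarseRangeFilter, range1/range5, groupSumTotals, the facet keys) is a hypothetical placeholder standing in for the article's real predicates; only two of the five intervals are shown.

```javascript
// Placeholders: the real predicates come from the report's date math.
const coarseRangeFilter = { _id: { $gte: "u_2022", $lt: "u_2023" } }; // broad indexed filter covering all windows
const range1 = {};  // one-year window predicate (elided)
const range5 = {};  // one-week window predicate (elided)
const groupSumTotals = { _id: null, a: { $sum: "$totals.a" }, n: { $sum: "$totals.n" } };

const facetPipeline = [
  // one indexed $match feeds every sub-pipeline
  { $match: coarseRangeFilter },
  { $facet: {
      // each key is an independent sub-pipeline over the same input stream;
      // the remaining three intervals would follow the same shape
      oneYear: [{ $match: range1 }, { $group: groupSumTotals }],
      oneWeek: [{ $match: range5 }, { $group: groupSumTotals }],
  } },
];
```

The response is a single document with one field per facet, so five client calls become one; note that $facet sub-pipelines cannot use indexes themselves, which is why the coarse indexed $match comes first.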

Seen in

  • sources/2025-10-09-mongodb-cost-of-not-knowing-mongodb-part-3-appv6r0-to-appv6r4 — every Get Reports read path in the appV4+ / appV5X / appV6X family is a 4-stage aggregation pipeline. Part 3's novelty is the $addFields stage doing $objectToArray + $reduce over the dynamic-schema sub-document, reconstructing dates from _id-derived year/month + field-name-derived day-within-window. The aggregation cost is the hidden price of the dynamic schema's storage wins.