Aggregation Pipeline and Schema Patterns

TL;DR

The aggregation pipeline is MongoDB's answer to SQL GROUP BY, and the schema patterns are recipes for modeling data that doesn't fit neatly into a single document -- learn five patterns and you can handle 90% of system design scenarios.

What It Is

Aggregation Pipeline

The aggregation pipeline is MongoDB's way of processing documents through a sequence of stages. Each stage transforms the data and passes the result to the next stage. Think of it as Unix pipes for your database.

db.orders.aggregate([
  { $match: { status: "completed" } },          // Stage 1: filter
  { $group: { _id: "$product_id",               // Stage 2: group
              total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },                     // Stage 3: sort
  { $limit: 10 }                                // Stage 4: top 10
])

This is equivalent to:

SELECT product_id, SUM(amount) AS total
FROM orders
WHERE status = 'completed'
GROUP BY product_id
ORDER BY total DESC
LIMIT 10;

The SQL is shorter. More readable. Has a cost-based optimizer. If this is all you need, PostgreSQL is the better tool. But MongoDB's pipeline handles nested documents and arrays in ways SQL can't, and that's where it earns its keep.

Core Pipeline Stages

$match -- Filter Early

{ $match: { created_at: { $gte: ISODate("2024-01-01") } } }

Always put $match first. It reduces the number of documents flowing through later stages. If $match is the first stage and uses an indexed field, MongoDB uses the index. Put it after $group and you're scanning everything.

$group -- Aggregate

{
  $group: {
    _id: "$category",
    totalSales: { $sum: "$amount" },
    avgPrice: { $avg: "$price" },
    count: { $sum: 1 },
    products: { $push: "$name" }
  }
}

Accumulators: $sum, $avg, $min, $max, $push (collect into array), $addToSet (unique values), $first, $last.

_id: null groups all documents into one -- useful for global aggregates.

$project -- Reshape

{
  $project: {
    fullName: { $concat: ["$firstName", " ", "$lastName"] },
    year: { $year: "$created_at" },
    discountedPrice: { $multiply: ["$price", 0.9] },
    _id: 0  // exclude _id
  }
}

Use $project to add computed fields, rename fields, or exclude fields. Similar to SELECT in SQL.

$sort -- Order Results

{ $sort: { total: -1, name: 1 } }  // descending total, ascending name

In-memory sort limit: 100MB. Exceed it and the pipeline fails unless you set allowDiskUse: true. Always index fields you sort on when $sort is near the beginning of the pipeline.

$unwind -- Flatten Arrays

// Before: { tags: ["redis", "mongodb", "kafka"] }
{ $unwind: "$tags" }
// After: 3 documents, each with one tag

$unwind explodes an array field into one document per element. Necessary before grouping by array elements. But be careful -- an array with 10K elements becomes 10K documents. Memory explodes.

Use preserveNullAndEmptyArrays: true to keep documents where the array is missing or empty.

$lookup -- Left Outer Join

{
  $lookup: {
    from: "authors",
    localField: "author_id",
    foreignField: "_id",
    as: "author_info"
  }
}

Result is always an array (even for one-to-one). Use $unwind after $lookup to flatten it.

Performance warning: $lookup is slow on large collections. No hash join. No merge join. It's a correlated subquery executed for each input document. If you're writing $lookup in hot paths, reconsider your schema -- embed the data instead.

{
  $facet: {
    priceRange: [
      { $bucket: { groupBy: "$price", boundaries: [0, 50, 100, 500] } }
    ],
    topCategories: [
      { $group: { _id: "$category", count: { $sum: 1 } } },
      { $sort: { count: -1 } },
      { $limit: 5 }
    ],
    totalCount: [
      { $count: "total" }
    ]
  }
}

$facet runs multiple sub-pipelines on the same input. Each sub-pipeline gets the full set of input documents. Great for building dashboard responses in one query instead of three.

Pipeline Optimization

MongoDB's query planner applies some optimizations automatically.

Coalescence

Adjacent stages of the same type get merged:

$match + $match -> single $match with $and
$sort + $sort -> last $sort wins
$limit + $limit -> smallest $limit wins

Reordering

The planner moves $match before $project when possible. If $project doesn't affect the fields used in $match, the filter runs first to reduce documents early.

Index Usage

Only the first $match and $sort in the pipeline can use indexes. Everything after the first $group or $unwind runs without index support. This is why stage ordering matters so much.

Spicy take: MongoDB's query optimizer is a toy compared to PostgreSQL's. No cost-based optimization. No join reordering. No parallel query execution on a single node. If your aggregation is slow, the fix is usually rewriting the pipeline, not hoping the optimizer figures it out.

Schema Patterns for System Design Interviews

These patterns come from MongoDB's official "Building with Patterns" series. They solve specific data modeling problems that come up repeatedly.

1. Embedded Pattern (One-to-Few)

Embed the related data directly in the parent document.

// User with embedded addresses
{
  _id: ObjectId("..."),
  name: "Alice",
  addresses: [
    { type: "home", city: "Seattle", zip: "98101" },
    { type: "work", city: "Redmond", zip: "98052" }
  ]
}

When: Related data is always accessed with the parent. Bounded size (< 100 items). Updated infrequently.

When not: Related data is accessed independently. Array grows without bound. Updates to embedded docs are frequent (because updating one element in a large array is expensive).

2. Reference Pattern (One-to-Many)

Store the related data in a separate collection with a foreign key.

// Order document
{ _id: ObjectId("..."), user_id: ObjectId("user123"), total: 99.99 }

// Separate reviews collection
{ _id: ObjectId("..."), order_id: ObjectId("order456"), rating: 5, text: "Great" }

When: Related data is large or unbounded. Related data is queried independently. You need to update related docs without touching the parent.

Trade-off: Requires two queries or a $lookup to get both parent and related data. But avoids the 16MB document limit.

3. Bucket Pattern (Time-Series)

Group time-series data into fixed-size buckets instead of one document per event.

// Instead of one doc per temperature reading:
{
  sensor_id: "sensor-42",
  bucket_start: ISODate("2024-07-01T10:00:00Z"),
  bucket_end: ISODate("2024-07-01T11:00:00Z"),
  count: 60,
  readings: [
    { timestamp: ISODate("2024-07-01T10:00:00Z"), temp: 22.5 },
    { timestamp: ISODate("2024-07-01T10:01:00Z"), temp: 22.6 },
    // ... 58 more readings
  ],
  summary: {
    avg: 22.8,
    min: 22.1,
    max: 23.4
  }
}

Why: One document per reading = millions of tiny documents. Poor index performance. The bucket pattern reduces document count by 60x (one per hour vs one per minute). Pre-computed summaries avoid scanning all readings for aggregate queries.

InfluxDB and TimescaleDB exist. If your system is purely time-series, use a purpose-built time-series database. MongoDB's bucket pattern is for when you have some time-series data alongside other document data.

4. Outlier Pattern (Handle Large Arrays)

Most documents have a small array, but a few outliers have huge arrays. Don't let the outliers dictate the schema.

// Normal book: a few hundred sales per month
{
  _id: "book-123",
  title: "Advanced MongoDB",
  monthly_sales: [
    { month: "2024-01", count: 150 },
    { month: "2024-02", count: 200 }
  ],
  has_overflow: false
}

// Bestseller: millions of sales, overflow to separate collection
{
  _id: "book-456",
  title: "Harry Potter",
  monthly_sales: [
    { month: "2024-01", count: 500000 },
    { month: "2024-02", count: 450000 }
  ],
  has_overflow: true  // signal to app: check overflow collection
}

When: 99% of documents fit in one document. 1% don't. The has_overflow flag tells the application to check a secondary collection for the full data. This keeps the common case fast.

Amazon uses a variant of this. Most products have a few reviews. Some viral products have millions. Different storage strategies for each.

5. Subset Pattern (Cache Hot Data)

Store frequently accessed fields in the main document. Put rarely accessed fields in a separate collection.

// Product document (hot data -- loaded on every page view)
{
  _id: "product-789",
  name: "Mechanical Keyboard",
  price: 149.99,
  thumbnail: "https://cdn.example.com/thumb/789.jpg",
  rating: 4.5,
  review_count: 342
}

// Product details (cold data -- loaded on "see more" click)
{
  product_id: "product-789",
  description: "Full mechanical keyboard with Cherry MX switches...",
  specifications: { ... },
  full_reviews: [ ... ],
  warranty_info: "..."
}

When: Documents have a mix of hot and cold fields. Loading everything on every read wastes memory and bandwidth. Split the document so the hot path reads a small, fast document.

This is application-level caching in the database layer. Some teams skip this pattern and use Redis for the hot data instead. Both work. The subset pattern avoids managing a separate cache.

Change Streams

Change streams let you watch for real-time data changes. Think of them as a database trigger that sends events to your application.

const changeStream = db.collection("orders").watch([
  { $match: { operationType: "insert" } }
]);

changeStream.on("change", (event) => {
  console.log("New order:", event.fullDocument);
  // Send notification, update search index, trigger workflow
});

How It Works

Change streams are built on the oplog. They tail the oplog and filter for changes to the watched collection. This means they work on replica sets and sharded clusters (but not standalone instances).

Use Cases

Event-driven architecture: New order -> send confirmation email, update inventory, notify warehouse. Replace polling with push.
Cache invalidation: Document changes -> invalidate the corresponding Redis key.
Search sync: Document changes -> update Elasticsearch index. This is how many teams keep MongoDB and Elasticsearch in sync.
Change Data Capture (CDC): Stream database changes to Kafka for downstream processing.

Resume Tokens

Each change event includes a resume token. Store it. If your application crashes, resume from that token to avoid missing events or reprocessing everything from the beginning.

Time-Series Collections

MongoDB 5.0 introduced native time-series collections. They're optimized for IoT and metrics data.

db.createCollection("sensor_data", {
  timeseries: {
    timeField: "timestamp",
    metaField: "sensor_id",
    granularity: "minutes"
  },
  expireAfterSeconds: 2592000  // auto-delete after 30 days
})

What's different: Time-series collections automatically bucket data by time and metadata field. They use columnar compression for time-series data. Queries on time ranges are significantly faster than regular collections.

When to use vs Bucket Pattern: If you're on MongoDB 5.0+ and your data is purely time-series, use native time-series collections. They do the bucketing automatically. Use the manual bucket pattern on older MongoDB versions or when you need more control over bucket structure.

MongoDB vs DynamoDB vs Cassandra

This comparison comes up in every system design interview that mentions NoSQL.

Factor	MongoDB	DynamoDB	Cassandra
Data model	Document (JSON/BSON)	Key-value + document	Wide-column
Query flexibility	Rich queries, aggregation pipeline	Primary key + GSI only	CQL (SQL-like, limited JOINs)
Consistency	Tunable (read/write concern)	Tunable (strong or eventual)	Tunable (ONE to ALL)
Scaling	Sharding (manual or Atlas auto)	Automatic (fully managed)	Automatic (consistent hashing)
Write throughput	Good (single primary per shard)	Excellent (distributed)	Excellent (leaderless)
Operations	Moderate (Atlas helps)	Zero (fully managed)	High (self-managed)
Cost model	Capacity-based (Atlas)	Per-request or provisioned	Hardware/nodes
Multi-region	Atlas Global Clusters	Global Tables	Multi-DC replication
Transactions	Multi-document ACID (since 4.0)	Per-item or TransactWriteItems (25 items max)	Lightweight transactions (Paxos)
Best for	Flexible schemas, rich queries	Predictable access patterns, serverless	Write-heavy, time-series at scale

Decision Framework

Choose MongoDB when: You need rich queries on document-shaped data, your schema evolves frequently, you want aggregation pipelines, and your team knows JavaScript/Python.

Choose DynamoDB when: You have predictable access patterns (known partition key + sort key), you want zero operational overhead, you're already in AWS, and your queries are simple lookups.

Choose Cassandra when: You need massive write throughput across multiple data centers, your data is time-series or event-log shaped, and you have an ops team that can manage Cassandra clusters. Discord uses Cassandra for message storage -- billions of writes per day across data centers.

Patterns for System Design Interviews

Pattern: Real-Time Dashboard

"Use the aggregation pipeline with $facet to compute multiple metrics in a single query. Pre-aggregate hourly/daily summaries using the bucket pattern. Change streams push updates to the dashboard WebSocket instead of polling."

Pattern: Event Sourcing with MongoDB

"Write events to an append-only collection. Use change streams to project current state into a read-optimized collection. Resume tokens ensure we never miss an event. This gives us a full audit log plus fast reads."

Pattern: Multi-Source Search Sync

"MongoDB is the source of truth. Change streams feed updates to a Kafka topic. A Kafka consumer writes to Elasticsearch for full-text search. If Elasticsearch goes down, replay from the Kafka topic."

NoSQL Comparison

Trade-Offs Table

Factor	Aggregation Pipeline	SQL (PostgreSQL)
Syntax	JSON stages (verbose)	Declarative SQL (concise)
Optimizer	Rule-based, limited	Cost-based, sophisticated
Nested data	Natural ($unwind, $project)	Requires JSONB functions
JOINs	$lookup (slow, limited)	Hash/Merge/Nested loop joins
Window functions	$setWindowFields (since 5.0)	Full support since forever
Index usage	First stages only	Throughout the query plan
Parallel execution	Single-threaded per pipeline	Parallel workers per query
Disk usage	allowDiskUse flag needed	Automatic temp files
Materialized views	$merge/$out to collection	Native materialized views

Interview Gotchas

"How do you optimize a slow aggregation?"

Put $match first to reduce input documents.
Create indexes on fields used in the first $match and $sort.
Use $project early to drop fields you don't need.
Set allowDiskUse: true for large sorts.
Consider pre-computing aggregates with the bucket pattern.

"When would you use $lookup?"

Almost never in hot paths. Use $lookup for analytics pipelines, admin dashboards, or one-time data migrations. For real-time queries, embed the data or accept a second query. Two fast queries beat one slow $lookup.

"How do you handle unbounded arrays?"

Use the Outlier Pattern (overflow to a separate collection) or the Bucket Pattern (fixed-size chunks). Never let an array grow without a cap. Set a has_overflow flag or a bucket_count to signal when the application needs to look elsewhere.

"What's the difference between $out and $merge?"

$out replaces the target collection entirely. $merge upserts into the target. Use $merge for incremental updates to materialized views. Use $out only when you want to rebuild from scratch.

"When would you use MongoDB's time-series collections vs InfluxDB?"

If your system already uses MongoDB and time-series is a secondary concern, use native time-series collections. If time-series is the primary workload, use a purpose-built database like InfluxDB or TimescaleDB. They have better compression, retention policies, and query languages for time-series data.

Summary

The aggregation pipeline is powerful but verbose. It handles nested documents and arrays better than SQL, but SQL's optimizer runs circles around it for joins and complex analytics. Use the pipeline for document-shaped aggregation. Use SQL for relational analytics.

Schema patterns are the real takeaway. Five patterns -- Embedded, Reference, Bucket, Outlier, Subset -- cover 90% of MongoDB data modeling decisions. Memorize them. When an interviewer says "how would you model X in MongoDB," reach for the right pattern instead of improvising. That's what separates a junior answer from a senior one.