Skip to content

Denormalization

TL;DR

Denormalization intentionally duplicates data to make reads faster. You sacrifice write simplicity and risk inconsistency, but your hot read paths avoid expensive joins. The rule: normalize first, denormalize only when a specific query demands it.

The Tension

In the last lesson, we normalized everything. Each fact stored once. No redundancy. Clean and correct.

But then you write the query for your most critical page — the home feed — and it joins 5 tables, aggregates likes, and sorts by timestamp. It takes 200ms. Your product manager wants it under 50ms. Your users want it under 20ms.

This is the fundamental tension in data modelling: normalization optimizes for writes and consistency. Denormalization optimizes for reads and speed. Every real system lives somewhere on this spectrum.

What Denormalization Actually Means

Denormalization means deliberately storing redundant data to avoid expensive operations at read time.

Example 1: Embedding the Author's Name

Normalized: To display a post with its author's name, you join two tables:

SELECT posts.content, users.username
FROM posts
JOIN users ON posts.user_id = users.id
WHERE posts.id = 42;

Denormalized: Store the author's name directly on the post:

-- No join needed
SELECT content, author_username
FROM posts
WHERE id = 42;

The author_username column is redundant — it duplicates data from the users table. But it eliminates a join on every post read. For a feed showing 50 posts, that's 50 joins you just avoided.

Example 2: Pre-computed Counts

Normalized: Count likes by querying the likes table:

SELECT COUNT(*) FROM likes WHERE post_id = 42;

Denormalized: Store the count directly on the post:

SELECT like_count FROM posts WHERE id = 42;

Every time someone likes or unlikes a post, you increment or decrement like_count. The count is always available without scanning the likes table.

Instagram does this. When a post by a celebrity has 3 million likes, you don't want to COUNT(*) three million rows every time someone views it.

Example 3: Materialized Feeds

Normalized: Compute the home feed at read time by joining posts, follows, and sorting:

SELECT posts.* FROM posts
JOIN follows ON posts.user_id = follows.following_id
WHERE follows.follower_id = ?
ORDER BY posts.created_at DESC
LIMIT 50;

Denormalized: Pre-compute the feed into a feed_items table:

feed_items
┌─────────┬─────────┬────────────┬─────────┐
│ user_id │ post_id │ created_at │ author  │
├─────────┼─────────┼────────────┼─────────┤
│    1    │   500   │ 2024-03-15 │ bob     │
│    1    │   499   │ 2024-03-15 │ carol   │
│    1    │   497   │ 2024-03-14 │ bob     │
└─────────┴─────────┴────────────┴─────────┘

Reading the feed is now a simple range scan on one table. The complexity shifts to write time — when someone posts, you insert a feed_item for each of their followers. This is the fan-out on write pattern that Twitter uses (more on this in Chapter 7).

The Cost of Denormalization

Every denormalization creates two problems:

1. Consistency Becomes Your Responsibility

When Alice changes her username, you now need to update: - The users table (source of truth) - The author_username on every post she's ever written - The author field in every feed_item - Any other table that duplicated her name

Miss one, and your data is inconsistent. The database won't help you here — there's no foreign key constraint on denormalized copies.

Strategies for handling this: - Synchronous updates: Update all copies in the same transaction. Correct but slow. - Background jobs: Update the source of truth immediately, then queue a job to update copies. Faster but temporarily inconsistent. - Accept staleness: For some data (like a cached feed), it's fine if it's a few seconds behind. A feed showing "alice" instead of "alice_new_name" for 30 seconds won't break anything.

2. Writes Get More Complex

Every denormalized field is another thing to update when the source data changes. A like_count on posts means every like and unlike must atomically increment or decrement the counter:

-- When someone likes a post:
BEGIN;
INSERT INTO likes (user_id, post_id) VALUES (?, ?);
UPDATE posts SET like_count = like_count + 1 WHERE id = ?;
COMMIT;

-- When someone unlikes:
BEGIN;
DELETE FROM likes WHERE user_id = ? AND post_id = ?;
UPDATE posts SET like_count = like_count - 1 WHERE id = ?;
COMMIT;

If the application crashes between the INSERT and UPDATE, the count drifts. You might need periodic reconciliation jobs to fix drift over time.

When to Denormalize

Denormalization is a performance optimization. Like all optimizations, don't do it until you have a reason.

Denormalize when:

Signal Example
A hot read path requires joins that add latency Home feed joining posts + follows + users
An aggregation runs on every page load COUNT(*) for likes, comments, followers
The same join appears in many queries Author name shown on posts, comments, notifications
Read/write ratio is heavily skewed toward reads Social media (1000:1 reads to writes on popular content)

Don't denormalize when:

Signal Example
The data changes frequently User email (changes create mass updates across tables)
Consistency is critical Financial transactions, inventory counts
The read pattern is simple Direct lookups by primary key (already fast with an index)
A cache would solve it If the data fits in Redis, cache the join result instead of denormalizing

Cache vs Denormalize — A Key Decision

Before denormalizing, ask: would a cache solve this problem?

A cache (Redis, Memcached) gives you the same read speed benefit without polluting your schema:

Cache-aside pattern: client checks cache first, on miss queries DB and stores result

Denormalization Cache
Where In the database itself Separate system (Redis)
Consistency Must update on every write Must invalidate on every write
Failure mode Wrong data silently served Cache miss → slower but correct
Complexity Schema changes, migration needed Application code only
Best for Stable, rarely-changing denormalized data Frequently-changing or expensive-to-compute data

Further reading: For a deeper dive into caching patterns (cache-aside, write-through, write-behind, cache stampede, thundering herd), see the Caching and Redis articles.

Interview Tip

When proposing denormalization in an interview, always acknowledge the trade-off: "I'll store like_count directly on the post to avoid counting millions of rows on each read. The trade-off is that likes and unlikes now need to atomically update both the likes table and the counter. I'd also run a periodic reconciliation job to fix any drift." That shows mature engineering judgment.

A Practical Framework

Here's how to think about denormalization in a system design interview:

Denormalization decision framework: start normalized, check cache, evaluate joins, choose technique, explain consistency

Quick Recap

Technique What It Does Good For Cost
Embedded fields Copy a field from related table Avoiding joins on hot paths Updates must propagate to all copies
Pre-computed counts Store aggregates directly Avoiding COUNT(*) at scale Must keep counter in sync
Materialized feeds Pre-build query results Complex multi-table reads Write amplification
Caching Store computed results in Redis Any hot read path Cache invalidation complexity

The goal isn't to avoid denormalization — it's to denormalize intentionally, for specific queries, with a clear understanding of the trade-offs.