Design a News Aggregation Platform

TL;DR

The defining problem of a news aggregator is NOT ingestion, caching, or pagination. It is content deduplication. The same AP wire story is published verbatim by 100+ outlets. Without dedup, 80% of a user's feed is identical content. SimHash (64-bit fingerprints where Hamming distance correlates with text similarity) solves near-duplicate detection. Story clustering (grouping different articles about the same event) is the next layer. Everything else -- RSS polling, cursor-based pagination, Redis sorted sets -- is standard infrastructure. If you spend your interview time on the standard parts and ignore dedup, you have designed a useless product.

Also: never store full article text. Google News only stores snippets and links to publisher sites. Storing full text violates copyright law in most jurisdictions (see AFP v. Google, Spain's "Google Tax," EU Copyright Directive Article 15). This is not a technical constraint -- it is a legal one that simplifies your storage model dramatically.

The System

Think Google News. The platform crawls or receives articles from 50,000 news publishers worldwide. It presents users with a personalized feed of news articles organized by topic (Sports, Politics, Tech, etc.) and region (US, EU, APAC). Articles about the same event are clustered together: "45 articles about the earthquake" appears as a single story with a "Full Coverage" link, not 45 separate items.

Google News launched in 2002 and processes 1-2 million new articles per day from 50,000+ sources in 35+ languages. It has 280M+ monthly active users. The platform does NOT host content -- it shows headlines, 2-3 sentence snippets, thumbnail images, and links. Every click goes to the publisher's website.

This is a fundamentally different system from a social media feed. There is no follow graph. There is no fan-out-on-write. The content is organized by topic and region, not by social relationships. Candidates who apply the Twitter/Facebook fan-out architecture here are solving the wrong problem.

Requirements

Functional Requirements

Requirement	Details
Article ingestion	Ingest articles from 50,000 publishers via RSS/Atom feeds, WebSub push, and web crawling
Feed by region/topic	Browse articles by region (US, EU, APAC) and topic (Sports, Politics, Tech)
Article deduplication	Detect and collapse near-duplicate articles (same AP wire story from 100+ outlets)
Story clustering	Group different articles about the same event into story clusters
Personalization	Customize feed based on user interests and reading history
Infinite scroll	Cursor-based pagination for browsing articles
Regional content	Users see news relevant to their country/region. A user in Japan sees Japanese sources prioritized; a user in the US sees US sources.
Breaking news	Surface breaking news within 30-60 seconds of first publication

Non-Functional Requirements

Requirement	Target
Feed load latency (p50)	< 200 ms
Article freshness	New articles appear in feeds within 2-5 minutes of publication
Breaking news freshness	Within 30-60 seconds
DAU	100 million
Availability	99.99% (users rely on this for time-sensitive decisions)
Read:write ratio	~1000:1 (12 articles/sec ingested vs 12,000 QPS served)

Back-of-Envelope Math

Publishers:              50,000
Articles per publisher per day: 20 average
New articles/day:        1,000,000
New articles/sec:        ~12/sec (trivial write volume)

DAU:                     100 million
Feed loads per user/day: 10 (homepage + category browsing)
Feed requests/day:       1 billion
Feed QPS:                ~12,000 average, ~36,000 peak (3x)

Article metadata size:   ~2 KB (title, URL, snippet, publisher, timestamp, category,
                         region, thumbnail URL -- NO full text)
30-day article storage:  1M/day * 30 * 2 KB = 60 GB (fits on single DB with indexes)

Polling load:
  50,000 feeds, tiered:
    500 major outlets:    every 5 min  = 6,000 polls/hr
    5,000 medium:         every 15 min = 20,000 polls/hr
    20,000 small:         every 30 min = 40,000 polls/hr
    24,500 niche/weekly:  every 2 hrs  = 12,250 polls/hr
    Total:                78,250 polls/hr = ~22 polls/sec

With conditional GET: ~70% of polls return 304 Not Modified (no body transferred)
Actual new content fetches: ~7/sec
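The polling estimate above can be reproduced in a few lines. This is only an arithmetic check: the tier sizes and intervals are the ones listed, and the 70% share of 304 responses is the assumed figure from the text.

```python
# Tier name -> (number of feeds, poll interval in minutes), from the estimate above.
TIERS = {
    "major": (500, 5),
    "medium": (5_000, 15),
    "small": (20_000, 30),
    "niche": (24_500, 120),
}

def polls_per_hour(tiers):
    """Each feed is polled 60 / interval_minutes times per hour."""
    return sum(count * 60 / interval for count, interval in tiers.values())

total_per_hour = polls_per_hour(TIERS)
total_per_sec = total_per_hour / 3600

NOT_MODIFIED_RATE = 0.70  # assumed share of polls answered with 304
content_fetches_per_sec = total_per_sec * (1 - NOT_MODIFIED_RATE)

print(f"{total_per_hour:,.0f} polls/hr, {total_per_sec:.1f} polls/sec, "
      f"{content_fetches_per_sec:.1f} full fetches/sec")
```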

The critical realization: The write volume is trivially small (12 articles/sec). The read volume is moderate (36K QPS peak). The complexity is not in throughput -- it is in content processing (dedup, clustering, categorization).

Naive Design

Poll all 50,000 RSS feeds every 15 minutes.
Parse each feed, extract new articles.
Store article metadata in a database.
Serve feeds by querying the database: SELECT * FROM articles WHERE region = ? AND topic = ? ORDER BY published_at DESC LIMIT 20.

What goes wrong:

CNN publishes the same story as a top-level article AND in 3 different section feeds. You store it 4 times.
AP publishes a wire story. 100 newspapers copy it verbatim. You store 101 copies.
A user's feed shows the same earthquake story 47 times from 47 outlets.
Small publishers that post 3 articles/week are polled every 15 minutes -- 96 wasted requests per day per publisher.
Breaking news does not appear for up to 15 minutes (the polling interval).

Where It Breaks

Problem 1: No Deduplication (The Defining Problem)

Without dedup, ~40-60% of ingested articles are duplicates or near-duplicates. Wire services (AP, Reuters, AFP) produce content that is republished by hundreds of outlets. Press releases appear identically across dozens of sites. The feed is unusable without dedup.

Problem 2: No Story Clustering

Even after dedup, 15 unique articles about the same earthquake are still 15 separate feed items. Users want to see one story -- "Magnitude 7.2 Earthquake in Japan" -- with a "Full Coverage" link showing all perspectives.

Problem 3: Uniform Polling is Wasteful

Polling a weekly blogger every 15 minutes wastes 671 requests per week. Meanwhile, CNN publishes 50+ articles per day and should be polled every 5 minutes.

Problem 4: Full Text Storage is a Legal Minefield

Google News was sued by AFP (Agence France-Presse) for displaying headlines and snippets. Spain passed a "Google Tax" requiring payment for linking to news articles -- Google shut down Google News Spain entirely in 2014. The EU Copyright Directive (Article 15) forced Google to negotiate licensing deals worth over $1 billion. Storing and displaying full article text without publisher permission is copyright infringement in most jurisdictions.

Real Design

Figure: News Aggregation High-Level Design

Architecture

  ┌────────────────────────────────────────────┐
  │              Ingestion Layer               │
  │                                            │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
  │  │ RSS/Atom │  │ WebSub   │  │ Web      │  │
  │  │ Poller   │  │ Receiver │  │ Crawler  │  │
  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
  └───────┼─────────────┼─────────────┼────────┘
          │             │             │
          └──────────┬──┴─────────────┘
                     ▼
          ┌──────────────────────┐
          │  Kafka: raw-articles │
          └────────┬─────────────┘
                   │
      ┌────────────┼─────────────┬──────────────┐
      ▼            ▼             ▼              ▼
  ┌────────┐  ┌────────┐  ┌──────────┐  ┌──────────┐
  │ Dedup  │  │Category│  │ Story    │  │ Breaking │
  │Service │  │Tagger  │  │Clustering│  │ News     │
  └────┬───┘  └────┬───┘  └────┬─────┘  │ Detector │
       │           │           │        └────┬─────┘
       └───────────┼───────────┘             │
                   │                         │
                   ▼                         ▼
          ┌──────────────────┐      ┌──────────────┐
          │  PostgreSQL      │      │ Push         │
          │  (article store) │      │ Notifications│
          └────────┬─────────┘      └──────────────┘
                   │
          ┌────────┴────────┐
          │  Redis          │
          │  (feed cache)   │
          └────────┬────────┘
                   │
          ┌────────┴────────┐
          │  API Service    │
          └─────────────────┘

Component 1: Hybrid Push/Pull Ingestion

RSS/Atom Polling (the primary path):

Most publishers provide RSS or Atom feeds. The poller:

Maintains a priority queue of feeds sorted by next-poll-time.
Respects each publisher's ttl field and robots.txt crawl delay.
Uses conditional GET requests to avoid downloading unchanged feeds:

GET /rss/top-stories HTTP/1.1
Host: cnn.com
If-None-Match: "abc123"
If-Modified-Since: Fri, 17 Apr 2026 14:30:00 GMT

If the feed has not changed, the server returns 304 Not Modified with no body, which saves bandwidth. Store the ETag and Last-Modified headers for each publisher feed and send them back with every poll. This reduces actual content fetches by 60-80%.
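A minimal sketch of the validator bookkeeping, kept separate from the HTTP client so it can run without a network. FeedState, conditional_headers, and handle_response are hypothetical names, and the response headers are assumed to arrive as a plain mapping with canonical header casing.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass
class FeedState:
    """Validators remembered from the last successful poll of one feed."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None

def conditional_headers(state: FeedState) -> dict:
    """Build request headers, omitting validators we have never seen."""
    headers = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers

def handle_response(state: FeedState, status: int,
                    response_headers: Mapping[str, str]) -> bool:
    """Return True if the body should be parsed; refresh validators on 200."""
    if status == 304:
        return False  # unchanged: nothing to download or parse
    if status == 200:
        state.etag = response_headers.get("ETag")
        state.last_modified = response_headers.get("Last-Modified")
        return True
    return False  # other statuses are left to the retry/backoff logic
```

On the first poll the state is empty, so the request is an ordinary GET. Every later poll carries whatever validators the publisher last returned.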

Tiered polling schedule:

Tier	Publishers	Poll Interval	Rationale
1 (major outlets)	500	5 minutes	CNN, BBC, NYT publish 50+ articles/day
2 (regional/medium)	5,000	15 minutes	Regional papers, medium sites
3 (small/niche)	20,000	30 minutes	Local papers, niche blogs
4 (weekly/monthly)	24,500	2 hours	Weekly columnists, monthly publications

Adaptive polling: Track each publisher's actual article frequency. If a "Tier 3" publisher suddenly starts publishing 20 articles/day, promote them to Tier 2 automatically.
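The priority queue and the adaptive promotion rule can be sketched with a min-heap keyed by next-poll time. The intervals come from the table above. The articles-per-day thresholds in tier_for_rate are illustrative assumptions, chosen only so that 50+ articles/day lands in Tier 1 and 20 articles/day lands in Tier 2 as in the examples.

```python
import heapq

# Poll interval in seconds per tier, from the tiered schedule above.
TIER_INTERVALS = {1: 300, 2: 900, 3: 1800, 4: 7200}

def tier_for_rate(articles_per_day: float) -> int:
    """Map an observed publishing rate to a tier (thresholds are assumptions)."""
    if articles_per_day >= 50:
        return 1
    if articles_per_day >= 10:
        return 2
    if articles_per_day >= 1:
        return 3
    return 4

class PollScheduler:
    """Min-heap of (next_poll_time, feed_url) entries."""

    def __init__(self):
        self._heap = []

    def add(self, feed_url: str, now: float) -> None:
        """Register a feed so that it is polled immediately."""
        heapq.heappush(self._heap, (now, feed_url))

    def due(self, now: float) -> list:
        """Pop every feed whose next-poll time has passed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

    def reschedule(self, feed_url: str, articles_per_day: float, now: float) -> int:
        """After a poll, re-tier from the observed rate and queue the next poll."""
        tier = tier_for_rate(articles_per_day)
        heapq.heappush(self._heap, (now + TIER_INTERVALS[tier], feed_url))
        return tier
```

Because the tier is recomputed on every reschedule, promotion and demotion happen as a side effect of normal polling rather than through a separate batch job.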

WebSub (W3C Standard Push Protocol):

WebSub (formerly PubSubHubbub) is the W3C standard for push-based feed updates:

Publisher declares a hub in their feed: <link rel="hub" href="https://hub.example.com" />
Our aggregator subscribes to the hub for that feed URL.
When the publisher updates the feed, they ping the hub.
The hub immediately pushes the new content to us via HTTP POST.

WordPress.com, Blogger, and many CMS platforms support WebSub natively. For publishers that support it, latency drops from minutes (polling interval) to seconds (push notification).

Use WebSub as the preferred path where a publisher supports it, and fall back to polling for publishers that do not. This is the same hybrid push/pull architecture used by podcast aggregators and many real news platforms.
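The subscriber side of the WebSub flow can be sketched with three small helpers: the form body for the subscription request, the reply to the hub's verification GET, and a check of the signature on pushed content. The function names and URLs are hypothetical. The hub.* parameter names and the "method=hexdigest" form of the X-Hub-Signature header follow the W3C WebSub Recommendation as I understand it, so confirm them against the spec before relying on this.

```python
import hashlib
import hmac
from urllib.parse import urlencode

def subscribe_body(topic_url: str, callback_url: str, secret: str,
                   lease_seconds: int = 86_400) -> str:
    """Form-encoded body for the subscription POST sent to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
        "hub.secret": secret,
        "hub.lease_seconds": str(lease_seconds),
    })

def verify_intent(query: dict, expected_topic: str):
    """Answer the hub's verification GET by echoing hub.challenge."""
    if (query.get("hub.mode") == "subscribe"
            and query.get("hub.topic") == expected_topic):
        return 200, query.get("hub.challenge", "")
    return 404, ""  # not a subscription we asked for

def signature_ok(body: bytes, header_value: str, secret: str) -> bool:
    """Check an X-Hub-Signature value such as 'sha256=<hex digest>'."""
    method, _, received = header_value.partition("=")
    if method not in ("sha1", "sha256", "sha384", "sha512"):
        return False
    digestmod = getattr(hashlib, method)
    expected = hmac.new(secret.encode(), body, digestmod).hexdigest()
    return hmac.compare_digest(expected, received)
```

Pushed articles that pass signature_ok can go straight onto the raw-articles topic, the same place the poller writes to.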

Component 2: Deduplication with SimHash

This is the component that makes or breaks the product. Without it, the feed is unusable.

Three levels of dedup:

Level 1: Exact URL dedup (trivial)

Hash the canonical URL. If we have seen it, skip it. This catches the same article crawled from the same URL twice. It misses the same content at different URLs (syndication).

Note: RSS guid is optional and not always unique across publishers. Do not rely on it as the sole dedup key. Use URL_hash + content_fingerprint as fallback identity.
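Level 1 depends on what "canonical" means, so here is one possible normalization before hashing. The list of tracking parameters is an assumption and would need tuning per publisher. The hash is truncated to a signed 64-bit value so that it fits a Postgres BIGINT column.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)          # assumed; extend as needed
TRACKING_PARAMS = {"fbclid", "gclid"}  # assumed; extend as needed

def canonicalize(url: str) -> str:
    """Lowercase scheme and host, drop fragment and tracking params, sort the query."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    ]
    query.sort()
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

def url_hash(url: str) -> int:
    """Signed 64-bit hash of the canonical URL."""
    digest = hashlib.sha256(canonicalize(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```

Two URLs that differ only in host casing, a trailing slash, or tracking parameters produce the same url_hash and are caught at this level. Syndicated copies on other domains fall through to SimHash.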

Level 2: Near-duplicate detection with SimHash (the core technique)

SimHash (Charikar, 2002) produces a 64-bit fingerprint where Hamming distance correlates with text similarity. Two articles with Hamming distance <= 3 (out of 64 bits) are likely near-duplicates.

How SimHash works:

Tokenize the article text into shingles (3-word sequences).
Hash each shingle to a 64-bit value.
For each bit position, if the shingle hash has a 1, add +1 to that position's accumulator. If 0, add -1.
The SimHash is the 64-bit value where each bit is 1 if the accumulator is positive, 0 otherwise.

Why it works: Similar documents share most shingles, so their accumulators are similar, producing similar SimHash values. The Hamming distance between two SimHashes correlates with the cosine distance (the angle) between the documents' shingle-count vectors. It does not directly estimate Jaccard distance; that is what MinHash does.
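The four steps translate almost line for line into code. This is a sketch under stated assumptions: the tokenizer is a simple regex, every shingle has weight 1, and BLAKE2b truncated to 8 bytes stands in for "a 64-bit hash". The fingerprint is an unsigned integer and would need converting to a signed value before being stored in a BIGINT column.

```python
import hashlib
import re

def shingles(text: str, k: int = 3):
    """Yield overlapping k-word shingles from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        if words:
            yield " ".join(words)  # very short text: one shingle
        return
    for i in range(len(words) - k + 1):
        yield " ".join(words[i:i + k])

def hash64(token: str) -> int:
    """64-bit hash of one shingle."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def simhash(text: str, k: int = 3) -> int:
    """64-bit SimHash: per-bit vote across all shingle hashes."""
    acc = [0] * 64
    for sh in shingles(text, k):
        h = hash64(sh)
        for bit in range(64):
            acc[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if acc[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A near-duplicate check is then hamming(simhash(a), simhash(b)) <= 3. Because the tokenizer lowercases and ignores punctuation, copies that differ only in casing or whitespace get identical fingerprints.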

Efficient lookup: To find near-duplicates (Hamming distance <= 3) without comparing against every existing article:

Divide the 64-bit SimHash into 4 blocks of 16 bits.
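For each article, store each block in a hash table keyed by (block position, block value), so every article has 4 index entries.
To check a new article, look up each of its 4 blocks at the same position. Any exact block match is a candidate near-duplicate. By the pigeonhole principle, two fingerprints that differ in at most 3 bits must agree exactly on at least one of the 4 blocks, so no true near-duplicate is missed.
For candidates, compute the full Hamming distance.

This reduces the check from a scan of all N stored fingerprints to 4 hash lookups plus a Hamming check on the candidates. It is not strictly O(1): with 16-bit blocks each bucket holds roughly N / 65,536 fingerprints, so a 30-day index of 30M articles yields on the order of 2,000 candidates per lookup, each costing one XOR and one popcount. At 1M articles/day, the dedup check adds negligible overhead.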
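A small in-memory version of the block index makes the lookup concrete. It assumes the four 16-bit blocks and the distance threshold of 3 described above. SimHashIndex is a hypothetical name, and a production index would live in a shared store rather than a Python dict.

```python
from collections import defaultdict

BLOCKS = 4
BLOCK_BITS = 16
MASK = (1 << BLOCK_BITS) - 1

def block_keys(fingerprint: int):
    """Split a 64-bit fingerprint into (position, 16-bit value) keys."""
    return [(i, (fingerprint >> (i * BLOCK_BITS)) & MASK) for i in range(BLOCKS)]

class SimHashIndex:
    def __init__(self, max_distance: int = 3):
        self.max_distance = max_distance
        # (block position, block value) -> fingerprints sharing that block
        self._buckets = defaultdict(set)

    def add(self, fingerprint: int) -> None:
        for key in block_keys(fingerprint):
            self._buckets[key].add(fingerprint)

    def near_duplicates(self, fingerprint: int) -> set:
        """Candidates share at least one block; keep those within max_distance."""
        candidates = set()
        for key in block_keys(fingerprint):
            candidates |= self._buckets.get(key, set())
        return {
            c for c in candidates
            if bin(c ^ fingerprint).count("1") <= self.max_distance
        }
```

With 4 blocks, any fingerprint within 3 bits of a stored one shares at least one unchanged block and is therefore found. A fingerprint that differs by one bit in each of the 4 blocks (distance 4) shares none and is correctly skipped.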

Google used SimHash for web search deduplication. It is directly applicable to news.

Level 3: Story clustering (different articles, same event)

After dedup, different articles about the same event should be grouped:

Extract entities from each article (people, places, organizations) using NLP.
Compare against existing story clusters from the last 48 hours.
Similarity function: 0.4 * entity_overlap + 0.3 * headline_similarity + 0.2 * temporal_proximity + 0.1 * source_category_match.
If similarity > 0.7, add to existing cluster. Otherwise, create new cluster.
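The weighted similarity and the 0.7 threshold can be sketched as follows. The weights, threshold, and 48-hour window come from the steps above. How each component is scored is not specified in the text, so the choices here are assumptions: Jaccard overlap for entities, difflib's ratio for headlines, a linear decay over the window for time, and an exact match for category. The dict field names are hypothetical.

```python
from difflib import SequenceMatcher

WEIGHTS = {"entity": 0.4, "headline": 0.3, "time": 0.2, "category": 0.1}
WINDOW_HOURS = 48
THRESHOLD = 0.7

def jaccard(a: set, b: set) -> float:
    """Shared elements divided by total distinct elements."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def similarity(article: dict, cluster: dict) -> float:
    entity = jaccard(set(article["entities"]), set(cluster["entities"]))
    headline = SequenceMatcher(
        None, article["title"].lower(), cluster["title"].lower()
    ).ratio()
    gap_hours = abs(article["published_ts"] - cluster["latest_ts"]) / 3600
    temporal = max(0.0, 1.0 - gap_hours / WINDOW_HOURS)  # linear decay (assumed)
    category = 1.0 if article["category"] == cluster["category"] else 0.0
    return (WEIGHTS["entity"] * entity + WEIGHTS["headline"] * headline
            + WEIGHTS["time"] * temporal + WEIGHTS["category"] * category)

def assign(article: dict, clusters: list):
    """Return the id of the best cluster above THRESHOLD, else None."""
    best = max(clusters, key=lambda c: similarity(article, c), default=None)
    if best is not None and similarity(article, best) > THRESHOLD:
        return best["id"]
    return None  # caller creates a new cluster
```

Comparing against every active cluster is a linear scan. In practice the candidate set would probably be narrowed first, for example by looking up clusters through an index on shared entities.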

Lead article selection: Within a cluster, rank by source_authority * freshness * depth * uniqueness. Show the lead article as the main headline, other articles as "See also from 47 sources."

Component 3: The Legal Content Model (Snippets Only)

Store only metadata per article:

CREATE TABLE articles (
    id             BIGSERIAL PRIMARY KEY,
    url            TEXT NOT NULL UNIQUE,
    url_hash       BIGINT NOT NULL,           -- for fast dedup
    simhash        BIGINT NOT NULL,           -- 64-bit fingerprint, stored as a signed value
    title          VARCHAR(500) NOT NULL,
    snippet        VARCHAR(500),              -- first 200 chars or RSS description
    thumbnail_url  TEXT,
    publisher_id   INTEGER REFERENCES publishers(id),  -- publishers table assumed to exist
    published_at   TIMESTAMPTZ NOT NULL,      -- time-zone aware, matches the UTC cursor values
    ingested_at    TIMESTAMPTZ DEFAULT NOW(),
    category       VARCHAR(50),
    region         VARCHAR(50),
    story_cluster_id INTEGER,
    language       VARCHAR(10) DEFAULT 'en',
    entity_tags    TEXT[]                     -- extracted entities for clustering
);

-- Supports the (region, category) feed query ordered by recency.
CREATE INDEX articles_feed_idx
    ON articles (region, category, published_at DESC, id DESC);

NO full_text column. The article text is used transiently for dedup/clustering during ingestion, then discarded. The user clicks through to the publisher's site to read the article. This lowers legal exposure (it does not eliminate it, since snippets alone have drawn lawsuits and link taxes) and dramatically reduces storage: 2 KB per article (metadata) vs 20-50 KB (full text). 30-day storage drops from 600 GB-1.5 TB to 60 GB.

Component 4: Cursor-Based Pagination

Offset-based pagination breaks when new articles are constantly being inserted at the top:

User loads page 1 (articles 1-20). New article inserted.
User loads page 2 (articles 21-40). But articles shifted: article 20 is now article 21.
User sees article 20 again on page 2.

Cursor-based pagination: Each response includes a cursor pointing to the last article seen. The next request asks for articles after that cursor.

Cursor implementation: Use the article's published_at timestamp + id as a composite cursor. Since multiple articles can share the same timestamp, the ID serves as a tiebreaker.

GET /feed?region=us&topic=tech&cursor=2026-04-17T14:30:00Z_12345&limit=20

The query becomes:

SELECT * FROM articles
WHERE region = 'us' AND category = 'tech'
  AND (published_at, id) < ('2026-04-17T14:30:00Z', 12345)
ORDER BY published_at DESC, id DESC
LIMIT 20;
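A sketch of the cursor handling around that query. encode_cursor and decode_cursor produce and parse the timestamp_id format shown in the example URL. next_page is an in-memory stand-in for the SQL keyset comparison so the behavior can be checked without a database. The sketch assumes timezone-aware datetimes and second-level precision; a real cursor would need sub-second precision, or a purely id-based cursor, if published_at stores fractions of a second.

```python
from datetime import datetime, timezone

CURSOR_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def encode_cursor(published_at: datetime, article_id: int) -> str:
    """Serialize the last row of a page as '<UTC timestamp>_<id>'."""
    ts = published_at.astimezone(timezone.utc).strftime(CURSOR_TIME_FORMAT)
    return f"{ts}_{article_id}"

def decode_cursor(cursor: str):
    """Parse a cursor back into (published_at, id)."""
    ts_part, _, id_part = cursor.rpartition("_")
    published_at = datetime.strptime(ts_part, CURSOR_TIME_FORMAT)
    return published_at.replace(tzinfo=timezone.utc), int(id_part)

def next_page(rows, cursor=None, limit=20):
    """In-memory equivalent of the keyset query; rows are (published_at, id)."""
    ordered = sorted(rows, reverse=True)  # published_at DESC, id DESC
    if cursor is not None:
        key = decode_cursor(cursor)
        ordered = [row for row in ordered if row < key]  # strictly after the cursor
    page = ordered[:limit]
    new_cursor = encode_cursor(*page[-1]) if page else None
    return page, new_cursor
```

Because the comparison is on the row's own (published_at, id) pair rather than on its position, newer articles inserted between two requests do not shift the second page.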

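Alternatively, use ULIDs (Universally Unique Lexicographically Sortable Identifiers) as article IDs. ULIDs encode a millisecond timestamp in the first 48 bits, so sorting by ULID is equivalent to sorting by the time the ID was generated. If IDs are minted at ingestion, that is ingestion order rather than published_at order, so this option fits only when the feed may be ordered by ingestion time. ULIDs are 128 bits, so the id column would also need a wider type than BIGSERIAL. The cursor is just the ULID of the last article: GET /feed?cursor=01H2X3Y4Z5...&limit=20.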
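To show why a ULID sorts by time, here is a compact generator that follows the ULID layout as I understand it: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford base32 characters. It is an illustration rather than a replacement for a maintained ULID library, and new_ulid and ulid_timestamp_ms are hypothetical names.

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # base32 alphabet without I, L, O, U

def new_ulid(timestamp_ms=None, randomness=None) -> str:
    """Build a 26-character ULID: 48-bit time then 80 bits of randomness."""
    ts = int(time.time() * 1000) if timestamp_ms is None else timestamp_ms
    rand = int.from_bytes(os.urandom(10), "big") if randomness is None else randomness
    value = (ts << 80) | rand
    chars = []
    for _ in range(26):  # 26 chars * 5 bits = 130 bits; the top 2 bits are zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def ulid_timestamp_ms(ulid: str) -> int:
    """Recover the millisecond timestamp from the first 10 characters."""
    value = 0
    for ch in ulid[:10]:
        value = (value << 5) | CROCKFORD.index(ch)
    return value
```

The alphabet is in ascending ASCII order, so plain string comparison of two ULIDs agrees with comparison of their timestamps. Within the same millisecond, order is decided by the random part.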

Component 5: Breaking News Detection

A breaking news event creates a measurable signal:

Volume spike: Multiple publishers write about the same topic within a 5-15 minute window.
Detection: Maintain a sliding window counter per entity/topic. If the count exceeds 3x the rolling average for that time-of-day, flag as "trending."
Example: A normal hour sees 5 articles mentioning "earthquake," which is less than 1 per 10-minute window. If 30 articles mention "earthquake" in 10 minutes, that is roughly a 36x spike over the baseline for the same window length -- breaking news.
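The sliding-window counter can be sketched with one deque of timestamps per entity. The 3x factor comes from the rule above. The baseline is assumed to be supplied separately as the expected count for the same window length, and min_count is an added guard (an assumption) so that rare entities do not trip on two or three mentions.

```python
from collections import defaultdict, deque

class SpikeDetector:
    def __init__(self, window_seconds: int = 600, spike_factor: float = 3.0):
        self.window = window_seconds
        self.factor = spike_factor
        self._events = defaultdict(deque)  # entity -> timestamps inside the window
        self.baseline = {}                 # entity -> expected count per window

    def observe(self, entity: str, ts: float, min_count: int = 5) -> bool:
        """Record one mention at time ts; return True if the entity is spiking."""
        q = self._events[entity]
        q.append(ts)
        while q[0] <= ts - self.window:    # drop mentions older than the window
            q.popleft()
        expected = self.baseline.get(entity, 0.0)
        return len(q) >= min_count and len(q) > self.factor * expected
```

The baseline has to be expressed per window, not per hour. A rolling average of 5 mentions per hour corresponds to a baseline of about 0.83 for a 10-minute window.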

Response actions:

Action	Purpose
Accelerate polling	Shorten the Tier 1 poll interval from 5 minutes to 1 minute temporarily
Cache invalidation	Force-refresh regional feed caches so breaking news appears within 30 seconds
Story promotion	Breaking story cluster gets a score boost, appearing at top regardless of personalization
Push notification	Users who opted in for breaking alerts get a notification
Cool-down	After 2+ hours, return to normal polling and ranking

Component 6: Feed Caching with CDC Invalidation

Pre-compute regional feeds in Redis sorted sets:

feed:us:tech -> sorted set of article IDs, scored by published_at
feed:us:sports -> ...
feed:eu:politics -> ...

Each sorted set holds the latest 2,000 articles for that (region, topic) pair. ZREVRANGE (or ZRANGE with the REV option on Redis 6.2+, where ZREVRANGE is deprecated) returns a page of results.

Cache invalidation via CDC: When a new article is inserted into Postgres, a CDC pipeline (Debezium on the Postgres WAL) publishes the change to Kafka. A cache worker reads the change and adds the article to the appropriate Redis sorted set(s).

This is better than TTL-based invalidation (which causes thundering herd when caches expire simultaneously) and better than application-level dual writes (which are fragile under failure).
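A sketch of the cache worker's core step. apply_change takes any client that exposes zadd and zremrangebyrank with redis-py style signatures. The event shape (an "op" code plus an "after" row image) is a simplified assumption modeled on Debezium's change envelope, and published_at_epoch is a hypothetical field name. InMemorySortedSets is a tiny stand-in so the example runs without a Redis server.

```python
MAX_FEED_LEN = 2000

def feed_key(region: str, category: str) -> str:
    return f"feed:{region}:{category}"

def apply_change(client, event: dict, max_len: int = MAX_FEED_LEN):
    """Apply one simplified CDC event to the feed cache; replaying it is harmless."""
    if event.get("op") != "c":  # this sketch only handles inserts
        return None
    row = event["after"]
    key = feed_key(row["region"], row["category"])
    client.zadd(key, {str(row["id"]): row["published_at_epoch"]})
    client.zremrangebyrank(key, 0, -(max_len + 1))  # keep only the newest max_len
    return key

class InMemorySortedSets:
    """Minimal stand-in for the Redis sorted-set commands used above."""

    def __init__(self):
        self.data = {}

    def _ranked(self, key):
        return sorted(self.data.get(key, {}).items(), key=lambda kv: (kv[1], kv[0]))

    def zadd(self, key, mapping):
        self.data.setdefault(key, {}).update(mapping)

    def zremrangebyrank(self, key, start, stop):
        ranked = self._ranked(key)
        n = len(ranked)
        lo = start if start >= 0 else max(n + start, 0)
        hi = stop if stop >= 0 else n + stop
        if hi < lo:
            return  # nothing to trim
        for member, _ in ranked[lo:hi + 1]:
            del self.data[key][member]

    def zrevrange(self, key, start, stop):
        ranked = list(reversed(self._ranked(key)))
        return [member for member, _ in ranked[start:stop + 1]]
```

Adding a member that already exists only rewrites its score, so processing the same change event twice leaves the sorted set unchanged. That property is what makes at-least-once delivery from Kafka acceptable here.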

Cache sizing:

10 regions * 25 categories * 2000 articles * 100 bytes = 50 MB
With personalization vectors: 100M users * 200 bytes = 20 GB
Total Redis: ~21 GB (single large instance + replicas)

Deep Dives

Figure: News Aggregation Clustering

Deep Dive 1: MinHash as an Alternative to SimHash

SimHash produces a single 64-bit fingerprint. MinHash produces a signature of k hash values and estimates Jaccard similarity between document shingle sets.

MinHash algorithm:

Create k hash functions.
For each document, compute the minimum hash value across all shingles for each hash function.
The MinHash signature is the vector of k minimum values.
The fraction of matching MinHash values between two documents approximates their Jaccard similarity.

Locality-Sensitive Hashing (LSH) with MinHash: Band the MinHash signature into b bands of r rows each. Hash each band. Documents that hash to the same bucket in any band are candidate near-duplicates. This enables sub-linear-time nearest-neighbor search.
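The MinHash steps and the banding scheme can be sketched as below. The k "hash functions" are simulated by salting one hash with a seed, which is a common shortcut and an assumption here. The values k = 128, 32 bands, and 4 rows are illustrative, not taken from the text. Shingle sets are assumed to be non-empty.

```python
import hashlib

def _seeded_hash(seed: int, token: str) -> int:
    """One member of a family of 64-bit hash functions, selected by seed."""
    data = seed.to_bytes(4, "big") + token.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(shingle_set, k: int = 128) -> list:
    """For each of k hash functions, keep the minimum hash over all shingles."""
    return [min(_seeded_hash(seed, s) for s in shingle_set) for seed in range(k)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def lsh_bucket_keys(signature: list, bands: int = 32, rows: int = 4) -> list:
    """Split the signature into bands; each band becomes one bucket key."""
    if bands * rows != len(signature):
        raise ValueError("bands * rows must equal the signature length")
    return [
        (b, tuple(signature[b * rows:(b + 1) * rows]))
        for b in range(bands)
    ]
```

Two documents become candidates when any of their bucket keys are equal. More rows per band makes a match stricter, and more bands gives more chances to match, which is how the similarity threshold is tuned.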

SimHash vs MinHash:

Aspect	SimHash	MinHash
Fingerprint size	64 bits (fixed)	k * 64 bits (configurable)
Similarity metric	Cosine similarity	Jaccard similarity
Accuracy	Good for long documents	Better for short documents
Space	8 bytes per document	k * 8 bytes per document
Lookup	Hamming distance (bit manipulation)	LSH (hash table)
Used by	Google (web dedup)	Many document systems

For news articles (300-1000 words): SimHash is the better choice. It is simpler, uses constant space (8 bytes per article), and the 64-bit Hamming distance check is an XOR followed by a single population-count instruction (POPCNT). MinHash is better for shorter documents (tweets, comments) where Jaccard similarity is more meaningful.

Deep Dive 2: Conditional GET Saves 60-80% Bandwidth

Without conditional GET, every poll downloads the full feed (average ~50 KB). With 78,250 polls/hour, that is 3.9 GB/hour of bandwidth -- most of it wasted on unchanged feeds.

How it works:

First request returns the feed plus ETag: "abc123" and Last-Modified: Fri, 17 Apr 2026 14:30:00 GMT headers.
Subsequent requests include If-None-Match: "abc123" and If-Modified-Since: Fri, 17 Apr 2026 14:30:00 GMT.
If the feed has not changed, the server returns 304 Not Modified (no body).

Savings: Most feeds do not change between polls. At a 70% conditional-hit rate, bandwidth drops from 3.9 GB/hour to 1.2 GB/hour. More importantly, the publisher's server does less work (just an ETag comparison instead of generating the full feed), which is polite and reduces the chance of being rate-limited or blocked.

Store (publisher_id, feed_url, last_etag, last_modified) in a small Postgres table. Update after each successful poll.

Deep Dive 3: Feed Diversity Enforcement

Without diversity constraints, a user's feed might show 10 CNN articles in a row simply because CNN published the most recently. This is a poor user experience.

Diversity rules:

Maximum 2-3 articles from the same publisher per page of 20 results.
Ensure at least 3 different categories appear on the first page.
Breaking news overrides diversity constraints (show the breaking story regardless).

Implementation: After retrieving the top-40 articles by score, apply a greedy diversification pass:

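from collections import defaultdict

def diversify(top_40_by_score, page_size=20, max_per_publisher=3):
    # Greedy pass over candidates that are already sorted by score, best first.
    diversified_feed = []
    publisher_counts = defaultdict(int)  # missing keys start at 0 (a plain dict raises KeyError)
    category_counts = defaultdict(int)

    for article in top_40_by_score:
        if publisher_counts[article.publisher] >= max_per_publisher:
            continue  # publisher cap reached
        if len(diversified_feed) < 5 and category_counts[article.category] >= 2:
            continue  # ensure variety in the first few results
        diversified_feed.append(article)
        publisher_counts[article.publisher] += 1
        category_counts[article.category] += 1
        if len(diversified_feed) == page_size:
            break
    return diversified_feed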

This sacrifices strict score ordering for a better user experience. The over-fetch (40 candidates for 20 results) ensures enough diversity candidates.

Alternative Designs

Alternative 1: Fan-Out-On-Write (Wrong for This Problem)

Some candidates propose the Twitter fan-out model: when a new article is ingested, push it to the feed of every user who might be interested.

Why this is wrong: There is no follow graph. Users do not "follow" publishers (or if they do, it is a minor feature). The primary feed is by region and topic, not by subscription. Fan-out-on-write to 100M users for every article (even filtered by region) is massive write amplification for a system with only 12 articles/sec ingestion. Read-based serving (query at read time from pre-computed regional feeds) is far simpler and sufficient.

Alternative 2: Elasticsearch for Everything

Use ES for both article storage and search. Ingest articles into ES, query with filters (region, category, date range), sort by freshness or relevance.

When this makes sense: If full-text search is a primary feature (users search for specific topics or keywords). For a browse-based feed (users scroll through categorized lists), Redis sorted sets are simpler and faster.

Alternative 3: Skip Dedup, Use Human Curation

Some news aggregators (Apple News, Flipboard) rely on editorial curation instead of automated dedup. Human editors select stories and group them manually.

Why this does not scale: 1M articles/day from 50,000 publishers. Even with a 100-person editorial team, each editor would need to review 10,000 articles/day. Automated dedup + clustering is the only viable approach at this scale. Editorial curation can supplement automation (e.g., Google News Showcase) but cannot replace it.

Scaling Math

Ingestion Pipeline

Polls/sec:               22 (trivial for a single server)
Articles ingested/sec:   12
SimHash computation:     ~1 ms per article (tokenize + hash)
SimHash lookup:          ~0.01 ms per article (hash table lookup)
Story clustering:        ~5 ms per article (entity comparison against ~200K active clusters)
Total ingestion latency: ~10 ms per article
Single server handles:   100 articles/sec (8x headroom)

Serving

Feed QPS (peak):         36,000
Redis throughput:        single instance handles 100K+ ops/sec
Replicas needed:         1 (with a backup for availability)
Cache hit rate:          ~85% (same region/topic queries)
Postgres QPS:            36,000 * 0.15 = 5,400 (cache misses, easily handled by 2 replicas)

Storage (30-day retention)

Articles:                1M/day * 30 * 2 KB = 60 GB
SimHash index:           30M articles * 8 bytes = 240 MB
Story clusters:          200K clusters * 1 KB = 200 MB
Publisher metadata:      50K * 500 bytes = 25 MB
User preferences:        100M * 200 bytes = 20 GB (Redis)
Total Postgres:          ~62 GB (single instance with 128 GB RAM)

Failure Analysis

Failure	Impact	Mitigation
Publisher RSS feed goes down	No new articles from that publisher.	Circuit breaker per publisher. After 5 consecutive failures, stop polling temporarily. Retry with exponential backoff. One broken publisher does not affect others.
SimHash service fails	Duplicate articles enter the feed.	Queue articles for later dedup processing. Users see some duplicates temporarily (annoying but not catastrophic). Fix and reprocess the queue.
Story clustering service fails	Articles appear as individual items instead of grouped clusters.	Degrade to showing unclustered articles. Clustering can be applied retroactively when the service recovers.
Postgres primary goes down	No new articles ingested. Reads continue from replicas.	Automated failover within 30 seconds. During outage, articles queue in Kafka (72-hour retention). Replay after recovery.
Redis cache goes down	All feed requests hit Postgres directly. Latency increases from <1ms to 5-10ms.	Redis Sentinel for failover. Postgres can handle the load -- just slower.
Breaking news overwhelms the system	Spike in publisher activity + user traffic simultaneously.	Breaking news detection accelerates polling and invalidates caches. Auto-scale API servers. Redis absorbs the read spike.
Publisher blocks our crawler	Loss of content from that publisher.	Identify the crawler clearly (User-Agent). Respect robots.txt. Maintain good relationships with publishers. Fall back to WebSub if crawling is blocked.
Stale feed after cache invalidation failure	Users see old articles at the top.	Dual path: CDC-based invalidation + TTL-based fallback (5-minute TTL). If CDC fails, TTL ensures eventual freshness.
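The per-publisher circuit breaker from the first row can be sketched as a small state object. The threshold of 5 consecutive failures comes from the table. The 60-second base delay and 1-hour cap are assumptions, and PublisherBreaker is a hypothetical name.

```python
class PublisherBreaker:
    """Stops polling one publisher after repeated failures, with exponential backoff."""

    def __init__(self, failure_threshold: int = 5,
                 base_delay: float = 60.0, max_delay: float = 3600.0):
        self.threshold = failure_threshold
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0
        self.retry_at = 0.0

    def allow(self, now: float) -> bool:
        """True if this publisher may be polled at time now."""
        return now >= self.retry_at

    def record_success(self) -> None:
        """Any successful poll closes the breaker."""
        self.failures = 0
        self.retry_at = 0.0

    def record_failure(self, now: float) -> float:
        """Count a failure; once over the threshold, double the wait each time."""
        self.failures += 1
        if self.failures >= self.threshold:
            exponent = self.failures - self.threshold
            delay = min(self.base_delay * 2 ** exponent, self.max_delay)
            self.retry_at = now + delay
        return self.retry_at
```

Keeping one breaker per publisher is what isolates failures: a dead feed only delays its own next poll and never blocks the scheduler for other publishers.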

Level Expectations

Level	What the Interviewer Expects
Mid (L4)	RSS polling, Postgres storage, basic API with pagination. Sort by recency. Mention caching. This passes but misses the point.
Senior (L5)	Everything above plus: Tiered polling with adaptive frequency. Cursor-based pagination (not offset). Redis sorted sets for feed caching. CDC-based cache invalidation. Mention deduplication as a concern. WebSub for push-based ingestion. Conditional GET for bandwidth savings.
Staff (L6)	Everything above plus: SimHash for near-duplicate detection with algorithmic explanation. Story clustering with entity extraction. Legal content model (snippets only, cite the AFP lawsuit or EU Copyright Directive). Breaking news detection via volume spike detection. Feed diversity enforcement. MinHash vs SimHash comparison. Quantified polling math (22 polls/sec, conditional GET saves 60-80%). Cursor implementation with ULIDs or composite timestamps.
