Design a Web Crawler

TL;DR

A web crawler downloads web pages at scale, extracts links, and follows them to discover more pages. The interesting engineering has nothing to do with requests.get(). It is the URL frontier (how you decide what to crawl next), politeness (how you avoid hammering individual domains), deduplication (how you avoid re-crawling the same page), and DNS resolution (which, unintuitively, is the single biggest bottleneck -- it accounts for 70% of thread time in naive implementations). Google's Mercator paper and the IRLbot paper are the canonical references. Crawling 10 billion pages in 5 days means processing 23,000 pages per second, and at that rate, every microsecond of waste multiplies to days of additional crawl time.

The System

A web crawler starts with a set of seed URLs, fetches those pages, extracts all hyperlinks from the HTML, adds the new URLs to a queue, and repeats. The result is a corpus of downloaded pages (stored for indexing, analysis, or archival) and a graph of link relationships between pages.

Google's original crawler (1998) fetched about 200 pages per second on a handful of machines. By 2008, Google was crawling over 1 trillion unique URLs. The Internet Archive's Heritrix crawler archives billions of pages for the Wayback Machine. Common Crawl, an open dataset, contains petabytes of web data crawled from 3.5+ billion pages per monthly crawl. At this scale, the crawler is not a script -- it is a distributed system with its own storage layer, scheduling algorithms, and failure recovery mechanisms. Scrapy, the popular Python framework, handles single-machine crawling well but was never designed for billion-page scale.

Requirements

Functional

Crawl web pages: Given seed URLs, recursively discover and download pages by following hyperlinks
Respect robots.txt: Obey crawl directives from site owners (disallow rules, crawl-delay)
URL normalization: Treat equivalent URLs as the same (e.g., http://example.com and http://example.com/ and HTTP://EXAMPLE.COM)
Content storage: Store downloaded pages with metadata (URL, fetch time, HTTP status, content hash)
Recrawl scheduling: Revisit pages that change frequently more often than static pages
Priority crawling: Crawl "important" pages first (higher PageRank, more inlinks, fresher content)

Non-Functional

Throughput: 10 billion pages in 5 days = 23,148 pages/sec sustained
Politeness: No more than 1 request per second per domain (configurable per robots.txt crawl-delay)
Storage: Average page size 100 KB after compression. 10B pages = 1 PB of raw content
Deduplication: Detect and skip duplicate URLs and duplicate content (near-duplicate pages across different URLs)
Robustness: Handle spider traps (infinite URL spaces), timeouts, malformed HTML, and server errors without crashing
Distributed: Scale horizontally by adding more crawler machines

Back-of-Envelope Math

Throughput:
  10B pages / 5 days = 10,000,000,000 / 432,000 seconds = 23,148 pages/sec
  With 100 crawler machines: 232 pages/sec per machine
  Each page fetch: DNS (50ms) + TCP connect (50ms) + HTTP request (200ms) = 300ms
  At 300ms per page, need 232 * 0.3 = ~70 concurrent threads per machine
  With connection pooling and async I/O: feasible

Storage:
  10B pages * 100 KB (compressed) = 1 PB
  S3 at $0.023/GB/month: 1 PB = $23,000/month
  HDFS on 50 nodes with 20 TB each = 1 PB. More cost-effective for batch access.

URL frontier:
  Discovered URLs (not yet crawled): could be 100B+ (10x the crawled set)
  Each URL: average 80 bytes
  100B * 80 bytes = 8 TB
  Does NOT fit in memory. Must be on disk with efficient access patterns.

DNS:
  23K pages/sec with unique domains: up to 23K DNS lookups/sec
  Public DNS resolvers (8.8.8.8): ~50ms per lookup
  23K * 50ms = 1,150 seconds of DNS time per second
  Need: local DNS cache + dedicated DNS resolver infrastructure

Bandwidth:
  23K pages/sec * 500 KB (uncompressed average) = 11.5 GB/sec
  With compression (Accept-Encoding: gzip): ~100 KB/sec * 23K = 2.3 GB/sec
  Per machine (100 machines): 23 MB/sec = 184 Mbps. Manageable.

The Naive Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  URL Queue   │────>│   Crawler    │────>│  Storage     │
│  (in-memory) │<────│  (single)    │     │  (disk)      │
└──────────────┘     └──────────────┘     └──────────────┘

queue = deque(seed_urls)
visited = set()

while queue:
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)
    html = fetch(url)
    save(url, html)
    for link in extract_links(html):
        normalized = normalize(link)
        if normalized not in visited:
            queue.append(normalized)

This is a breadth-first crawl on a single machine. It discovers pages, stores them, and avoids revisiting the same URL. For crawling a single website of 10,000 pages, this works.

Where Does This Break First?

The visited set. At 10 billion URLs, storing 80-byte URLs in a hash set requires ~800 GB of RAM plus hash table overhead. You cannot keep the visited set in memory. But even before that, the single-threaded fetcher is the bottleneck: at 300ms per page, you crawl 3.3 pages/sec. At that rate, 10 billion pages takes 96 years.

Where It Breaks

Problem 1: DNS is 70% of crawl time. This is the thing nobody expects. In a naive crawler, every fetch(url) call starts with a DNS lookup. DNS lookups to public resolvers take 20-100ms. If you are crawling 23K pages/sec from 23K different domains, that is 23K DNS lookups/sec, each blocking a thread for 50ms. The total "DNS wait time" across all threads is 1,150 thread-seconds per wall-clock second. DNS resolution, not HTTP fetching, is the dominant cost.

Problem 2: Politeness is an architectural constraint, not a feature. You cannot send 100 requests/sec to wikipedia.org just because your queue has 100 Wikipedia URLs at the front. The robots.txt Crawl-delay directive (and basic decency) requires spacing requests to the same domain. This means the frontier must be organized by domain, not just by priority. A naive FIFO queue ignores this entirely.

Problem 3: URL deduplication at 100 billion URLs. The discovered URL set (URLs you have seen, whether or not you have crawled them) can be 10-100x the crawled set. Storing 100 billion URLs in a hash set is impossible. You need a probabilistic data structure (Bloom filter) or an on-disk dedup strategy.

Problem 4: Spider traps. Some websites generate infinite URLs. A calendar page might have links for every day into the future: /calendar/2025/04/18, /calendar/2025/04/19, ..., /calendar/3025/04/18. A crawler that blindly follows all links will get stuck generating trillions of URLs from a single domain. URL path depth limits and per-domain URL budgets are essential.

Problem 5: Content deduplication. The same article may be accessible at 5 different URLs (with/without www, with/without trailing slash, with different query parameters, mirrored on multiple domains). URL normalization catches some of these, but not all. You need content-level dedup (hash the page body) to avoid storing 5 copies of the same content.

The Real Design

                    ┌──────────────────────────────────────┐
                    │         Seed URLs / Re-crawl List     │
                    └────────────────┬─────────────────────┘
                                     │
                                     v
         ┌──────────────────────────────────────────────────┐
         │              URL Frontier (Mercator)              │
         │  ┌────────────────────┐  ┌────────────────────┐  │
         │  │  Front Queues      │  │  Back Queues       │  │
         │  │  (priority-based)  │──│  (per-domain       │  │
         │  │  F1: high priority │  │   politeness)      │  │
         │  │  F2: medium        │  │  B1: wikipedia.org │  │
         │  │  F3: low           │  │  B2: reddit.com    │  │
         │  │                    │  │  B3: nytimes.com   │  │
         │  └────────────────────┘  └────────────────────┘  │
         └──────────────────────┬───────────────────────────┘
                                │
                   ┌────────────┼────────────┐
                   │            │            │
          ┌────────v──┐  ┌─────v─────┐  ┌───v────────┐
          │ Fetcher 1  │  │ Fetcher 2 │  │ Fetcher N  │
          │ (thread    │  │           │  │            │
          │  pool)     │  │           │  │            │
          └────────┬───┘  └─────┬─────┘  └───┬────────┘
                   │            │            │
                   v            v            v
         ┌──────────────────────────────────────────────────┐
         │                  DNS Cache                        │
         │  (local resolver + TTL-based cache)               │
         └──────────────────────────────────────────────────┘
                                │
                                v
         ┌──────────────────────────────────────────────────┐
         │              Content Processing                   │
         │  HTML parse → link extract → URL normalize        │
         │       → content dedup (SimHash) → store           │
         └────────────────┬─────────────────────────────────┘
                          │
              ┌───────────┼───────────┐
              │           │           │
     ┌────────v──┐  ┌─────v─────┐  ┌──v───────────┐
     │  URL Seen │  │  Content  │  │  Page Store   │
     │  Filter   │  │  Store    │  │  (S3/HDFS)    │
     │  (Bloom)  │  │  (hashes) │  │               │
     └───────────┘  └───────────┘  └───────────────┘

Mercator Two-Level Frontier

The Mercator frontier (named after Google's Mercator crawler) separates URL scheduling into two stages: what to crawl (priority) and when to crawl it (politeness). This separation is the single most important architectural decision in a web crawler.

Front queues (priority)

URLs enter one of F front queues based on priority. Priority is determined by:

PageRank or estimated page importance
How recently the page was modified (fresher = higher priority)
URL depth from seed (shallower = higher priority)
Explicit priority from the crawl configuration

F1 (high):   Homepage of top-1000 domains, known news sources
F2 (medium): Internal pages of top-10K domains, pages with many inlinks
F3 (low):    Deep pages, long-tail domains, pages with few inlinks

A front-queue selector picks from F1 with 60% probability, F2 with 30%, F3 with 10%. This ensures high-priority pages get crawled first while making progress on lower-priority pages.

Back queues (politeness)

There are B back queues, one per domain (or per IP, depending on configuration). Each back queue has a "do not crawl before" timestamp.

B_wikipedia.org: [url1, url2, url3] | next_allowed = 12:00:05
B_reddit.com:    [url4, url5]       | next_allowed = 12:00:03
B_nytimes.com:   [url6]             | next_allowed = 12:00:07

When a fetcher thread is ready, it picks the back queue with the earliest next_allowed time (min-heap over back queues). It dequeues one URL, fetches it, and sets next_allowed = now + crawl_delay (default 1 second, or whatever robots.txt specifies).

Mapping from front to back: When a URL exits a front queue, it is routed to the back queue for its domain. If no back queue exists for that domain, one is created.

Why this two-level design matters: It decouples priority from politeness. A high-priority URL for a slow-to-crawl domain (crawl-delay: 10s) does not block crawling of lower-priority URLs on faster domains. The fetcher threads are always busy -- they grab whichever domain is "ready" next, not whichever URL is highest priority.

DNS: The Hidden Bottleneck

DNS resolution is the #1 performance killer in naive web crawlers. Here is why and how to fix it.

The problem quantified: At 23K pages/sec, assuming 50% of URLs are on new domains (the rest share domains with already-cached entries), you need ~11.5K DNS lookups/sec. A single DNS recursive resolver handles about 10K queries/sec. You need multiple resolvers, and you need aggressive caching.

Solution: Dedicated DNS infrastructure

Layer 1: In-memory DNS cache (per crawler machine)
  - 10M entries * 100 bytes = 1 GB
  - Cache hit rate: ~70% (many URLs share domains)
  - TTL: respect DNS TTL, minimum 300 seconds

Layer 2: Local recursive resolver (e.g., Unbound)
  - Runs on each crawler machine
  - Caches resolved records locally
  - Handles 10K+ queries/sec per instance

Layer 3: Authoritative DNS pre-fetch
  - Batch-resolve domains from the frontier before they are needed
  - A background thread scans upcoming URLs in back queues
  - Pre-fetches DNS for domains that will be crawled in the next 60 seconds

DNS pre-fetching: This is the key optimization. While fetcher threads are downloading pages, a separate DNS thread resolves domains for URLs that are next in the back queues. By the time the fetcher is ready for the next URL, the DNS record is already cached. This removes DNS from the critical path entirely.

The IRLbot paper (Texas A&M, 2009) measured that DNS pre-fetching improved crawler throughput by 3-4x compared to on-demand resolution. Google's crawling infrastructure runs its own DNS resolvers, pre-warmed with the domains it plans to crawl in the next batch.

Bloom Filter for URL Deduplication

At 10 billion URLs, you cannot store every URL in a hash set. A Bloom filter provides approximate set membership testing with controllable false positive rate and fixed memory.

Bloom filter sizing:

n = 10 billion URLs (items)
p = 1% false positive rate (1 in 100 unseen URLs incorrectly marked as seen)

Optimal number of bits: m = -n * ln(p) / (ln(2))^2
  m = -10B * ln(0.01) / (0.693)^2
  m = -10B * (-4.605) / 0.480
  m = 95.9 billion bits = 12 GB

Optimal number of hash functions: k = (m/n) * ln(2)
  k = (95.9B / 10B) * 0.693 = 6.6 -> 7 hash functions

12 GB for 10 billion URLs at 1% false positive rate. This fits in memory on a single machine. The 1% false positive rate means 1% of never-seen URLs will be incorrectly skipped -- you lose 1% of the web. For a search engine crawler, this is acceptable. For a completeness-critical archival crawler, use 0.1% (which costs 16 GB).

The false positive trade-off: A false positive means "this URL was probably already crawled" when it was not. The page is skipped. A false negative is impossible with Bloom filters -- if the filter says "not seen," the URL has definitely not been seen. This is the right error direction: you might miss a page (false positive), but you will never crawl the same page twice.

Scaling beyond 10B URLs: At 100B URLs, the Bloom filter is 120 GB. This does not fit on one machine. Partition the Bloom filter by URL hash: URLs hashing to 0-9 go to partition 0, 10-19 to partition 1, etc. Each partition is queried independently. This adds a network hop but keeps each partition under 12 GB.

URL Normalization

Two URLs that look different might point to the same page. If you do not normalize, you crawl the same page multiple times.

The 9 normalization techniques:

1. Lowercase the scheme and host
   HTTP://Example.COM/Page -> http://example.com/Page

2. Remove default port
   http://example.com:80/page -> http://example.com/page
   https://example.com:443/page -> https://example.com/page

3. Remove trailing slash on root
   http://example.com/ -> http://example.com

4. Percent-encode special characters consistently
   http://example.com/a b -> http://example.com/a%20b

5. Decode unreserved percent-encoded characters
   http://example.com/%7Euser -> http://example.com/~user

6. Remove fragment (#anchor)
   http://example.com/page#section2 -> http://example.com/page
   (Fragments are client-side only, servers never see them)

7. Sort query parameters
   http://example.com/search?b=2&a=1 -> http://example.com/search?a=1&b=2

8. Remove tracking parameters
   http://example.com/page?utm_source=twitter&utm_medium=social 
   -> http://example.com/page
   (Known tracking params: utm_*, fbclid, gclid, ref, source)

9. Canonicalize path (resolve . and ..)
   http://example.com/a/b/../c -> http://example.com/a/c

URL normalization is applied before the Bloom filter check. Without normalization, the Bloom filter would store http://example.com/ and http://example.com as two separate entries, and the crawler would fetch the page twice.

Robots.txt Compliance

Before crawling any URL on a domain, fetch and parse its robots.txt. Cache the result for the TTL specified in the HTTP response (or 24 hours as default).

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

User-agent: Googlebot
Allow: /admin/public/
Crawl-delay: 0.5

Key behaviors:

Check robots.txt once per domain, cache the result
Obey Crawl-delay (sets the minimum interval between requests to this domain)
Respect Disallow directives (do not crawl these paths)
Handle missing robots.txt (treat as "everything allowed")
Handle robots.txt errors (5xx response: assume disallow everything, retry later. 4xx response: assume everything allowed)

Spider traps and robots.txt: Some spider traps are intentional (honeypots that mark crawlers as malicious). Others are accidental (CMS generating infinite pagination). Beyond robots.txt, use these heuristics to detect traps:

URL path depth > 15: probably a trap
More than 10,000 URLs from a single domain: apply a per-domain budget
URLs with repetitive path patterns (/a/b/a/b/a/b/...): detect and skip
Pages with zero outlinks but identical structure to pages with links: likely generated content

Deep Dives

Web Crawler — Web Crawler High-Level Design

Deep Dive 1: Content Deduplication with SimHash

URL normalization catches syntactically equivalent URLs. But the same content can live at completely different URLs (mirrors, syndicated articles, scraped content). You need content-level deduplication.

Exact dedup: SHA-256 hash of page body

content_hash = sha256(page_body.encode()).hexdigest()
if content_hash in content_hash_set:
    skip  # exact duplicate

This catches exact duplicates. But many near-duplicates differ by a header, footer, sidebar ad, or timestamp. "Updated 5 minutes ago" vs "Updated 10 minutes ago" makes the entire SHA-256 hash different even though the actual content is identical.

Near-dedup: SimHash (Charikar's algorithm)

SimHash produces a 64-bit fingerprint such that similar documents have similar fingerprints. Two documents with Hamming distance <= 3 in their SimHash fingerprints are considered near-duplicates.

Document A: "the quick brown fox jumps over the lazy dog"
Document B: "the quick brown fox jumps over a lazy dog"
  (differs by 1 word out of 9)

SimHash(A) = 0x1A2B3C4D5E6F7081
SimHash(B) = 0x1A2B3C4D5E6F7089
Hamming distance: 1 bit (bit 3 differs)
Threshold: 3 bits -> these are near-duplicates

Practical implementation: Google reported using SimHash for near-duplicate detection in their crawler. At 10 billion pages, you need to efficiently find all existing fingerprints within Hamming distance 3 of a new fingerprint. Divide the 64-bit fingerprint into 4 blocks of 16 bits each. If two fingerprints have Hamming distance <= 3, at least one of their 4 blocks must be identical (pigeonhole principle). Store fingerprints in 4 tables indexed by each block. To check a new fingerprint, look up its 4 blocks and check Hamming distance against candidates.

Deep Dive 2: Recrawl Scheduling

Not all pages change at the same rate. News sites update every minute. A university professor's homepage updates once a year. Crawling both at the same frequency wastes resources.

Adaptive recrawl frequency:

Track how often a page's content hash changes between crawls. Adjust the recrawl interval based on observed change frequency.

def calculate_recrawl_interval(page):
    if page.change_count == 0:
        return min(page.current_interval * 2, MAX_INTERVAL)  # slow down

    change_rate = page.change_count / page.total_crawls
    if change_rate > 0.8:  # changed 80%+ of the time we checked
        return max(page.current_interval / 2, MIN_INTERVAL)  # speed up
    elif change_rate < 0.2:  # changed < 20% of the time
        return min(page.current_interval * 1.5, MAX_INTERVAL)  # slow down
    else:
        return page.current_interval  # keep current

Typical intervals: - News homepages: 5-15 minutes - Social media profiles: 1-6 hours - Blog posts: 1-7 days - Static pages (terms of service, about): 14-30 days

HTTP conditional requests: Use If-Modified-Since and If-None-Match (ETags) headers. If the server responds with 304 Not Modified, the page has not changed, and you save bandwidth (no body transferred) and processing time (no parsing/dedup needed). Support for conditional requests varies by server but is worth implementing -- it reduces bandwidth by 30-50% for recrawls.

Deep Dive 3: Distributed Architecture at Scale

At 23K pages/sec, you need 50-100 crawler machines working in parallel. The coordination challenges are significant.

Partitioning the URL space: Assign domains to crawler machines using consistent hashing on the domain name. All URLs from wikipedia.org go to the same crawler machine. This ensures politeness (one machine enforces the crawl delay for a domain) and simplifies the Bloom filter (each machine only checks URLs in its partition).

Shared vs. partitioned Bloom filter: Two options.

Option A: Each machine has its own Bloom filter for its URL partition. Size: 12 GB / 100 machines = 120 MB per machine. Simple, no coordination.

Option B: Centralized Bloom filter service that all machines query. More accurate (catches cross-partition duplicates from redirects) but adds a network hop.

I would go with Option A (partitioned) for most deployments. Cross-partition duplicates are rare (< 1% of URLs) and the simplicity is worth the slight duplication.

Frontier distribution: The Mercator frontier runs on each machine for its domain partition. When a page from wikipedia.org (assigned to machine 5) contains a link to reddit.com (assigned to machine 12), machine 5 sends the URL to machine 12's frontier via an internal message queue (Kafka, or simple TCP).

Machine 5 crawls wikipedia.org/page
  Finds link: https://reddit.com/r/programming
  Sends to Kafka topic: "frontier-machine-12"
Machine 12 consumes from "frontier-machine-12"
  Checks Bloom filter: not seen
  Adds to frontier for reddit.com back queue

Failure recovery: If a crawler machine dies, its domains need to be reassigned. With consistent hashing, they go to the next machine on the ring. The new machine does not have the dead machine's Bloom filter, so it may re-crawl some URLs. This is acceptable -- re-crawling 1/100 of the URL space is better than losing an entire domain partition.

Alternative Designs

Alternative 1: Serverless Crawler (Lambda + SQS + DynamoDB)

Each URL is an SQS message. A Lambda function consumes it, fetches the page, extracts links, and publishes new URLs back to SQS. DynamoDB stores the URL seen set.

Alternative 2: MapReduce-Based Batch Crawl

Divide seed URLs into chunks. Map phase: fetch all URLs in a chunk. Reduce phase: extract links, dedup against the seen set, produce the next batch of URLs. Repeat for N iterations.

Alternative 3: Headless Browser Crawler (for JavaScript-Heavy Sites)

Replace the HTTP fetcher with headless Chrome (Puppeteer/Playwright). Necessary for SPAs where content is rendered client-side. Much slower (2-5 seconds per page vs. 200ms for HTTP) but sees the actual rendered content.

Aspect	Mercator (distributed)	Serverless (Lambda+SQS)	MapReduce Batch	Headless Browser
Throughput	23K pages/sec	~5K pages/sec	10K+ pages/sec (batch)	~500 pages/sec
Politeness control	Precise (back queues)	Hard (Lambda concurrency)	Batch-level only	Precise
Cost (10B pages)	~$50K (infra)	~$200K (Lambda charges)	~$80K (compute hours)	~$500K (slow, many CPUs)
JavaScript rendering	No	No	No	Yes
Operational complexity	High (custom system)	Low (managed services)	Medium (Hadoop/Spark)	High (Chrome fleet)
Recrawl scheduling	Built-in (frontier)	External scheduler	New MapReduce job	Built-in

The Mercator design is the right answer for most crawling interviews. Mention headless browser if the interviewer asks about JavaScript-rendered content (most modern SPAs). The serverless approach is surprisingly expensive at scale because Lambda pricing is per-invocation, and 10 billion invocations adds up fast.

Scaling Math Verification

23K pages/sec across 100 machines:

Per machine: 231 pages/sec
With 100 threads per machine, each thread handles 2.31 pages/sec
Each page takes ~300ms (with DNS caching): thread utilization = 2.31 * 0.3 = 69%. Healthy.
Per machine bandwidth: 231 * 100 KB = 23.1 MB/sec = 185 Mbps. 1 Gbps NIC handles this.

Bloom filter accuracy:

12 GB for 10B URLs at 1% FPR
False positives: 10B * 0.01 = 100M URLs incorrectly skipped
100M / 10B = 1% of the web missed. Acceptable for search; might not be for archival.

Frontier size:

Active URLs in frontier: up to 100B (10x crawled set, from discovered but not-yet-crawled links)
100B * 80 bytes per URL = 8 TB
Must be disk-backed. BerkeleyDB or RocksDB for on-disk sorted queues.
Hot portion (domains due for crawling in next 60 seconds): ~1M URLs = 80 MB. Fits in memory.

DNS cache effectiveness:

10B pages from ~200M unique domains
Average 50 pages per domain
DNS cache holds 10M entries (1 GB)
After initial cold start: cache hit rate ~70% (many pages per popular domain)
DNS lookups needed: 23K * 0.3 = ~7K/sec. A single Unbound resolver handles 50K/sec. Fine.

Failure Analysis

Component	Current capacity	At 10x (100B pages)	Breaks?	Fix
Crawler fleet (100)	23K pages/sec total	230K pages/sec needed	Yes	Scale to 1000 machines
Bloom filter	12 GB (10B URLs)	120 GB (100B URLs)	Yes	Partition across machines, or switch to Count-Min Sketch
Storage (S3/HDFS)	1 PB	10 PB	Maybe	S3 scales automatically; HDFS needs more nodes
DNS resolution	7K lookups/sec	70K lookups/sec	Maybe	More Unbound instances, larger cache
Frontier (disk)	8 TB	80 TB	Yes	Distributed RocksDB, partition by domain hash
Inter-machine messaging	Kafka for URL exchange	10x message volume	No	Kafka scales horizontally
robots.txt cache	200M domains cached	2B domains	Yes	Distributed cache (Redis Cluster)

The first bottleneck at 10x is the fleet size and Bloom filter. Both are linear scale-outs -- more machines, partitioned filter. The harder problem at 100B pages is the frontier: 80 TB of URLs to manage with priority and politeness constraints. At this scale, the frontier becomes its own distributed storage system (Google uses Bigtable for their crawler frontier).

The second bottleneck is bandwidth. At 230K pages/sec * 100 KB = 23 GB/sec. This requires careful network topology design and peering agreements with ISPs to avoid saturating upstream links.

What's Expected at Each Level

Aspect	Mid-Level	Senior	Staff+
URL frontier	BFS queue	Priority queue with politeness delay	Mercator two-level frontier (front + back queues), explains why
Deduplication	Hash set of visited URLs	Mentions Bloom filter, calculates size	SimHash for near-dedup, Bloom filter sizing math, FPR trade-off
DNS	Not discussed	"DNS is slow, use a cache"	70% of thread time, pre-fetching, dedicated resolvers
Politeness	"Wait between requests"	Per-domain delay, robots.txt	Crawl-delay, per-domain back queues, spider trap detection
URL normalization	Lowercase hostname	Remove fragments, default ports	All 9 techniques, tracking parameter removal
Scalability	"Add more machines"	Partition by domain, explains coordination	Consistent hashing for domain assignment, frontier distribution
Content processing	Parse HTML, extract links	Content dedup with SHA-256	SimHash for near-dedup, conditional HTTP requests for recrawl
Real-world reference	"Like Googlebot"	Mentions Mercator paper	IRLbot paper, DNS measurement data, Google's use of SimHash

The single most important signal at any level: do you understand that a web crawler is bottlenecked by DNS and politeness, not by HTTP download speed or HTML parsing? Mid-level candidates optimize the fetcher. Senior candidates optimize the frontier. Staff+ candidates explain why the two-level Mercator frontier is the only architecture that scales to billions of pages.

References from Our Courses

Bloom Filters — deduplication of already-crawled URLs
RabbitMQ, Kafka, and SQS — URL frontier queue management and prioritization
Delivery Guarantees — ensuring every discovered URL is eventually crawled

Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Web Crawler →