Skip to content

Design a Real-Time Geospatial Heatmap

TL;DR

A real-time geospatial heatmap aggregates millions of location events (ride requests, driver positions, delivery orders) into a colored map overlay that updates every 5-10 seconds. The naive approach writes every event to Redis and reads them all back for rendering -- at 1 million events per second, this saturates Redis. Flink pre-aggregation reduces Redis writes by 20x by grouping events into spatial-temporal buckets before storing them. H3 multi-resolution tiles let the system pre-compute heatmaps at different zoom levels (city overview at resolution 5, neighborhood detail at resolution 9). Hot partitions -- a single H3 cell in Manhattan receiving 100x the events of a rural cell -- are solved with CRDT G-Counters that allow each Flink worker to increment independently without coordination.

The System

A geospatial heatmap displays the density of events (ride requests, active drivers, food orders) on a map, using color intensity to show where activity is concentrated. Operations teams at Uber, Lyft, and DoorDash use heatmaps to monitor real-time supply and demand. Riders see surge pricing overlaid on the map. City planners use heatmaps of traffic density. Emergency services visualize incident density to allocate ambulances.

Uber's real-time heatmap processes location data from 1.25 million active drivers and millions of rider requests, updating every 5 seconds. The map shows "hot" zones (high demand, colored red) and "cold" zones (low demand, colored blue), enabling surge pricing decisions and driver incentivization. DoorDash uses heatmaps to identify areas with high order density but few drivers, triggering bonus pay to attract drivers to underserved areas. Google Maps' "busy" indicator for businesses is essentially a micro-heatmap powered by aggregated location data from Android phones.

Requirements

Functional

  • Event ingestion: Accept location events (lat, lng, event_type, timestamp) at high throughput
  • Spatial aggregation: Group events by geographic cell and compute density per cell
  • Multi-resolution rendering: Heatmap at city level (large cells, fewer details), neighborhood level (small cells, more detail), and street level (very small cells)
  • Temporal windowing: Show density over configurable time windows (last 5 minutes, last 30 minutes, last 1 hour)
  • Tile serving: Serve pre-computed heatmap tiles that map clients can overlay on a base map
  • Historical playback: Replay heatmap data for past time ranges (for analysis and planning)

Non-Functional

  • Ingestion throughput: 1 million events/sec from all sources (driver locations, ride requests, deliveries)
  • End-to-end latency: Event occurs -> heatmap updates within 5-10 seconds
  • Tile serving latency: Heatmap tile served to client in under 50ms (p99)
  • Freshness: Heatmap reflects the last 5 minutes of data by default
  • Spatial accuracy: H3 resolution 9 (170m cells) for street-level detail
  • Availability: 99.9% uptime for tile serving (operators rely on this during incidents)

Back-of-Envelope Math

Event volume:
  1M events/sec = 86.4B events/day
  Each event: lat (8B) + lng (8B) + type (1B) + timestamp (8B) + 
              metadata (25B) = ~50 bytes
  Raw throughput: 1M * 50 bytes = 50 MB/sec

Spatial cells (H3):
  Earth surface: 510M km^2
  H3 resolution 9: 0.11 km^2 per cell -> ~4.6B cells total
  Active cells (with at least 1 event in last 5 min): ~500K
  That's 0.01% of all cells. The map is very sparse.

Pre-aggregation (Flink):
  Without pre-aggregation:
    1M events/sec written to Redis = 1M Redis writes/sec
    Redis handles ~200K writes/sec per instance. Need 5 instances.

  With 10-second pre-aggregation windows:
    1M events in 10 seconds = 10M events
    Aggregated into 500K active cells = 500K aggregate records
    Written to Redis: 500K / 10 sec = 50K writes/sec
    That's 20x fewer writes. Single Redis instance handles this.

Tile serving:
  Map viewport at city level: ~1000 H3 cells visible
  Tile data: 1000 cells * 8 bytes (cell_id + count) = 8 KB per tile
  Tile requests: 100K active map clients * 1 tile request/5 sec = 20K/sec
  Bandwidth: 20K * 8 KB = 160 MB/sec (CDN-able)

Memory for active cells:
  500K cells * (cell_id 8B + count 4B + window_info 20B) = 16 MB
  Trivial. Fits in a small Redis instance.

The Naive Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Events      │────>│  API Server  │────>│  Redis       │
│  (locations) │     │              │     │  (geohash)   │
└──────────────┘     └──────────────┘     └──────────────┘
                                         ┌──────v───────┐
                                         │  Heatmap     │
                                         │  Renderer    │
                                         └──────────────┘

-- Every event:
GEOADD heatmap:5min lng lat event_id
EXPIRE heatmap:5min 300

-- Render heatmap:
events = GEORADIUS heatmap:5min center_lng center_lat 10km
for cell in grid:
    cell.density = count_events_in_cell(events, cell)
render_heatmap(grid)

Store every event in Redis GEO, then read them all back to render the heatmap. For 1,000 events/sec in a single city, this works.

Where Does This Break First?

At 1M events/sec, Redis gets 1M GEOADD writes per second. A single Redis instance handles ~200K/sec. Even with 5 instances, the GEORADIUS query that reads back all events in a 10km radius returns millions of points, taking seconds to serialize and transfer. The rendering step processes millions of individual events. Everything about this approach is O(N) where N is the number of raw events.

Where It Breaks

Problem 1: Writing every raw event to Redis is 20x more than necessary. The heatmap does not care about individual events. It cares about counts per cell. If 500 events occur in the same H3 cell in a 10-second window, you only need to write one count (500), not 500 individual records. Pre-aggregation in the streaming layer reduces writes from 1M/sec to ~50K/sec.

Problem 2: Reading raw events for rendering is O(N). To render a heatmap for a city, the naive approach reads millions of individual event coordinates from Redis, groups them by cell, and counts. This is O(N) per render, where N grows with event rate. Pre-aggregation makes rendering O(C) where C is the number of active cells (500K), independent of event rate.

Problem 3: Hot partitions in Manhattan. A single H3 cell in Midtown Manhattan might receive 10,000 events/sec while a cell in rural Wyoming receives 1/hour. If events are partitioned by H3 cell in Kafka, the Manhattan partition is overwhelmed while the Wyoming partition sits idle.

Problem 4: Zoom-level rendering is expensive. When the user zooms out from street level to city level, the heatmap should show larger, less granular cells. Computing this from resolution-9 cells on-the-fly requires aggregating thousands of small cells into each large cell -- for every tile request. Pre-computing at multiple zoom levels is essential.

The Real Design

┌──────────────────────────────────────────────────────────────┐
│                    Event Sources                              │
│  (driver GPS, ride requests, delivery orders)                │
└──────────────────────────┬───────────────────────────────────┘
                    ┌──────v───────┐
                    │    Kafka     │
                    │  (raw events,│
                    │   partitioned│
                    │   by H3 cell)│
                    └──────┬───────┘
                    ┌──────v───────────────────────────────────┐
                    │         Flink Pre-Aggregation            │
                    │  ┌─────────────────────────────────┐     │
                    │  │  Tumbling Window (10 seconds)   │     │
                    │  │  Group by: H3 cell + event_type │     │
                    │  │  Aggregate: count, sum, avg     │     │
                    │  │  Multi-resolution: res 5,7,9    │     │
                    │  └─────────────────────────────────┘     │
                    └──────┬───────────────────────────────────┘
                           │ 50K aggregates/sec (20x reduction)
              ┌────────────┼────────────────┐
              │            │                │
     ┌────────v──┐  ┌──────v──────┐  ┌──────v───────┐
     │  Redis    │  │  Kafka      │  │  ClickHouse  │
     │  (live    │  │  (aggregate │  │  (historical │
     │   state)  │  │   output)   │  │   playback)  │
     └────────┬──┘  └─────────────┘  └──────────────┘
     ┌────────v──────────────────────────────────────┐
     │              Tile Server                       │
     │  Reads pre-aggregated cells from Redis        │
     │  Renders heatmap tiles per zoom level         │
     │  Caches rendered tiles (5-second TTL)         │
     └────────────────┬──────────────────────────────┘
              ┌───────v──────┐
              │  CDN / Edge  │
              │  (cached     │
              │   tiles)     │
              └──────────────┘

This is the single most important optimization in the system. Instead of writing 1M raw events/sec to Redis, Flink aggregates them into 50K cell-count records per second.

# Flink pseudocode
@flink.process
def aggregate_events(events_stream):
    return (
        events_stream
        .key_by(lambda e: h3.latlng_to_cell(e.lat, e.lng, 9))
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .aggregate(
            CountAggregator(),     # count events per cell
            result_type=CellCount  # output: (cell_id, count, window_end)
        )
    )

class CellCount:
    cell_id: str       # H3 cell index
    count: int         # events in this cell in the 10-second window
    event_type: str    # "ride_request", "driver_location", etc.
    window_end: int    # timestamp of window end

Why tumbling windows (not sliding)? A tumbling window produces one output per window interval (every 10 seconds). A sliding window (e.g., 5-minute window sliding every 10 seconds) produces one output per slide interval but each output includes events from a 5-minute range, meaning events are counted in multiple window outputs. Tumbling windows are simpler and cheaper. The 5-minute heatmap is reconstructed at query time by summing the last 30 tumbling window outputs (30 * 10 seconds = 5 minutes).

Multi-resolution aggregation in the same Flink job:

@flink.process
def multi_resolution_aggregate(events_stream):
    for resolution in [5, 7, 9]:
        (events_stream
            .key_by(lambda e: h3.latlng_to_cell(e.lat, e.lng, resolution))
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            .aggregate(CountAggregator())
            .add_sink(redis_sink(f"heatmap:res{resolution}")))

Resolution 5 (city level): 5.16 km^2 cells. Few cells, low storage. Resolution 7 (district level): 0.74 km^2 cells. Medium cells. Resolution 9 (street level): 0.11 km^2 cells. Many cells, detailed.

By pre-computing at all three resolutions, the tile server does not need to aggregate small cells into large ones at query time. It just reads the appropriate resolution for the requested zoom level.

H3 Multi-Resolution Tiles

The map client requests tiles at a specific zoom level. Each zoom level maps to an H3 resolution:

Map zoom level → H3 resolution → Cell size
Zoom 1-6   (continent/country) → H3 res 3 → 12.4 km edge
Zoom 7-10  (state/city)       → H3 res 5 → 3.2 km edge
Zoom 11-13 (district)         → H3 res 7 → 1.2 km edge  
Zoom 14-16 (neighborhood)     → H3 res 8 → 0.46 km edge
Zoom 17+   (street)           → H3 res 9 → 0.17 km edge

Tile request flow:

Client requests: GET /tiles/heatmap?zoom=12&x=3042&y=6876&window=5m
  1. Map zoom 12 → H3 resolution 7
  2. Convert tile coordinates to bounding box (lat/lng)
  3. Find all H3 res-7 cells within bounding box (~200 cells)
  4. Read counts from Redis: HMGET heatmap:res7:window5m cell1 cell2 ... cell200
  5. Render heatmap overlay with color intensity proportional to count
  6. Return PNG tile (or GeoJSON for vector rendering)

5-minute window from tumbling windows: The 5-minute heatmap sums the last 30 tumbling window outputs (each covering 10 seconds). Redis stores counts per cell per window:

Key: heatmap:res7:{cell_id}
Field: {window_timestamp}
Value: {count}

-- Get 5-minute count for a cell:
HMGET heatmap:res7:8729a12 
  "1714000000" "1714000010" "1714000020" ... "1714000290"
-- Sum the 30 values

Alternatively: Maintain a running 5-minute sum per cell using sliding window counters. On each 10-second aggregation output:

def update_cell_count(cell_id, new_count, window_ts):
    # Add new window's count
    redis.hincrby(f"heatmap:{cell_id}", "total", new_count)
    redis.hset(f"heatmap:{cell_id}", f"w:{window_ts}", new_count)

    # Remove oldest window (> 5 minutes ago)
    old_window = window_ts - 300  # 300 seconds = 5 minutes
    old_count = redis.hget(f"heatmap:{cell_id}", f"w:{old_window}") or 0
    redis.hincrby(f"heatmap:{cell_id}", "total", -int(old_count))
    redis.hdel(f"heatmap:{cell_id}", f"w:{old_window}")

The "total" field always contains the 5-minute rolling count. Reading the heatmap is a single HGET per cell for the "total" field.

Hot Partition Mitigation with CRDT G-Counters

A G-Counter (Grow-Only Counter) is a CRDT (Conflict-Free Replicated Data Type) where each node maintains its own counter, and the global count is the sum of all node counters. Nodes can increment independently without coordination.

Why This Matters: Hot Cells

A Midtown Manhattan cell receives 10,000 events/sec. If all Flink workers try to increment the same Redis key, you get write contention (100+ workers competing for one key lock). With G-Counters:

Flink Worker 1: local_count[manhattan_cell] += 347 events in 10-sec window
Flink Worker 2: local_count[manhattan_cell] += 289 events in 10-sec window
Flink Worker 3: local_count[manhattan_cell] += 412 events in 10-sec window
...

Each worker writes to its OWN Redis key:
  HINCRBY heatmap:manhattan_cell:worker1 "total" 347
  HINCRBY heatmap:manhattan_cell:worker2 "total" 289
  HINCRBY heatmap:manhattan_cell:worker3 "total" 412

Read: sum all worker keys for the cell.

No contention. Each worker writes to its own key partition. The tile server sums all worker keys when reading.

In practice: With 10 Flink workers, a hot cell has 10 Redis keys instead of 1. The read cost increases from 1 HGET to 10 HGETs per hot cell. At 200 hot cells in a viewport and 10 workers: 2,000 HGETs per tile request. Pipelined: one round trip. Under 1ms.

Push Deltas, Not Full Tiles

The map client polls for tile updates every 5 seconds. A full tile response for 1,000 cells is ~8 KB. If nothing changed in 90% of cells, you are transferring 7.2 KB of unchanged data.

Delta compression: Track what changed since the last response. Send only the changed cells.

// Full tile (initial load):
{
  "cells": [
    {"cell": "8729a12", "count": 342},
    {"cell": "8729a15", "count": 18},
    ...
  ],
  "version": 1714000300
}

// Delta (subsequent updates):
{
  "changes": [
    {"cell": "8729a12", "count": 387},  // count changed
    {"cell": "8729a19", "count": 5}     // new cell appeared
  ],
  "removed": ["8729a22"],               // cell no longer has events
  "version": 1714000310
}

Delta size: If 10% of cells change per 10-second interval, delta is 800 bytes instead of 8 KB. 10x bandwidth reduction.

WebSocket for push: Instead of the client polling, the server pushes deltas over a WebSocket connection. Each connected client subscribes to a viewport (set of H3 cells). When the Flink aggregation produces new counts, the tile server computes deltas and pushes to subscribed clients.

Client: SUBSCRIBE viewport lat1,lng1,lat2,lng2 zoom=12
Server: pushes delta every 10 seconds for cells in viewport

This eliminates polling entirely. The server pushes only when data changes.

Deep Dives

Geospatial Heatmap — Geospatial Heatmap High-Level Design

Deep Dive 1: 5-10 Second End-to-End Latency

The latency budget from event occurrence to heatmap update:

Event occurs (GPS reading)           : T+0
Event reaches Kafka (network + produce): T+0.5s
Flink ingests event                    : T+1s
Flink window closes (10-second tumble) : T+1 to T+11s (depends on when in window)
Flink writes aggregate to Redis        : T+11.5s
Tile server reads from Redis           : T+12s (on next refresh)
Client receives update                 : T+12.5s (WebSocket push) or T+17s (5-sec poll)

Worst case: event arrives at the start of a 10-second window.
  Window closes after 10 seconds. Write + read + push = ~2 more seconds.
  Total: ~12 seconds.

Best case: event arrives at the end of a window.
  Window closes immediately. Total: ~2 seconds.

Average: ~7 seconds.

Reducing latency: Use smaller tumbling windows (5 seconds instead of 10). This doubles the number of window outputs (and Redis writes) but halves the worst-case latency.

5-second windows:
  Redis writes: 50K/5sec = 100K writes/sec (still manageable)
  Worst-case latency: ~7 seconds
  Average: ~4.5 seconds

The trade-off is always window size vs. write amplification vs. latency. For most heatmap use cases, 5-10 second latency is acceptable. If sub-second latency is needed (e.g., real-time gaming), use continuous aggregation (no windowing) with per-event updates to Redis -- but this eliminates the 20x write reduction.

Deep Dive 2: Zoom-Level-Aware Pre-Computation

When a user zooms from city level to street level, the heatmap must smoothly transition from large cells to small cells. Without pre-computation at multiple resolutions, the tile server would need to aggregate thousands of resolution-9 cells into each resolution-5 cell for the zoomed-out view.

Pre-computation in Flink:

The Flink job simultaneously aggregates at 3 resolutions. Each resolution writes to its own Redis keyspace:

heatmap:res5:{window_ts} -> { cell_id: count, cell_id: count, ... }
heatmap:res7:{window_ts} -> { cell_id: count, cell_id: count, ... }
heatmap:res9:{window_ts} -> { cell_id: count, cell_id: count, ... }

Memory overhead: Each event is counted at 3 resolutions. But the total number of cells decreases sharply with resolution:

Resolution 9: ~500K active cells (0.11 km^2 each)
Resolution 7: ~10K active cells (0.74 km^2 each, 49 res-9 cells per res-7)
Resolution 5: ~200 active cells (5.16 km^2 each, ~500 res-9 cells per res-5)

Total active cells: 510.2K across all resolutions
Memory: 510K * 32 bytes = 16.3 MB

The overhead of multi-resolution pre-computation is negligible because higher resolutions have exponentially fewer cells.

Smooth zoom transitions: When the user zooms from level 10 (resolution 5) to level 12 (resolution 7), the client can:

Option A: Hard switch -- replace resolution-5 tiles with resolution-7 tiles. Simple but jarring.

Option B: Cross-fade -- load both resolution levels simultaneously and fade from one to the other. Smoother but requires loading 2x the tiles briefly.

Option C: Progressive loading -- show resolution-5 tiles immediately, then overlay resolution-7 tiles as they load. Best UX, most complex client-side implementation.

Deep Dive 3: Historical Playback and Analysis

Operations teams want to replay yesterday's heatmap to analyze patterns (e.g., "Where were we short on drivers at 6 PM?"). This requires storing aggregated data beyond the 5-minute live window.

ClickHouse for historical storage:

Flink writes aggregated cell counts to both Redis (live) and ClickHouse (historical).

CREATE TABLE heatmap_history (
    cell_id String,
    resolution UInt8,
    event_type String,
    window_start DateTime,
    count UInt32
)
ENGINE = MergeTree()
PARTITION BY toDate(window_start)
ORDER BY (resolution, cell_id, window_start);

Playback query: "Show the heatmap for resolution 7 on April 17, 2025 from 5 PM to 6 PM, 10-second granularity":

SELECT cell_id, window_start, sum(count) as total
FROM heatmap_history
WHERE resolution = 7
  AND window_start BETWEEN '2025-04-17 17:00:00' AND '2025-04-17 18:00:00'
  AND event_type = 'ride_request'
GROUP BY cell_id, window_start
ORDER BY window_start;

ClickHouse returns 360 time steps * 10K cells = 3.6M rows in under 1 second (column-oriented, compressed, partitioned by date).

Time-lapse animation: The client receives 360 frames of heatmap data and animates through them at 10x speed (1 hour replayed in 36 seconds). Each frame is a set of (cell_id, count) pairs rendered as a heatmap overlay.

Alternative Designs

Alternative 1: GeoHash Grid with Redis Sorted Sets

Use geohash cells instead of H3. Store counts in Redis sorted sets (score = count, member = geohash cell). ZRANGEBYSCORE to find high-density cells.

Alternative 2: PostGIS with Materialized Views

Store events in PostGIS with a spatial index. Materialized views compute density per grid cell. Refresh every 10 seconds.

Alternative 3: Client-Side Aggregation

Send raw events to the client via WebSocket. The client renders the heatmap in the browser using a library like heatmap.js or deck.gl. No server-side aggregation.

Aspect Flink + H3 + Redis GeoHash + Redis PostGIS + MatView Client-Side
Ingestion throughput 1M+ events/sec 200K events/sec 10K events/sec Unlimited (client)
End-to-end latency 5-10 seconds 1-5 seconds 10-30 seconds < 1 second
Server compute cost Flink cluster Redis only PostgreSQL Zero (client does it)
Multi-resolution Built into H3 Variable geohash length Custom grid Client zoom handling
Scale (events/sec) 1M+ 200K 10K ~10K per client
Client bandwidth ~8 KB per tile ~8 KB per tile ~8 KB per tile 50 KB/sec raw events
Historical playback ClickHouse (fast) Redis (limited retention) PostgreSQL (slow) Not supported

Flink + H3 + Redis for production heatmaps at Uber scale. Client-side rendering for prototyping or small-scale applications (< 10K events/sec) where sub-second latency is needed. PostGIS for offline analysis (not real-time). GeoHash + Redis for medium-scale systems where H3 complexity is not justified.

Scaling Math Verification

Ingestion (1M events/sec):

  • Kafka: 32 partitions, 31.25K msgs/partition/sec. Well within Kafka's limits.
  • Flink: 32 workers, each processing 31.25K events/sec. At ~100 nanoseconds per event (H3 hash + count increment): 0.3% CPU per worker.

Pre-aggregation output (50K writes/sec to Redis):

  • Redis HMSET/HINCRBY: 50K ops/sec. Single Redis instance at 25% capacity.
  • With 3 resolutions: 50K * 3 = 150K ops/sec. Still under 200K limit. Or use Redis Cluster.

Tile serving (20K requests/sec):

  • Each request: 200 Redis reads (one per cell in viewport), pipelined = 1 round trip.
  • Total Redis reads: 20K * 200 = 4M reads/sec. Need 2 Redis read replicas or a cluster.
  • Tile rendering: 200 cells -> color mapping -> PNG encoding = ~5ms. 20K * 5ms = 100 seconds of CPU/sec. Need ~100 CPU cores = 12 tile server instances (8 cores each).

CDN effectiveness:

  • Same tile requested by 100 users viewing the same area: CDN serves 1, caches for 99.
  • CDN hit rate: ~80% (many users viewing overlapping areas). Actual tile server load: 4K requests/sec. 2-3 instances sufficient.

Storage (ClickHouse):

  • 50K aggregates/sec * 32 bytes = 1.6 MB/sec = 138 GB/day compressed in ClickHouse (~20 GB with column compression).
  • 1-year retention: 7.3 TB. Manageable on a 3-node ClickHouse cluster.

Failure Analysis

Component Current capacity At 10x (10M events/sec) Breaks? Fix
Kafka (32 partitions) 31K msgs/partition/sec 312K/partition Yes Scale to 320 partitions
Flink workers (32) 31K events/worker/sec 312K/worker Maybe Scale to 100 workers, each handling 100K
Redis writes 150K ops/sec 1.5M ops/sec Yes Redis Cluster with 8 shards
Redis reads (tiles) 4M ops/sec 40M ops/sec Yes Redis Cluster + read replicas
Tile servers (12) 20K tiles/sec 200K tiles/sec Yes Scale to 120 tile servers + heavier CDN caching
ClickHouse storage 20 GB/day 200 GB/day No Larger cluster, TTL-based retention
Active cells (memory) 16 MB 160 MB No --

The first bottleneck at 10x is Redis throughput (both writes and reads). The fix is Redis Cluster with geographic sharding (cells in the same region go to the same shard, minimizing cross-shard queries for tile rendering). The second bottleneck is tile serving compute. CDN caching becomes critical at this scale -- with 80% CDN hit rate, the tile servers handle only 20% of actual requests.

At 100x (100M events/sec), the pre-aggregation in Flink becomes the primary cost center. Consider two-stage aggregation: per-machine local aggregation (no network) followed by cross-machine Flink aggregation. This reduces Kafka throughput from 100M/sec to ~1M/sec (one aggregate per machine per window).

What's Expected at Each Level

Aspect Mid-Level Senior Staff+
Aggregation strategy Write every event, query on read Pre-aggregate before storing Flink tumbling windows, 20x write reduction math
Spatial indexing Geohash grid H3 hexagonal cells, explains why hexagons Multi-resolution pre-computation, zoom-level mapping
Hot partition Not mentioned Mentions uneven event distribution CRDT G-Counters, per-worker counters, read-time summation
Client communication Polling for full tiles WebSocket for push updates Delta compression, viewport subscription
Latency analysis "Near real-time" Explains window size vs latency trade-off Full latency budget breakdown, 5-10 sec end-to-end analysis
Historical data Not mentioned "Store aggregates for playback" ClickHouse for historical, time-lapse animation, retention tiers
Scaling "Add more servers" CDN caching for tiles Two-stage aggregation, CDN hit rate math, geographic sharding
Real-world reference "Like Uber's heatmap" Mentions H3 for geospatial Flink pre-aggregation 20x reduction, CRDT for hot cells

The single most important signal at any level: do you understand that the naive approach (write every raw event, read them all back) is 20x more expensive than the pre-aggregated approach? The 20x write reduction from Flink pre-aggregation is the defining insight of this design. Everything else -- H3 multi-resolution, CRDT counters, delta compression -- is important but secondary to the fundamental decision to aggregate in-stream before storing.


References from Our Courses


Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Real-Time Geospatial Heatmap →