Skip to content

Design a Proximity Search Service

TL;DR

The entire business dataset fits in memory. That is the insight most candidates miss. 10 million businesses at 1 KB each is 10 GB -- a single Postgres instance with read replicas handles this without breaking a sweat. The real question is not "how do you shard" but "which geospatial index do you use and why." You should be able to explain five options (geohash, quadtree, R-tree, S2, H3), pick one, and defend your choice. If the interviewer bans PostGIS, build a quadtree in application memory. If they allow PostGIS, use the R-tree GiST index and spend your time on the "open now" filter, review aggregation, and search relevance -- the parts that actually make Yelp useful.


The System

Think Yelp. A user opens the app, their phone sends their GPS coordinates, and they see a list of nearby restaurants sorted by distance, rating, and relevance. They can filter by category ("pizza"), price range, rating, and whether the place is open right now.

This is a deceptively simple system. The data volume is modest (10M businesses), the write rate is near-zero (businesses update their info once a month on average), and the read pattern is straightforward (proximity search with filters). The complexity hides in two places:

  1. Geospatial indexing: Traditional B-tree indexes are one-dimensional. Latitude and longitude are two-dimensional. A composite index on (lat, lng) can range-scan one dimension but must linear-scan the other. You need a spatial data structure.

  2. The "open now" filter: This requires evaluating a function of the current time against structured business hours stored in the business's local timezone. A restaurant in Tokyo at 9 PM is not "open now" for a user querying from New York at 8 AM. Every candidate forgets this.


Requirements

Functional Requirements

Requirement Details
Proximity search Given (lat, lng, radius), return businesses within the radius
Category filter Filter by business type: restaurant, coffee shop, gym, etc.
Text search Search by business name or keywords in description
Rating filter Filter by minimum star rating (1-5)
"Open now" filter Only show businesses currently open at their local time
Business detail page View business info, photos, hours, reviews
Write a review Authenticated users can submit reviews with text and rating

Non-Functional Requirements

Requirement Target
Search latency (p50) < 200 ms
Search latency (p99) < 500 ms
Availability 99.99% (users depend on this for real-time decisions -- "where should I eat right now?")
Read:write ratio ~5000:1 (overwhelmingly read-heavy)
Scale 100M DAU, 10M businesses, 500M searches/day

Back-of-Envelope Math

Businesses:              10 million
Business record size:    ~1 KB (name, address, lat/lng, category, hours, rating, phone)
Total business data:     10M * 1 KB = 10 GB (FITS IN MEMORY)

Reviews:                 10M businesses * ~100 reviews avg = 1 billion reviews
Review size:             ~1 KB (text, rating, user_id, timestamp)
Total review data:       ~1 TB
Reviews per day:         ~100K new reviews
Review write QPS:        100K / 86400 = ~1.2/sec (trivial)

Searches per day:        500 million
Search QPS:              500M / 86400 = ~5,800 average
Peak search QPS:         ~20K (lunch and dinner hours, 3-4x average)

Geospatial index size:   10M points * 50 bytes = 500 MB (trivially fits in memory)

Search result response:  20 businesses * 500 bytes = 10 KB
Search bandwidth:        20K QPS * 10 KB = 200 MB/sec peak (manageable)

The critical realization: 10 GB of business data fits on a single machine. A well-indexed Postgres instance with 64 GB of RAM can hold the entire dataset in memory and serve 20K QPS with read replicas. You do NOT need to shard the business data. Candidates who jump straight to sharding are solving a problem that does not exist.


Naive Design

SELECT *
FROM businesses
WHERE latitude BETWEEN ? AND ?
  AND longitude BETWEEN ? AND ?
  AND category = ?
ORDER BY distance(lat, lng, user_lat, user_lng)
LIMIT 20;

Why this breaks:

  1. B-tree indexes are one-dimensional. A composite index on (latitude, longitude) efficiently range-scans latitude, then linearly scans all matching rows to check longitude. For a 5 km radius in a dense city, the latitude range might match 50,000 businesses, and you scan all of them to filter by longitude. This is O(sqrt(N)) on average, not O(log N).

  2. Earth is not flat. A BETWEEN query on latitude and longitude defines a rectangle in coordinate space. But 1 degree of longitude at the equator is 111 km, while at 60 degrees north it is only 55.5 km. A bounding box that covers 5 km at the equator covers 10 km at high latitudes. Simple BETWEEN queries produce inconsistent results at different latitudes.

  3. "Within 5 km" is a circle, not a rectangle. A bounding-box query returns ~27% more area than the actual circle (the corners are ~41% farther from the center than the edges). You always need a post-filter pass with the Haversine formula to remove false positives from the corners.


Where It Breaks

Problem 1: No Spatial Index

The BETWEEN query does not use a spatial index. Even with a composite B-tree on (lat, lng), query performance degrades as the bounding box encompasses more rows. For the generic query "restaurants near me" in Manhattan, the bounding box contains thousands of businesses.

Problem 2: Distance Computation is Expensive

The Haversine formula (great-circle distance on a sphere) involves sin, cos, arcsin, and sqrt operations. Computing it for every candidate in a large bounding box is CPU-expensive. At 20K QPS with 5,000 candidates per query, that is 100 million Haversine computations per second.

Problem 3: No Ranking Beyond Distance

Sorting purely by distance gives the wrong results. A 4.8-star restaurant 800m away should rank higher than a 2.1-star restaurant 200m away. The scoring function must combine distance, rating, review count, and relevance to the search terms.

Problem 4: "Open Now" is Missing

The most common filter in local search, and the hardest to index. Business hours are stored as structured data with timezone-dependent semantics. A simple B-tree index cannot answer "is this business open at the current time in its local timezone?"


Real Design

Proximity Search — Proximity Search High-Level Design

Architecture

          ┌──────────┐
          │  Client   │
          │  (lat,lng)│
          └────┬─────┘
          ┌────┴─────────────────┐
          │   API Gateway / LB    │
          └────┬─────────────────┘
          ┌────┴──────────────┐
          │  Search Service    │
          │  (stateless)       │
          └────┬───────┬──────┘
               │       │
     ┌─────────┴──┐ ┌──┴──────────┐
     │ PostgreSQL  │ │  Redis       │
     │ Primary +   │ │  (results    │
     │ Read Replicas│ │   cache)    │
     └─────────────┘ └─────────────┘

This is deliberately simple. The business data fits in a single Postgres instance. Read replicas handle the read throughput. A Redis cache layer handles repeated searches (same location, same filters). There is no Elasticsearch, no Kafka, no microservice mesh. The complexity is in the geospatial index and the query logic, not the architecture.

The Five Geospatial Index Options

This is the section that defines your level. A mid-level engineer says "use geohash." A senior engineer compares geohash and quadtree. A staff engineer walks through five options and picks one with a clear rationale.

Option 1: Geohash

Encode (lat, lng) into a base-32 string by interleaving latitude and longitude bits. Points sharing a geohash prefix are in the same spatial cell.

Precision table:

Characters Cell Width Cell Height Use Case
4 ~39 km ~20 km Regional search
5 ~5 km ~5 km City neighborhood
6 ~1.2 km ~600 m Walking distance
7 ~153 m ~153 m Block-level

For "restaurants within 5 km": Use precision 5 (~5 km cells). Query the user's cell plus 8 neighbors (3x3 grid). Post-filter with Haversine distance.

Implementation: Add a geohash VARCHAR(12) column to the businesses table. Create a B-tree index. Query with prefix matching: WHERE geohash LIKE '9q8yy%' OR geohash LIKE '9q8yz%' OR ... (9 cells total).

The edge problem: Two restaurants 10 meters apart can have completely different geohash prefixes if they straddle a cell boundary. The 9-cell query solves this but retrieves ~9x more candidates than necessary.

Verdict: Simple, works in any database, easy to explain. Good default for interviews.

Option 2: Quadtree (In-Memory)

A PR (Point-Region) quadtree recursively subdivides space into four quadrants. Subdivision stops when a leaf contains fewer than K points (e.g., K=100).

Why it is better than geohash: It adapts to density. Manhattan gets subdivided to tiny cells (~50m) because there are 25,000 restaurants. Rural Wyoming stays at coarse cells (~100km) because there are 50 businesses. Query performance is roughly uniform regardless of density.

Construction: Insert each business into the tree. At 10M businesses with max 100 per leaf, the tree has ~200K-500K internal nodes and uses ~50-100 MB of RAM.

Range query: Start at root. For each node, check if its bounding box intersects the search circle. If not, prune the entire subtree. If it is a leaf, check each point against the Haversine distance. This is O(log N + K) where K is the result count.

The downside: Quadtrees are in-memory-only. You build and maintain them in application code. They are not natively supported by databases (PostGIS uses R-trees, not quadtrees). But at 10M businesses, in-memory is fine.

Verdict: Best answer if the interviewer bans PostGIS and ES. Shows understanding of adaptive spatial indexing.

Option 3: R-tree (PostGIS)

R-trees are balanced trees where each node contains a set of minimum bounding rectangles (MBRs). PostGIS implements R*-trees via the GiST index.

CREATE EXTENSION IF NOT EXISTS postgis;
ALTER TABLE businesses ADD COLUMN geom geometry(Point, 4326);
CREATE INDEX idx_biz_geom ON businesses USING GIST (geom);

-- Find businesses within 5 km
SELECT *, ST_Distance(geom::geography,
    ST_MakePoint(-122.4194, 37.7749)::geography) AS dist
FROM businesses
WHERE ST_DWithin(geom::geography,
    ST_MakePoint(-122.4194, 37.7749)::geography, 5000)
ORDER BY dist
LIMIT 20;

Zero application code for geospatial logic. PostGIS handles the R-tree index, the Haversine calculation, and the distance filter. It supports kNN (nearest-neighbor), polygon queries, and arbitrary geometry operations.

Verdict: If the interviewer allows Postgres extensions, this is the staff-level answer. It is production-ready, battle-tested, and eliminates an entire class of bugs. Then spend your time on the harder parts: "open now" filtering, ranking, and caching.

Option 4: Google S2

S2 projects the Earth onto a cube and uses Hilbert curves to map 2D coordinates to 1D cell IDs. Unlike geohash (Z-order curves), Hilbert curves have better spatial locality -- consecutive points on the curve are closer together geographically.

Killer feature: Region coverings. Given a circle (center + radius), S2 computes a tight set of cells at various levels that approximate the circle. Minimal false positives compared to geohash's 9-cell approach.

Who uses it: Google Maps, CockroachDB, MongoDB (internally for 2dsphere indexes), Foursquare.

For Yelp: Overkill for 10M businesses on a single Postgres instance. S2 shines when you need precise spherical geometry at massive scale (billions of points) or when sharding by geography. Mention it to show depth, do not lead with it.

Option 5: Uber H3

H3 partitions Earth's surface into hexagons. Every hexagon has exactly 6 neighbors at equal distances (squares have 8 neighbors at two different distances). This makes radius queries natural: kRing(cell, k) returns all hexagons within k rings.

Who uses it: Uber (ride matching, surge pricing), DoorDash (delivery zones).

For Yelp: Hexagons are better for moving objects (drivers, delivery couriers) than for static businesses. H3 is not the best fit here, but mentioning it shows breadth.

Recommendation for the Interview

Situation Use This
"Use any technology" PostGIS with R-tree (zero application code)
"No external extensions" Quadtree in application memory
"Keep it simple" Geohash column with B-tree index
"Show me depth" Start with geohash, then explain why S2 is better for non-rectangular queries, reference Google Maps

The "Open Now" Filter

This is the most commonly forgotten filter and the one most interviewers will ask about if you do not bring it up.

The problem: Business hours are stored per-business in the business's local timezone. "Open now" means open at the current time in the business's timezone, not the user's timezone.

Storage:

{
  "hours": {
    "monday": [{"open": "09:00", "close": "22:00"}],
    "friday": [{"open": "09:00", "close": "02:00"}],
    "saturday": [{"open": "10:00", "close": "14:00"},
                 {"open": "17:00", "close": "23:00"}]
  },
  "timezone": "America/Los_Angeles"
}

Edge cases:

  1. Crosses midnight: Friday hours 09:00-02:00 means open until 2 AM Saturday morning.
  2. Split hours: Saturday has lunch (10-14) and dinner (17-23) as separate intervals.
  3. Holiday hours: Special overrides for holidays (closed on Christmas, reduced hours on Thanksgiving).
  4. Temporary closures: "Closed for renovation until March 15."

Why this cannot be efficiently indexed: The "open now" check requires evaluating a function of the current time against structured data. The current time changes every second. You cannot pre-index "is open at 3:47 PM Pacific on a Thursday."

Solution: Apply "open now" as a post-filter after the geospatial index and category filters have reduced the candidate set. The geospatial filter reduces 10M businesses to ~1K-10K candidates. The category filter reduces to ~200-2K. Running the "open now" check on 2K businesses is trivial -- it is just a timezone conversion and a range check per business, which completes in microseconds.

Pre-computation optimization: For each business, compute a boolean is_open flag every minute via a background cron job. Store it as a column. Now "open now" is just a WHERE is_open = true filter with a B-tree index. The 1-minute staleness is acceptable -- nobody cares if a restaurant that closed 30 seconds ago still appears in results.

Search Ranking

Do not sort by distance alone. The ranking function should combine:

score = w1 * distance_score(dist)
      + w2 * rating_score(avg_rating, num_reviews)
      + w3 * text_relevance(query, name, description)
      + w4 * category_match(query_category, biz_category)

Where:

  • distance_score: Inverse of distance, capped. 0 km = 1.0, 5 km = 0.5, 10 km = 0.2.
  • rating_score: Bayesian average of rating, weighted by review count. A 4.5-star business with 500 reviews ranks higher than a 5.0-star business with 2 reviews. Formula: (R * v + C * m) / (v + m) where R = average rating, v = number of votes, C = global average rating, m = minimum votes threshold (e.g., 10).
  • text_relevance: BM25 or trigram similarity between the query and the business name/description.
  • category_match: 1.0 if the query matches the business category, 0.5 for related categories, 0.0 otherwise.

Caching Strategy

Proximity searches are highly cacheable because:

  1. Users in the same area search for similar things.
  2. Business data changes rarely (once a month on average).
  3. The same (location, filters) query returns the same results for minutes.

Cache key: Quantize the location to a geohash prefix (precision 6, ~1.2 km cells). The cache key is geohash6:category:filters_hash. All users in the same ~1 km area querying for the same category hit the same cache entry.

Cache layer: Redis sorted set per (geohash6, category) with business IDs scored by ranking. ZREVRANGE returns the top-K. TTL of 5-15 minutes.

Cache invalidation: When a business updates its info, review, or hours, invalidate the cache entries for the geohash cells that contain it (the business's own cell + 8 neighbors). CDC from Postgres to a cache invalidation worker handles this. At ~1 review/sec write rate, cache invalidation is trivial.


Deep Dives

Proximity Search — Proximity Search Geohash

Deep Dive 1: The Geohash Edge Problem and Why Quadtrees Avoid It

The geohash edge problem is the most common gotcha in geospatial interviews.

The problem: Two restaurants 10 meters apart can have entirely different geohash prefixes. Point A at (45.0001, 90.0001) and point B at (44.9999, 89.9999) straddle both a latitude and longitude bisection boundary. Their geohash strings share no common prefix.

Why it happens: Geohash uses Z-order (Morton) space-filling curves. Z-order curves have "jumps" at certain boundaries where spatially adjacent cells are far apart in the linear ordering. Two cells that touch at corners can be maximally far apart in the Z-order curve.

Standard mitigation: Query the target cell PLUS all 8 neighbors (a 3x3 grid). This guarantees you capture all nearby points regardless of cell boundaries.

Why quadtrees do not have this problem: A quadtree range query traverses the tree from the root. At each node, it checks whether the node's bounding box intersects the search circle. It never relies on prefix matching -- it does geometric intersection tests. A point 10 meters from the query center is always in an intersecting subtree, regardless of which cell boundary it straddles.

Why S2 handles this better than geohash: S2 uses Hilbert curves instead of Z-order curves. Hilbert curves have better spatial locality -- the maximum distance between consecutive curve points is sqrt(2) cell widths, compared to (N/2) cell widths for Z-order. This means S2's cell boundary discontinuities are less severe.

Deep Dive 2: Why Sharding is Unnecessary (And When It Becomes Necessary)

At 10M businesses and 10 GB of data, a single Postgres primary with 3 read replicas handles everything:

Single Postgres read replica capacity: ~10K simple read QPS
Peak search QPS:                        ~20K
Replicas needed:                        20K / 10K = 2 (add a third for headroom)
Total machines for database:            1 primary + 3 replicas = 4 machines

When sharding becomes necessary:

Scale Change Impact Solution
100M businesses (10x) 100 GB data, still fits in memory No change needed. Postgres handles it.
1B businesses (100x) 1 TB data. Does not fit in memory. Geoshard by region. US-East, US-West, EU, APAC each get their own Postgres cluster.
10B data points (1000x) 10 TB data. Beyond single-machine Postgres. Elasticsearch cluster with geospatial sharding. Each shard covers a geographic region.
100K+ write QPS Primary cannot keep up. Write sharding by region. Each region's primary handles only local writes.

The key message: solve for the scale you have, not the scale you imagine. 10M businesses is not a hard problem. The interviewer wants to see you recognize this and focus your time on the interesting parts (geospatial indexing, ranking, "open now") rather than over-architecting a distributed system.

Deep Dive 3: Review Aggregation Without Locking

When a new review arrives:

  1. Insert the review row.
  2. Update the business's avg_rating and num_reviews.

The naive approach: UPDATE businesses SET avg_rating = (SELECT AVG(rating) FROM reviews WHERE business_id = ?) WHERE id = ?. This requires reading all reviews for the business, computing the average, and writing back. Under concurrent reviews, this is a read-write race.

The better approach: Store total_rating (sum of all ratings) and num_reviews as separate columns. Use atomic increments:

UPDATE businesses
SET total_rating = total_rating + NEW_RATING,
    num_reviews = num_reviews + 1,
    avg_rating = (total_rating + NEW_RATING)::numeric / (num_reviews + 1)
WHERE id = BUSINESS_ID;

This is a single atomic UPDATE. No SELECT, no race condition, no locking beyond Postgres's row-level lock (which is held for microseconds).

Do you need OCC (optimistic concurrency control) here? No. For rating aggregation (avg_rating, total_reviews), you don't need OCC. A simple UPDATE businesses SET avg_rating = (avg_rating * total_reviews + :new_rating) / (total_reviews + 1), total_reviews = total_reviews + 1 WHERE id = :id is atomic in PostgreSQL under Read Committed isolation. Two concurrent reviews both succeed -- the second UPDATE waits for the first to commit, then reads the updated values. OCC adds complexity without benefit here.


Alternative Designs

Alternative 1: Elasticsearch for Everything

Use ES as both the geospatial index and the text search index. A single bool query combines geo_distance, text match, category filter, and rating filter.

Pros: One query, one service, supports BM25 text relevance, function_score for custom ranking.

Cons: ES is not a primary data store (no ACID, potential data loss under partitions). Requires CDC from the primary DB to keep ES in sync. Adds a service to the architecture. For 10M businesses, this is over-engineering.

When to use: If the text search component is as important as the geo component (e.g., the user frequently searches by business name or menu items). If the business count grows to 100M+. If you need learning-to-rank models for relevance.

Alternative 2: Redis GEO

Redis has built-in geospatial support: GEOADD, GEOSEARCH, GEODIST. Under the hood, it uses sorted sets with geohash-encoded scores.

Pros: Sub-millisecond latency. Simple API. Perfect for "find N nearest businesses."

Cons: All data in memory (fine for 10M businesses = ~500 MB). No text search. No complex filters beyond radius. Cannot combine geo + text + category in a single query.

When to use: If the only query pattern is "nearest N businesses of type X" with no text search, no complex ranking. Good for a read-through cache layer in front of Postgres.

Alternative 3: In-Memory Quadtree + PostgreSQL Hybrid

Build a quadtree in application memory containing (business_id, lat, lng, category, is_open). Use the quadtree for the geospatial + category filter (fast, in-memory). Then fetch full business details from Postgres for only the top-20 results.

Pros: Fastest possible geospatial queries. The quadtree fits in ~500 MB. No PostGIS dependency.

Cons: Must rebuild the quadtree on application restart. Must handle quadtree updates when businesses change (though at ~1 update/sec, this is trivial). Adds application complexity.

When to use: When PostGIS is not available and you need sub-millisecond geospatial queries.


Scaling Math

Database Sizing

Business data:          10M * 1 KB = 10 GB
Geospatial index:       10M * 50 bytes = 500 MB (GiST index)
Text search index:      10M * 200 bytes = 2 GB (GIN trigram index)
Category index:         10M * 20 bytes = 200 MB
Total DB size:          ~15 GB (fits in 64 GB RAM instance with room to spare)

Review data:            1B * 1 KB = 1 TB
Review index (by biz):  1B * 20 bytes = 20 GB
Total review storage:   ~1.02 TB (needs larger storage, but still single instance)

Query Path

Geospatial filter:      10M -> ~5K candidates (within 5 km in a city)
Category filter:        5K -> ~1K (restaurants only)
"Open now" filter:      1K -> ~700 (70% are open during lunch)
Text search + ranking:  700 -> top 20 results
Haversine post-filter:  ~700 distance calculations = ~70 microseconds

Total query time:       ~5-10 ms (cache miss), <1 ms (cache hit)

Cache

Redis memory:
  10 regions * 25 categories * top-500 per (region, category) * 100 bytes
  = 10 * 25 * 500 * 100 = 125 MB
  With 10 geohash cells per region: 1.25 GB
  Total Redis: ~2 GB (single instance)

Cache hit rate:         ~80-90% (similar queries from users in same area)
Backend QPS after cache: 20K * 0.15 = ~3K QPS (easily handled by 1-2 replicas)

Failure Analysis

Failure Impact Mitigation
Postgres primary goes down No writes (reviews, business updates). Reads continue from replicas. Automated failover. Replica promoted to primary within 30 seconds.
Read replica goes down Reduced read capacity. Other replicas absorb load. Auto-scaling group replaces the replica. Load balancer routes around failed replica.
Redis cache goes down All queries hit Postgres directly. Latency increases from <1ms to 5-10ms. QPS spikes on replicas. Redis Sentinel for automatic failover. Degrade gracefully -- Postgres can handle the load, just slower.
GPS drift (user's location is inaccurate) Results are for the wrong area. Snap to known locations (address, POI). Allow manual location correction. Show map with pin for user to verify location.
Stale business hours "Open now" filter shows closed businesses or hides open ones. Crowdsourced corrections ("Suggest an edit"). Background verification via Google Maps API. Accept 5-10% error rate -- users understand hours data is approximate.
Hot region (Times Square at New Year's) Thousands of queries for the same small area in seconds. Cache absorbs the load. Geohash-based cache key means all queries for Times Square hit the same cache entry.
Review spam Fake reviews inflate/deflate ratings. Fraud detection pipeline. Rate-limit reviews per user. Require verified purchase or check-in. Flag reviews with anomalous patterns for human review.

Level Expectations

Level What the Interviewer Expects
Mid (L4) Use lat/lng bounding box with post-filter. Mention geohash at a high level. Basic CRUD API. Postgres schema. This passes but does not impress.
Senior (L5) Explain geohash AND quadtree, compare their strengths. Use PostGIS or build an in-memory quadtree. Discuss the edge/boundary problem. Implement the "open now" filter. Ranking beyond just distance (rating, relevance). Caching with geohash-based keys. Show that sharding is unnecessary at this scale.
Staff (L6) Walk through all five geospatial options (geohash, quadtree, R-tree, S2, H3) and justify the choice. Explain Hilbert curves vs Z-order curves. Discuss the filter cascade (geo first, then category, then "open now", then text). Bayesian average for review scores. Pre-computed is_open flag via background job. Quantified scaling math showing sharding is unnecessary. Reference real systems (Yelp, Google Maps, Foursquare) and their technology choices.

References from Our Courses


Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Proximity Search Service →