Design a Fitness Activity Tracker
TL;DR
Build Strava — record GPS activities, process noisy sensor data into clean routes, power segment leaderboards, and sync everything offline-first.
- The defining requirement: Offline-first. Runners lose signal on trails. The app records with zero connectivity and uploads later.
- The hard part: Raw GPS is noisy and wasteful. A 4-stage pipeline (Kalman → Haversine → RDP → polyline encoding) compresses 81 KB of raw GPS down to 500 bytes — a 162x reduction.
- The scaling story: Segment leaderboards evolved from SQL → Redis sorted sets → Cassandra + Kafka as Strava grew from thousands to 120 million users.
- Scale: 5.7M activities/day, 30M segments, 120M registered users.
The System
Strava. A runner opens the app, taps "Start," runs for 45 minutes, and taps "Stop." The app has recorded 2,700 GPS points (one per second), tracked elevation changes, calculated pace per mile, and detected that the runner crossed three popular segments (predefined routes that all users compete on). The activity is uploaded, and the runner sees their rank on each segment leaderboard. Their friends see the activity in their feed with a route map.
Why is this a system design problem? Because GPS data is noisy, activities happen offline, leaderboards require real-time global ranking, and 120 million users generate 40 million activities per week. The sensor processing pipeline is unique to fitness tracking -- you will not find it in a web application textbook. The offline-first requirement fundamentally changes the data model: every write must be idempotent because the client does not know if a previous upload succeeded. And the leaderboard system must handle both historical data (millions of past efforts on a segment) and real-time updates (new efforts arriving at 500/sec during peak hours).
Requirements
Functional Requirements
| Requirement | Details |
|---|---|
| Activity recording | Record GPS coordinates, heart rate, cadence, elevation during an activity. Works fully offline. |
| Activity upload | Upload recorded activity when connectivity is available. Idempotent. |
| GPS processing | Clean noisy GPS data. Calculate distance, pace, elevation gain. Detect pauses. |
| Route display | Show activity route on a map with color-coded pace/elevation. |
| Segment leaderboards | Detect segment matches in an activity. Rank users on segment leaderboards. |
| Social feed | Show friends' activities with summary and route map. |
| Privacy zones | Hide start/end of activities near home/work within a configurable radius. |
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Offline recording | Full activity recording with zero connectivity. |
| Upload latency | < 30 seconds from upload start to activity visible in feed. |
| GPS recording frequency | 1 Hz (1 point/second) during activity. |
| Segment detection latency | < 60 seconds after upload. |
| Leaderboard query latency | < 200 ms. |
| Scale | 120M registered users, 10M WAU, 40M activities/week, 5.7M activities/day. |
Back-of-Envelope Math
Activities per day: 5.7 million (Strava's reported number)
Activities per second: ~66 avg, ~200 peak (Sunday morning in US + evening in Europe)
GPS data per activity:
Duration: 45 min avg
Points: 45 * 60 = 2,700 points at 1 Hz
Raw point size: (lat: 8B, lng: 8B, elevation: 4B, timestamp: 8B, hr: 2B) = 30 bytes
Raw activity GPS data: 2,700 * 30 = 81 KB
After RDP compression: ~200 points retained = 6 KB
After polyline encoding: ~500 bytes
Segments:
Total segments: ~30 million worldwide
Segment match per activity: ~3 avg (runners in populated areas)
Segment detections/day: 5.7M * 3 = 17.1M
Segment detections/sec: ~198
Leaderboard queries:
Per activity upload: 3 segment leaderboards queried
Plus browsing: ~5M leaderboard views/day
Total leaderboard reads: 5.7M * 3 + 5M = ~22M/day = ~255/sec
Storage:
Activity metadata: ~1 KB (title, stats, user_id, sport_type)
Activity GPS (compressed): ~500 bytes (polyline-encoded)
Activity photos: ~2 MB avg (30% of activities have photos)
Per activity total: ~600 KB avg (with photos factored in)
Daily storage: 5.7M * 600 KB = 3.4 TB/day
Annual: 1.2 PB
The key insight: raw GPS data is 81 KB per activity. After the processing pipeline, it is 500 bytes. That is a 162x compression ratio. Without this pipeline, GPS storage would be 162x larger: about 462 GB/day instead of under 3 GB/day.
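The arithmetic above is easy to re-derive in a few lines, which helps when you want to change one assumption and see what moves. This is only a sketch: the constants are the figures stated in this section, and the variable names are illustrative.

```python
# Back-of-envelope check using the figures stated above.
ACTIVITIES_PER_DAY = 5_700_000
SECONDS_PER_DAY = 86_400
POINT_BYTES = 8 + 8 + 4 + 8 + 2        # lat, lng, elevation, timestamp, heart rate
POINTS_PER_ACTIVITY = 45 * 60          # 45 minutes at 1 Hz
ENCODED_BYTES = 500                    # polyline-encoded route

raw_bytes = POINTS_PER_ACTIVITY * POINT_BYTES
activities_per_sec = ACTIVITIES_PER_DAY / SECONDS_PER_DAY
compression_ratio = raw_bytes / ENCODED_BYTES
raw_gb_per_day = ACTIVITIES_PER_DAY * raw_bytes / 1e9
encoded_gb_per_day = ACTIVITIES_PER_DAY * ENCODED_BYTES / 1e9

print(f"raw GPS per activity: {raw_bytes} bytes")
print(f"average upload rate:  {activities_per_sec:.0f} activities/sec")
print(f"compression ratio:    {compression_ratio:.0f}x")
print(f"raw GPS per day:      {raw_gb_per_day:.0f} GB")
print(f"encoded GPS per day:  {encoded_gb_per_day:.2f} GB")
```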
Naive Design
Mobile app records GPS. Uploads raw points. Server stores in PostgreSQL. Leaderboards are SQL queries.
Core Entities
Before drawing boxes, define what data moves through the system:
- User — ID, profile, followers list
- Activity — A single workout session (sport type, start/end time, total distance, elevation gain)
- Route (GPS Data) — Raw coordinates recorded during an activity (lat, lng, elevation, timestamp per point)
- Segment — A predefined geographic stretch of road/trail that all users compete on
- Segment Effort — A user's elapsed time on a specific segment during a specific activity
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/activities | POST | Upload a completed activity (idempotent via UUID) |
| /api/activities/:id | GET | Get activity details with route and stats |
| /api/feed | GET | Get the user's social feed (paginated) |
| /api/segments/:id/leaderboard | GET | Get top 100 for a segment |
| /api/users/:id/activities | GET | List a user's activities (paginated) |
Naive Schema
CREATE TABLE activities (
    activity_id UUID, user_id UUID, sport_type VARCHAR(20),
    start_time TIMESTAMPTZ, end_time TIMESTAMPTZ,
    distance_meters FLOAT, elevation_gain FLOAT
);
-- One row per recorded GPS sample (2,700 rows for a 45-minute activity).
CREATE TABLE gps_points (
    activity_id UUID, sequence INT,
    lat DOUBLE PRECISION, lng DOUBLE PRECISION, elevation FLOAT,
    timestamp TIMESTAMPTZ, heart_rate INT,
    PRIMARY KEY (activity_id, sequence)
);
CREATE TABLE segments (segment_id UUID, name VARCHAR, polyline TEXT);
CREATE TABLE segment_efforts (
    segment_id UUID, user_id UUID, activity_id UUID,
    elapsed_time INT, -- seconds
    PRIMARY KEY (segment_id, user_id, activity_id)
);
Leaderboard query:
SELECT user_id, MIN(elapsed_time) as best_time
FROM segment_efforts
WHERE segment_id = :segment
GROUP BY user_id
ORDER BY best_time
LIMIT 100;
Works for 1,000 users. Falls apart at Strava scale.
Where It Breaks
Problem 1: Raw GPS Storage Is Wasteful
2,700 GPS points per activity * 30 bytes each = 81 KB. At 5.7M activities/day, that is 462 GB/day of raw GPS data. Most of these points are redundant -- when running in a straight line, 50 consecutive points say essentially the same thing. You are paying to store information that has zero visual or analytical value.
Problem 2: GPS Data Is Noisy
Consumer GPS has 3-10 meter accuracy. In urban areas with tall buildings (GPS multipath), accuracy drops to 30+ meters. A runner who runs a straight road will have GPS points zigzagging across the road. If you naively calculate distance by summing straight-line distances between consecutive points, you overestimate by 5-15%. That "10K run" becomes 11.2K.
Problem 3: Segment Leaderboard SQL Is O(N)
SELECT ... GROUP BY user_id ORDER BY best_time scans every row in segment_efforts for that segment. A popular segment in Central Park might have 500,000 efforts. This query takes 2-5 seconds. At 255 queries/sec, you need dedicated database servers for leaderboards.
Problem 4: Offline Data Creates Upload Conflicts
User records a run offline. Uploads at home. But the app crashes mid-upload. Is the activity saved or not? The user retries the upload. Now there are two copies. Without idempotent uploads, the activity appears twice in the feed. The leaderboard counts the same effort twice.
Problem 5: Privacy Zones Leak Information
Strava's heatmap (aggregated GPS data from all users) revealed the locations of secret military bases because soldiers were tracking their runs. Individual privacy zones (hiding 200 meters around home) can be defeated by observing where multiple activities start and end -- the cluster center reveals the home location even if individual start points are hidden.
Real Design
Architecture Overview
┌──────────────┐
│  Mobile App  │ ── records offline, syncs when connected
│  (offline    │
│   storage)   │
└──────┬───────┘
       │ (upload when online)
┌──────┴───────┐
│  Upload API  │ ── idempotent upload with activity hash
└──────┬───────┘
       │
┌──────┴───────┐
│  Processing  │ ── GPS pipeline (Kalman → distance → RDP → encode)
│  Pipeline    │ ── segment detection
│  (async)     │ ── statistics computation
└──────┬───────┘
       │
┌──────┴───────────────┐
│    ┌────────────┐    │
│    │ Activity   │    │
│    │ Store      │    │
│    │ (Postgres) │    │
│    └────────────┘    │
│    ┌────────────┐    │
│    │ Leaderboard│    │
│    │ Service    │    │
│    │ (Cassandra │    │
│    │  + Redis)  │    │
│    └────────────┘    │
│    ┌────────────┐    │
│    │ Feed       │    │
│    │ Service    │    │
│    └────────────┘    │
└──────────────────────┘
Component 1: Offline-First Architecture
This is the defining characteristic. The app must work with zero connectivity for the entire duration of an activity.
Client-side storage:
The mobile app stores activity data in a local SQLite database (iOS) or Room database (Android):
Local table: pending_activities
activity_uuid: UUID (generated client-side)
sport_type: VARCHAR
gps_points: BLOB (compressed array of GPS points)
heart_rate_data: BLOB (compressed array)
status: ENUM(recording, completed, uploading, uploaded)
created_at: TIMESTAMP
Upload protocol:
1. App generates a deterministic activity_uuid from:
hash(user_id + start_timestamp + first_5_gps_points)
2. App uploads: POST /api/activities
Header: Idempotency-Key: {activity_uuid}
Body: {
uuid: activity_uuid,
gps_points: [...],
sport_type: "run",
...
}
3. Server checks: does activity with this UUID exist?
If yes: return 200 (already processed). No duplicate.
If no: process and store. Return 201.
4. App marks local status as "uploaded" on 200 or 201.
Why client-generated UUID? Because the client cannot check with the server whether the activity already exists (it might be offline). The UUID must be deterministic from the activity data so that duplicate uploads produce the same UUID.
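A minimal sketch of the upload protocol above, assuming a name-based UUID (version 5) as the hash. The namespace value, the `activity_uuid` helper and the in-memory `UploadHandler` are all hypothetical stand-ins for illustration, not Strava's implementation.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as every client uses the same one.
ACTIVITY_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def activity_uuid(user_id, start_timestamp, gps_points):
    """Derive the same UUID every time from the same activity data."""
    # Fix the float precision so formatting cannot change the key between retries.
    first_five = ";".join(f"{lat:.6f},{lng:.6f}" for lat, lng in gps_points[:5])
    name = f"{user_id}|{start_timestamp}|{first_five}"
    return uuid.uuid5(ACTIVITY_NAMESPACE, name)


class UploadHandler:
    """In-memory stand-in for the server-side check in step 3."""

    def __init__(self):
        self._activities = {}

    def upload(self, idempotency_key, payload):
        if idempotency_key in self._activities:
            return 200  # already processed, no duplicate is created
        self._activities[idempotency_key] = payload
        return 201      # stored for the first time


points = [(37.7749, -122.4194)] * 6
key = str(activity_uuid("user_456", 1700000000, points))
server = UploadHandler()
print(server.upload(key, {"sport_type": "run"}))  # first attempt
print(server.upload(key, {"sport_type": "run"}))  # retry after a crash
```

The client treats both status codes as success, so a retry after a crash mid-upload is harmless.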
Battery optimization: GPS recording at 1 Hz drains battery. Optimizations:
- Use "significant location changes" API (iOS) when the user is stationary (paused at a traffic light). Resume 1 Hz when motion resumes.
- Flush the GPS buffer to disk in compressed form every 5 minutes, so the data survives if the OS kills the app.
- Disable screen during recording (GPS does not need the screen).
Component 2: GPS Processing Pipeline
Raw GPS data is noisy. The processing pipeline turns it into clean routes and accurate statistics.
Stage 1: Kalman Filter (smoothing)
A Kalman filter estimates the true position from noisy GPS readings by modeling the runner's motion.
State: (lat, lng, velocity_lat, velocity_lng)
Measurement: (gps_lat, gps_lng) with uncertainty ~5 meters

predict(dt):
    lat += velocity_lat * dt
    lng += velocity_lng * dt
    increase uncertainty

update(measurement):
    kalman_gain = predicted_uncertainty / (predicted_uncertainty + measurement_uncertainty)
    lat = predicted_lat + kalman_gain * (measured_lat - predicted_lat)
    lng = predicted_lng + kalman_gain * (measured_lng - predicted_lng)
    decrease uncertainty
Effect: The zigzagging GPS track becomes a smooth path that follows the road. Distance calculation on the smoothed path is accurate to within 1-2%.
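The predict/update loop is easier to see in one dimension. The sketch below is a deliberately simplified scalar Kalman filter: it drops the velocity terms and assumes a random-walk motion model, with positions already projected to meters along one axis. The `meas_var` and `process_var` values are illustrative assumptions, not tuned parameters.

```python
def kalman_smooth(measurements, meas_var=25.0, process_var=1.0):
    """Smooth a 1-D series of noisy positions (meters).

    meas_var:    measurement variance (25.0 corresponds to ~5 m standard deviation)
    process_var: how much uncertainty each step of motion adds
    """
    estimate = measurements[0]
    variance = meas_var
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the random-walk model keeps the position, uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement by their relative uncertainty.
        gain = variance / (variance + meas_var)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


# A runner on a straight road whose GPS readings zigzag +/- 10 m sideways.
noisy = [0.0, 10.0, -10.0, 10.0, -10.0, 10.0]
print([round(x, 2) for x in kalman_smooth(noisy)])
```

Run it once per axis for a rough 2-D smoother. A production filter would carry velocity in the state so that it does not lag behind a runner who is actually moving.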
Stage 2: Distance Calculation (Haversine)
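For each consecutive pair of smoothed points $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ (latitude and longitude in radians), calculate the great-circle distance:

$$a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad d = 2R \arcsin\!\left(\sqrt{a}\right)$$

where R = 6,371 km (Earth's radius). Sum all consecutive distances to get total activity distance.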
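A direct translation of the Haversine distance into code, using the same Earth radius. The function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # R = 6,371 km


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def total_distance_m(points):
    """Sum the distances between consecutive (lat, lng) points."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


# One degree of latitude is about 111.2 km.
print(round(haversine_m(0.0, 0.0, 1.0, 0.0)))
```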
Stage 3: Ramer-Douglas-Peucker Compression (RDP)
RDP removes redundant points that add no visual information. The algorithm:
- Draw a straight line from the first point to the last point.
- Find the point farthest from this line.
- If the farthest point is < epsilon (e.g., 5 meters) from the line, discard all intermediate points.
- If the farthest point is >= epsilon, keep it and recursively apply to the two sub-segments.
Effect: 2,700 GPS points become ~200 points. The route looks identical on a map (within 5 meters). Storage drops from 81 KB to 6 KB.
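The recursion described above fits in about twenty lines. This sketch assumes the points have already been projected to planar (x, y) coordinates in meters, so that epsilon is a plain distance; it is written recursively for clarity, and a very long, degenerate track could need an explicit stack instead.

```python
import math


def _perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (planar meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def rdp(points, epsilon=5.0):
    """Ramer-Douglas-Peucker simplification with tolerance epsilon (meters)."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _perpendicular_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax < epsilon or index == 0:
        return [first, last]               # everything in between is redundant
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right               # drop the duplicated split point


track = [(0, 0), (10, 1), (20, 0), (30, 20), (40, 0)]
print(rdp(track))
```

The 1-meter wobble at (10, 1) is dropped, while the 20-meter detour at (30, 20) survives.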
Stage 4: Google Polyline Encoding
The 200 retained points are encoded using Google's polyline encoding algorithm:
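- Multiply each latitude and longitude by 1e5 and round to an integer.
- Encode each coordinate as a delta from the previous point's rounded value (the first point is a delta from zero), so rounding errors do not accumulate.
- Encode each signed integer as a variable-length sequence of ASCII characters.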
Result: 200 points encode to ~500 bytes of ASCII text. This is what gets stored in the database and embedded in map URLs.
Pipeline execution time: All four stages run in ~50 ms per activity on a single CPU core. At 200 activities/sec peak, that is 200 * 50 ms = 10 seconds of CPU per second = 10 cores for the GPS pipeline. Trivial.
GPS write throughput note: If every GPS point were streamed to the server as it was recorded, 5.7M activities/day at ~2,700 points each would produce about 15 billion point writes per day, roughly 180K writes/sec on average and more at peak. That is beyond what a single PostgreSQL instance handles comfortably. Options: (a) client-side batching: the app accumulates GPS points locally and uploads the complete route on activity save (one bulk insert per activity, not one per GPS point); (b) append-only storage such as Cassandra or TimescaleDB for real-time GPS ingestion if live tracking is required; (c) write the route to S3 as a GPS file (GPX/JSON) and store only the processed summary in PostgreSQL. Strava uses approach (a): the offline-first architecture means GPS data stays on-device until the activity is complete, converting a real-time streaming problem into a manageable batch upload.
Component 3: Segment Detection
After processing the GPS data, the system checks if the activity crosses any of the 30 million segments.
Naive approach: Compare the activity's route against all 30M segments. O(N * M) where N = activity points and M = total segments. At 200 points * 30M segments, this is 6 billion comparisons per activity. Obviously impossible.
Geospatial pre-filter: Compute the bounding box of the activity route. Query a spatial index (R-tree or S2 cells) for segments that overlap the bounding box. A typical activity in a populated area overlaps ~500 segments (out of 30M). Comparison drops to 200 * 500 = 100,000 -- done in ~10 ms.
Matching algorithm: For each candidate segment, check if the activity's route passes through all waypoints of the segment in order, within a tolerance of 25 meters. If yes, calculate the elapsed time between the segment start and end waypoints. Record as a segment effort.
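A rough sketch of the two steps: a bounding-box test standing in for the spatial-index lookup, then the in-order waypoint check. The function names are hypothetical, distances use an equirectangular approximation (adequate at a 25-meter tolerance), and the matcher greedily takes the first track point near each waypoint, which is simpler than what a production matcher would do.

```python
import math


def approx_distance_m(p, q):
    """Equirectangular distance in meters between two (lat, lng) points."""
    (lat1, lng1), (lat2, lng2) = p, q
    x = math.radians(lng2 - lng1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6_371_000 * math.hypot(x, y)


def bounding_box(points):
    """(min_lat, min_lng, max_lat, max_lng) of a list of (lat, lng) points."""
    lats = [p[0] for p in points]
    lngs = [p[1] for p in points]
    return min(lats), min(lngs), max(lats), max(lngs)


def boxes_overlap(a, b):
    """Cheap pre-filter: can two bounding boxes intersect at all?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def match_segment(track, waypoints, tolerance_m=25.0):
    """Return elapsed seconds if the track hits every waypoint in order, else None.

    track:     list of (lat, lng, t_seconds)
    waypoints: list of (lat, lng)
    """
    i = 0
    hit_times = []
    for waypoint in waypoints:
        # Advance along the track until a point falls within tolerance.
        while i < len(track) and approx_distance_m(track[i][:2], waypoint) > tolerance_m:
            i += 1
        if i == len(track):
            return None  # ran out of track before reaching this waypoint
        hit_times.append(track[i][2])
    return hit_times[-1] - hit_times[0]


track = [(0.0, 0.000, 0), (0.0, 0.001, 30), (0.0, 0.002, 60), (0.0, 0.003, 90)]
segment = [(0.0, 0.001), (0.0, 0.003)]
print(match_segment(track, segment))
```

Because the track index never moves backwards, riding the segment in the wrong direction does not produce a match.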
Implementation: The segment spatial index is partitioned by S2 cell and cached in Redis. Each worker loads the segments for the activity's bounding box cells and runs the matching in memory. 3 segment matches per activity * 5.7M activities/day = 17.1M new segment efforts per day.
Component 4: Leaderboard Evolution (SQL -> Redis -> Cassandra + Kafka)
Strava's leaderboard has been rewritten multiple times as scale grew. This evolution is instructive.
Version 1 (SQL, < 100K users):
SELECT user_id, MIN(elapsed_time) AS best_time
FROM segment_efforts
WHERE segment_id = :id
GROUP BY user_id ORDER BY best_time LIMIT 100;
Works fine. Query takes ~50ms with an index on (segment_id, elapsed_time).
Version 2 (Redis sorted set, 100K-10M users):
Key: segment:{segment_id}:leaderboard
Type: Sorted Set
Members: user_id
Scores: best elapsed time in seconds
ZADD segment:123:leaderboard LT 1234 user_456 -- record 1234 seconds for user 456; LT (Redis 6.2+) keeps the existing score if it is already lower
ZRANGE segment:123:leaderboard 0 99 -- top 100
ZRANK segment:123:leaderboard user_456 -- user 456's rank
Sorted set writes and rank lookups are O(log N); reading the top M entries is O(log N + M). Query time: < 1 ms. Problem: 30M segments * 100 entries avg per leaderboard * 100 bytes avg per entry = 300 GB of Redis memory. Expensive but feasible.
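To make the sorted-set semantics concrete without a Redis server, here is a small in-memory model of one segment's leaderboard. The `SegmentLeaderboard` class is hypothetical and models behavior only: it re-sorts on every read, so it has none of the O(log N) performance of a real sorted set.

```python
class SegmentLeaderboard:
    """Behavioral model of one segment's sorted set: member = user, score = best time."""

    def __init__(self):
        self._best = {}  # user_id -> best elapsed time in seconds

    def record(self, user_id, elapsed_seconds):
        """Keep the effort only if it is the user's first or beats their best."""
        current = self._best.get(user_id)
        if current is None or elapsed_seconds < current:
            self._best[user_id] = elapsed_seconds
            return True
        return False

    def top(self, k=100):
        """Fastest k users; ties are broken by user_id, as sorted sets do by member."""
        return sorted(self._best.items(), key=lambda item: (item[1], item[0]))[:k]

    def rank(self, user_id):
        """0-based rank of a user, or None if they have no effort."""
        if user_id not in self._best:
            return None
        ordered = [user for user, _ in self.top(len(self._best))]
        return ordered.index(user_id)


board = SegmentLeaderboard()
board.record("user_456", 1234)
board.record("user_456", 1300)   # slower than the existing best, ignored
board.record("user_789", 1200)
print(board.top(100))
print(board.rank("user_456"))
```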
Version 3 (Cassandra + Kafka, 10M+ users):
At 120M users, popular segments have millions of efforts. Redis memory becomes prohibitive. Strava moved to:
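- Cassandra: Stores all segment efforts. Partitioned by segment_id. Ordered by elapsed_time within each partition. Top-K query: read the first 100 rows of the partition, a bounded read whose cost does not grow with the size of the partition. No GROUP BY needed, although if the partition holds every effort rather than each athlete's best, the reader has to de-duplicate by user.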
- Kafka: New segment efforts flow through Kafka. A consumer updates the Cassandra table and a Redis cache for the top 100 (hot leaderboard).
- Redis: Still caches the top 100 for fast reads. But the source of truth is Cassandra.
Current architecture: Cassandra for storage and full leaderboard. Redis for cached top-100. Kafka for real-time updates. The consumer also invalidates the Redis cache when a new effort displaces someone in the top 100.
Component 5: Privacy Zones
Users set privacy zones around sensitive locations (home, work, school). Activities that start or end within the zone have their GPS data hidden within the zone radius.
Implementation:
On activity display:
    for each privacy_zone of the activity's owner:
        if the activity starts or ends within the zone radius:
            trim GPS points inside the zone
            start/end the visible route at the zone boundary
            (never draw a line to the actual start/end, which would reveal it)
The cluster attack: If an attacker observes 100 activities from the same user, each starting at a different point on the zone boundary, the center of the cluster reveals the user's home. Mitigation:
- Random offset: Add a random but consistent offset (per-user seed) to the zone center. The zone appears shifted, so the cluster center is 200 meters from the actual home.
- Suppress nearby activities entirely: Do not show the first/last 200 meters at all (no line, no dot). The activity appears to start "out of nowhere." This is what Strava does by default.
- Heatmap exclusion: Strava's global heatmap excludes all GPS data within privacy zones. This prevents the military base exposure problem.
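A minimal sketch of the trimming step, under two assumptions: zones are circles given as (lat, lng, radius in meters), and only the hidden start and end of the route are removed, so a route that merely passes through a zone mid-activity is left alone. The function names are illustrative.

```python
import math


def _distance_m(p, q):
    """Equirectangular distance in meters between two (lat, lng) points."""
    (lat1, lng1), (lat2, lng2) = p, q
    x = math.radians(lng2 - lng1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6_371_000 * math.hypot(x, y)


def trim_privacy_zones(route, zones):
    """Drop leading and trailing points that fall inside any privacy zone.

    route: list of (lat, lng)
    zones: list of (center_lat, center_lng, radius_m) owned by the athlete
    """
    def hidden(point):
        return any(_distance_m(point, (lat, lng)) <= radius for lat, lng, radius in zones)

    start = 0
    while start < len(route) and hidden(route[start]):
        start += 1
    end = len(route)
    while end > start and hidden(route[end - 1]):
        end -= 1
    return route[start:end]


# Points roughly 111 m apart heading east from home at (0, 0).
route = [(0.0, i * 0.001) for i in range(6)]
home = (0.0, 0.0, 200.0)
print(trim_privacy_zones(route, [home]))
```

The visible route simply begins at the first point outside the zone; nothing is drawn back toward the hidden start.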
Component 6: File Format Support (FIT/GPX/TCX)
Users upload from Garmin, Wahoo, Polar in various formats. FIT (binary, ~500 KB/hr) is compact; GPX (XML, ~2 MB/hr) is human-readable. The upload API detects the format via magic bytes, parses with a format-specific parser, and normalizes to our internal representation before the GPS pipeline runs.
Deep Dives
Deep Dive 1: Pause Detection and Auto-Pause
Runners stop at traffic lights. Cyclists stop at cafes. The system must detect pauses and exclude them from moving time.
Speed-based detection:
for each GPS point:
    speed = distance(prev_point, curr_point) / time_delta
    if speed < 1.0 m/s (3.6 km/h):
        mark as paused
    elif speed < 2.0 m/s and prev_was_paused:
        still paused (hysteresis to avoid flicker)
    else:
        mark as moving
Metrics impacted:
- Moving time: Total time minus paused time. Used for pace calculation.
- Elapsed time: Total time including pauses. Used for segment efforts (Strava uses elapsed time for segment leaderboards -- pausing on a segment does not help you).
- Average pace: Calculated from moving time, not elapsed time.
Edge case: A runner doing interval training (sprint, walk, sprint) has frequent speed drops. The system must not count walking intervals as pauses. Solution: only mark as paused if speed < 1.0 m/s for > 10 seconds.
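The hysteresis rule and the 10-second minimum combine naturally into a two-pass classifier. This is a sketch under stated assumptions: speeds arrive as one sample per fixed interval, and a pause run includes the hysteresis samples that follow a stop. The function and parameter names are illustrative.

```python
def classify_pauses(speeds, dt=1.0, pause_below=1.0, resume_above=2.0, min_pause_s=10.0):
    """Return a list of booleans, True where the sample counts as paused.

    speeds: speed in m/s for each sample, spaced dt seconds apart
    """
    # Pass 1: hysteresis. Enter a pause below 1.0 m/s, leave it at 2.0 m/s or more.
    paused = []
    in_pause = False
    for speed in speeds:
        if speed < pause_below:
            in_pause = True
        elif speed >= resume_above:
            in_pause = False
        # Between the two thresholds the previous state is kept.
        paused.append(in_pause)

    # Pass 2: pause runs lasting 10 seconds or less are treated as moving.
    i = 0
    while i < len(paused):
        if not paused[i]:
            i += 1
            continue
        j = i
        while j < len(paused) and paused[j]:
            j += 1
        if (j - i) * dt <= min_pause_s:
            paused[i:j] = [False] * (j - i)
        i = j
    return paused


speeds = [3.0] * 5 + [0.5] * 5 + [3.0] * 5 + [0.2] * 12 + [1.5] * 3 + [3.0] * 5
flags = classify_pauses(speeds)
moving_time = sum(1 for flag in flags if not flag)
print(moving_time, "seconds moving out of", len(speeds))
```

The 5-second slowdown is ignored, while the 12-second stop plus the 3 seconds of slow restart counts as one 15-second pause.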
Deep Dive 2: Elevation Data
GPS-derived elevation has 15-30 meter accuracy — a flat road can show 200 meters of "elevation gain" from noise alone. Two fixes:
- Barometric altimeter (GPS watches): Accurate to ~1 meter. Device fuses barometric + GPS elevation via Kalman filter.
- DEM correction (phones): Server looks up true elevation at each (lat, lng) from a Digital Elevation Model (SRTM, 30m resolution). O(1) lookup per point. Strava uses this for all phone activities.
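Elevation gain: Sum positive deltas after smoothing. Without filtering this is sum(max(0, elev[i] - elev[i-1]) for i in range(1, len(elev))), which counts every bit of jitter as climbing. A 2-meter dead band fixes that: a climb or descent only counts once elevation has moved at least 2 meters from the last reference point.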
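One common way to implement a dead band is to keep a reference elevation and move it only when the reading has drifted at least the threshold away. The sketch below assumes that approach; the function name is illustrative.

```python
def elevation_gain(elev, dead_band=2.0):
    """Total ascent in meters, ignoring changes smaller than dead_band."""
    if not elev:
        return 0.0
    gain = 0.0
    reference = elev[0]
    for e in elev[1:]:
        delta = e - reference
        if delta >= dead_band:
            gain += delta        # a real climb: count it and move the reference up
            reference = e
        elif delta <= -dead_band:
            reference = e        # a real descent: move the reference down, count nothing
    return gain


flat_but_noisy = [100, 101, 100, 101, 100, 99, 100]
naive = sum(max(0, flat_but_noisy[i] - flat_but_noisy[i - 1]) for i in range(1, len(flat_but_noisy)))
print(naive, elevation_gain(flat_but_noisy))
```

On the flat but noisy series the naive sum reports 3 meters of climbing, while the dead-band version reports none.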
Deep Dive 3: Social Feed Generation
When a user uploads an activity, their followers see it in their feed.
Follower graph storage: The friendship table needs a composite key -- (follower_id, followed_id) -- not just follower_id alone. One user follows many others, so follower_id alone cannot be a unique key. Partition by follower_id with followed_id as sort key for "who do I follow?" queries. Create a reverse index partitioned by followed_id for "who follows me?" fan-out lookups.
Fan-out on write: When user A uploads, write a feed entry for each of A's followers. Strava's average user has ~50 followers. 5.7M activities/day * 50 followers = 285M feed entries/day.
Feed entry: (follower_id, activity_id, activity_user_id, timestamp, summary). ~200 bytes per entry. 285M * 200 = 57 GB/day. Stored in a denormalized table partitioned by follower_id for fast feed reads.
Feed read: SELECT * FROM feed WHERE follower_id = :me ORDER BY timestamp DESC LIMIT 20. With a partition key on follower_id, this is a single partition read in Cassandra: < 5ms.
Celebrity problem: A user with 1M followers (pro athletes) creates 1M feed entries per activity. Fan-out on write is expensive. Solution: hybrid model. Users with > 10K followers use fan-out on read: when a follower opens their feed, the system checks if any celebrities they follow have new activities and merges them into the pre-computed feed. This is the same hybrid fan-out pattern that Twitter uses.
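The hybrid model can be sketched with plain dictionaries standing in for the feed and follower tables. Everything here is hypothetical scaffolding: the `FeedService` class, its fields, and the in-memory storage exist only to show where the write path and read path diverge.

```python
class FeedService:
    """Fan-out on write for most users, fan-out on read for celebrities."""

    def __init__(self, followers, following, celebrity_threshold=10_000):
        self.followers = followers          # user_id -> set of follower ids
        self.following = following          # user_id -> set of followed ids
        self.threshold = celebrity_threshold
        self.feeds = {}                     # follower_id -> [(timestamp, activity_id)]
        self.celebrity_posts = {}           # celebrity_id -> [(timestamp, activity_id)]

    def is_celebrity(self, user_id):
        return len(self.followers.get(user_id, ())) > self.threshold

    def publish(self, user_id, activity_id, timestamp):
        entry = (timestamp, activity_id)
        if self.is_celebrity(user_id):
            # One write, no matter how many followers there are.
            self.celebrity_posts.setdefault(user_id, []).append(entry)
        else:
            for follower in self.followers.get(user_id, ()):
                self.feeds.setdefault(follower, []).append(entry)

    def read(self, user_id, limit=20):
        entries = list(self.feeds.get(user_id, []))
        # Merge in activities from followed celebrities at read time.
        for followed in self.following.get(user_id, ()):
            entries.extend(self.celebrity_posts.get(followed, []))
        return sorted(entries, reverse=True)[:limit]


followers = {"pro": {"a", "b", "c"}, "bob": {"a"}}
following = {"a": {"pro", "bob"}}
service = FeedService(followers, following, celebrity_threshold=2)
service.publish("bob", "act1", timestamp=1)
service.publish("pro", "act2", timestamp=2)
print(service.read("a"))
```

The threshold is lowered to 2 in the example so that three followers are enough to trigger the read path.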
Alternative Designs
| Approach | Pros | Cons | When to Use |
|---|---|---|---|
| Offline-first + GPS pipeline + Cassandra leaderboard (described above) | Handles 120M users. Accurate GPS processing. Scalable leaderboards. | Complex pipeline. Multiple storage systems. | Strava, Garmin Connect, Nike Run Club at scale. |
| Firebase Realtime Database | Real-time sync built in. Offline-first out of the box. | No GPS processing. 1 GB storage limit on the free tier. Cannot handle 30M segment leaderboards. | Prototype fitness app. < 10K users. |
| PostgreSQL with PostGIS | Spatial queries for segment detection. ACID for leaderboards. Simple. | Leaderboard queries degrade past 1M efforts. PostGIS segment detection is slow for 30M segments. | Small-scale fitness tracker. < 100K users. |
| TimescaleDB for GPS data | Time-series optimized. Compression built in. SQL interface. | No native leaderboard support. Still need a separate system for segments. | When GPS data analytics (aggregate trends, heatmaps) are primary use case. |
| Client-side processing only | Zero server compute for GPS pipeline. Immediate results. | Different devices produce different results (non-deterministic). Cannot validate for anti-cheating. | Casual fitness apps where accuracy and fairness are not critical. |
Scaling Math Verification
GPS Processing Pipeline
Activities per second (peak): 200
Processing time per activity: 50 ms (Kalman + distance + RDP + encoding)
CPU needed: 200 * 50 ms = 10 seconds/sec = 10 cores
Servers (8 cores each): 2 servers
Segment detection per activity: 10 ms (after spatial pre-filter)
CPU for segment detection: 200 * 10 ms = 2 seconds/sec = 2 cores
Total pipeline servers: 2-3 servers (trivial)
Leaderboard Storage
Total segments: 30 million
Avg efforts per segment: 100 (most segments are unpopular)
Popular segments (top 1%): 300K segments with 10K+ efforts each
Total effort records: 30M * 100 + 300K * 10K = 6B records
Cassandra:
Record size: ~100 bytes
Total storage: 6B * 100 = 600 GB
Partitioned by segment_id
Sorted by elapsed_time within partition
Redis cache (top 100):
Popular segments cached: 300K
Cache size: 300K * 100 * 50 bytes = 1.5 GB
Activity Storage
Daily activities: 5.7M
Storage per activity: 1.5 KB (metadata + compressed GPS) + 600 KB avg (with photos)
Daily storage: 5.7M * 600 KB = 3.4 TB
Annual: 1.24 PB
Storage cost (S3): $0.023/GB/month = ~$276/TB/year
Annual storage cost: 1,240 TB * $276 = ~$342K per year of data retained (photos dominate)
GPS compression savings:
Uncompressed GPS/activity: 81 KB
Compressed GPS/activity: 500 bytes
GPS savings per activity: ~80.5 KB
Daily GPS savings: 5.7M * 80.5 KB = ~458 GB/day = 167 TB/year
Annual GPS cost avoided: 167 TB * $276 = ~$46K/year
Without compression, GPS alone would add ~$46K/year in storage costs for each year of activities retained, a bill that keeps growing as history accumulates.
Upload and Sync
Uploads per second (peak): 200
Upload size: 200 KB avg (compressed FIT file)
Upload bandwidth: 200 * 200 KB = 40 MB/sec (trivial)
Processing queue: Kafka with 20 partitions
Consumer group: 20 consumers (one per partition)
Processing lag: < 5 seconds (activity visible in feed within 30 seconds including pipeline)
Failure Analysis
| Failure | Impact | Mitigation |
|---|---|---|
| App crashes during recording | GPS data since last checkpoint is lost. User loses part of their activity. | Checkpoint more often: write the GPS buffer to disk every 60 seconds instead of every 5 minutes. On crash recovery, prompt user to resume from last checkpoint or save the partial activity. |
| Upload fails (server error) | Activity stuck in local storage. User sees "upload pending." | Retry with exponential backoff: 10s, 30s, 60s, 300s. Idempotent upload prevents duplicates. Local activity is viewable (with limited features) even before upload. |
| GPS processing pipeline bug | Activities show wrong distance, pace, or route. Users complain. | Reprocess affected activities from raw GPS data (always retained for 90 days). A/B test pipeline changes. Canary deployment with comparison against reference activities. |
| Segment detection misses a segment | User does not see their effort on the leaderboard. Frustration. | Log candidate segments and match scores. If a user reports a missed segment, replay detection with debug logging. Common cause: GPS drift near segment start/end points. Widen tolerance from 25m to 50m for reported misses. |
| Leaderboard Redis cache crash | Leaderboard queries hit Cassandra directly. Latency increases from < 1 ms to ~50 ms. | Rebuild Redis cache from Cassandra. Takes ~10 minutes for top 300K segments. Cassandra latency (50 ms) is acceptable during rebuild. |
| Privacy zone misconfigured | User's home location is exposed through activity data. | Double-check zone during onboarding. Warn user if an activity starts within 200m of their home but outside a privacy zone. Retroactively apply new privacy zones to historical activities. |
| Cheating (GPS spoofing) | Fake leaderboard times. Destroys competition integrity. | Speed/acceleration anomaly detection: flag activities with > 40 km/h running speed or physically impossible acceleration. Cross-reference with heart rate (if available). Community flagging system. |
Level Expectations
| Level | What the Interviewer Expects |
|---|---|
| Mid (L4) | Store GPS points in database. Basic distance calculation. Simple leaderboard with SQL query. Upload API. Mentions offline recording as a requirement. |
| Senior (L5) | Kalman filter for GPS smoothing. RDP compression with quantified reduction (2700 -> 200 points). Polyline encoding for storage. Offline-first with idempotent upload (deterministic UUID). Segment detection with spatial pre-filtering. Redis sorted set for leaderboards. FIT/GPX file format support. |
| Staff+ (L6) | Leaderboard evolution narrative (SQL -> Redis -> Cassandra + Kafka) with scale thresholds. DEM-based elevation correction. Privacy zone attack analysis (cluster attack). Barometric vs. GPS elevation trade-off. Fan-out on write vs. read for social feed (celebrity problem). Pause detection with hysteresis. Quantified storage savings from compression pipeline. Reference to Strava heatmap military base incident. |