Design a Photo-Sharing Platform
If the URL shortener is a database problem and Dropbox is a synchronization problem, Instagram is a fan-out problem. The technical crux is not uploading photos or storing them -- that is solved infrastructure. The crux is delivering a personalized feed to 500 million people every day, including the moment a celebrity with 500 million followers posts a photo. That single post needs to appear in half a billion feeds. How fast can you do that?
The System
A photo-sharing platform lets users upload photos, follow other users, and view a feed of photos from people they follow. Users can like and comment on posts. The feed is the core product -- it is where users spend 90% of their time.
Instagram launched in 2010 and hit 30 million users within 18 months. At the time of its $1 billion acquisition by Facebook in April 2012, the entire backend was run by 3 engineers. Three. The stack was Django, PostgreSQL, Redis, S3, CloudFront, Celery, and Memcached -- all running on about 100 Amazon EC2 instances. No DynamoDB. No custom storage. No microservices. Just a well-sharded Postgres cluster and smart use of Redis for the feed cache. The "SQL can't scale" narrative dies on contact with Instagram's actual architecture.
By 2016, Instagram was serving 500 million monthly active users on fundamentally the same architecture, scaled up and absorbed into Facebook's infrastructure. The photo storage migrated to Haystack (Facebook's custom storage system), but the metadata layer remained PostgreSQL-based.
Requirements
Functional
- Post photos: Users upload photos with captions (max 8 MB per image)
- Follow users: Asymmetric follow model (you can follow someone without them following you back)
- View feed: See a reverse-chronological (or ranked) feed of posts from followed users
- Like/comment: Interact with posts (secondary, but commonly asked as follow-up)
Non-Functional
- Feed latency: Feed loads in under 200ms at the 99th percentile (users are impatient -- Instagram's feed must feel instant)
- Upload latency: Photo upload completes (from user perspective) in under 2 seconds, even though background processing continues
- Write throughput: 100M new posts per day (~1,160 posts/sec)
- Read throughput: 500M DAU, each opening the feed ~10 times/day = 5B feed reads/day (~58,000 reads/sec)
- Storage: Store all photos permanently (no TTL)
- Availability: 99.99% -- feed downtime means users open a competitor's app
Back-of-Envelope Math
Post volume:
100M posts/day
100,000,000 / 86,400 = ~1,160 posts/sec average
Peak (5x): ~5,800 posts/sec
Feed reads:
500M DAU * 10 opens/day = 5B reads/day
5,000,000,000 / 86,400 = ~58,000 reads/sec average
Peak (3x): ~174,000 reads/sec
Photo storage:
Each photo: 4 variants (thumbnail 150px ~15KB, feed 640px ~100KB,
full 1080px ~300KB, original ~1MB)
Total per photo: ~1.4 MB across all variants
Daily: 100M * 1.4 MB = 140 TB/day
Annual: 140 TB * 365 = ~51 PB/year
Metadata storage:
Each post: post_id (8B) + user_id (8B) + caption (200B) +
image_url (100B) + timestamps (16B) + counters (16B) = ~350 bytes
Daily: 100M * 350B = 35 GB/day
Annual: ~12.8 TB/year
Much smaller than photo storage. Database is not the problem.
Fan-out math (the hard one):
Average user has 500 followers
Average post fans out to 500 feed caches
1,160 posts/sec * 500 followers = 580,000 Redis writes/sec for fan-out
Celebrity post (500M followers):
At 100K Redis writes/sec per worker = 500,000,000 / 100,000 = 5,000 seconds = 83 minutes
A single celebrity post takes 83 minutes to fan out. Obviously unacceptable.
CDN bandwidth:
500M DAU * 50 images/session * 100KB average = ~2.5 PB egress/day from origin
With CDN (80% cache hit): origin serves ~500 TB/day
Edge serves ~2 PB/day at <50ms latency globally
The Naive Design
┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐
│ Client │────>│ API Server │────>│ PostgreSQL │ │ S3 │
│ │<────│ │<────│ │ │ (photos)│
└──────────┘ └──────────────┘ └──────────────┘ └──────────┘
POST /posts (upload photo + caption)
-> Store photo in S3
-> INSERT INTO posts (user_id, image_url, caption)
GET /feed
-> SELECT p.* FROM posts p
JOIN follows f ON f.following_id = p.user_id
WHERE f.follower_id = :current_user
ORDER BY p.created_at DESC
LIMIT 20
Clean. Simple. Falls apart at scale.
Where Does This Break First?
That feed query is a JOIN between the follows table (billions of rows) and the posts table (tens of billions of rows), filtered by a user's follow list, sorted by time, and limited to 20 results. Even with good indexes, this query touches too many partitions when a user follows 1,000 accounts. At 58,000 feed reads/sec, you are running 58,000 of these expensive JOINs per second. The database dies.
Where It Breaks
The feed query is the bottleneck. A user following 1,000 accounts forces the database to scan recent posts from 1,000 different users, merge-sort them by time, and return the top 20. With 58,000 of these queries per second, no amount of read replicas saves you.
The fundamental insight: computing the feed at read time is too expensive. You must pre-compute it.
This is the fan-out problem. You have two options -- compute the feed when a post is created (fan-out on write), or compute it when the feed is read (fan-out on read). Each has a fatal flaw at extreme scale. Instagram uses both.
The Real Design
┌──────────────────────────────────────────────────────────────────────┐
│ Upload Path │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Client │───>│ API Server│───>│ Post DB │ │ S3 (original)│ │
│ │ │ │ │───>│ (Postgres)│ │ │ │
│ │ │───>│ presigned │ └──────────┘ └──────┬───────┘ │
│ │ │ │ URL to S3 │ │ │
│ └──────────┘ └───────────┘ v │
│ ┌──────────────┐ │
│ │ Image Worker │ │
│ │ (resize to │ │
│ │ 4 variants) │ │
│ └──────┬───────┘ │
│ v │
│ ┌──────────────┐ │
│ │ S3 (variants) │ │
│ │ + CDN │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ Fan-Out Path │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────────────────────────┐ │
│ │ Post DB │───>│ Celery │───>│ Redis Feed Cache │ │
│ │ (trigger)│ │ Workers │ │ feed:{user_id} = ZSET of │ │
│ └──────────┘ │ (fan-out) │ │ post_ids scored by timestamp │ │
│ └───────────┘ └──────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ Read Path │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Client │───>│ API Server│───>│ Redis │───>│ Post Cache │ │
│ │ │<───│ (feed) │<───│ Feed │ │ (Memcached) │ │
│ │ │ │ │ │ Cache │ │ │ │
│ └──────────┘ └───────────┘ └──────────┘ └──────────────┘ │
│ │ │
│ │ miss │
│ v │
│ ┌──────────────┐ │
│ │ Post DB │ │
│ │ (Postgres) │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Upload Path
- Client sends
POST /postswith caption metadata to the API server. - API server writes post metadata to the Posts database (Postgres). The post status is set to
processing. - API server returns a presigned S3 URL to the client.
- Client uploads the original photo directly to S3 via the presigned URL. The API server never touches photo bytes.
- S3 event notification triggers an image processing worker.
- Worker generates 4 variants:
| Variant | Resolution | Use case | Size |
|---|---|---|---|
| Thumbnail | 150x150 px | Profile grid, search | ~15 KB |
| Feed | 640x640 px | Main feed on mobile | ~100 KB |
| Full resolution | 1080x1080 px | Full-screen view, web | ~300 KB |
| Original | As uploaded | Archival, future processing | Up to 8 MB |
- Worker stores all variants in S3 and updates the post status to
complete. - Once complete, the fan-out process begins.
Why presigned URLs? The API server should never proxy large binary uploads. Presigned URLs let the client upload directly to S3, keeping the API server free for metadata operations. The upload endpoint stays fast (<200ms) even while the user's photo is still transferring.
Fan-Out on Write (for Normal Users)
When a user with fewer than 100K followers posts a photo:
- Post is committed to the database.
- A Celery worker (or equivalent async task) fetches the poster's follower list.
- For each follower, the worker executes
ZADD feed:{follower_id} {timestamp} {post_id}in Redis. - The worker also trims the feed to the latest 1,000 posts:
ZREMRANGEBYRANK feed:{follower_id} 0 -1001.
A user with 500 followers generates 500 Redis writes. At sub-millisecond per write, this completes in under a second. The follower opens their app and the new post is already in their feed cache. No expensive JOIN at read time.
Fan-Out on Read (for Celebrities)
When a user with 500M followers posts a photo, fanning out to 500M Redis feed caches takes 83 minutes. By the time the last follower gets the update, the post is old news.
The fix: do not fan out celebrity posts at write time. Store them only in the posts table. At read time:
- Fetch the user's pre-computed feed from Redis (contains posts from non-celebrity accounts).
- Query the posts table for recent posts from the user's celebrity follows:
SELECT * FROM posts WHERE user_id IN (celebrity_ids) AND created_at > :cursor ORDER BY created_at DESC LIMIT 20. - Merge the two lists, sort by timestamp, return the top 20.
The celebrity query is fast because it hits a small number of user_id partitions (the user follows maybe 50 celebrities). It is only executed at read time for users who follow celebrities.
How does the system know who is a celebrity? Keep a set of celebrity user IDs in Redis, updated periodically by a background job that checks follower counts. The fan-out worker checks SISMEMBER celebrity_set {user_id} before deciding whether to fan out. This is O(1) and adds negligible latency.
The threshold is configurable. Instagram reportedly uses ~100K followers, but the exact number is not public and likely varies by other factors.
Feed Read Path
- Client requests
GET /feed?cursor={last_post_id}. - API server reads from Redis:
ZREVRANGEBYSCORE feed:{user_id} {cursor} -inf LIMIT 0 20. - This returns 20 post IDs.
- API server batch-fetches post metadata from Memcached (fallback to Postgres): post content, image URLs, like counts, comment counts.
- This two-step process is called feed hydration: first get the IDs, then fetch the full objects.
- API server returns the hydrated feed to the client. Image URLs point to the CDN.
- Client renders the feed. Images load from CDN edge servers (~10-50ms globally).
Cursor-based pagination (not offset pagination): The cursor is the last post ID the client saw. Because post IDs are time-sortable (more on this below), the cursor naturally anchors to a fixed point in the timeline. Unlike offset pagination, cursor pagination is immune to the problem of new posts shifting page boundaries.
Deep Dives

Deep Dive 1: ID Generation -- Instagram's Sharded IDs
Instagram needed IDs that are unique across all database shards, sortable by time, and fit in 64 bits. They rejected UUIDs (128 bits, poor index locality, not time-sortable) and centralized auto-increment (single point of failure). They also evaluated Twitter's Snowflake but rejected it because it required running a separate service -- and Instagram wanted to minimize moving parts.
What they built: a PL/pgSQL function that runs inside each PostgreSQL shard.
64-bit ID layout:
- 41 bits for timestamp: ~69 years of IDs (until ~2080). Using a custom epoch instead of Unix epoch (1970) maximizes the usable range.
- 13 bits for shard ID: Up to 8,192 logical shards. Hardcoded per shard in the PL/pgSQL function.
- 10 bits for sequence: Up to 1,024 IDs per millisecond per shard. Uses a PostgreSQL sequence that wraps around (modulo 1024).
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1314220021721;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
Why this matters for the interview:
- Sorting by ID = sorting by time. This enables cursor-based pagination where the cursor is just a post ID. No separate
created_atindex needed. - Shard routing from the ID itself. Given any post ID, extract the shard:
shard_id = (id >> 10) & 0x1FFF. Route reads to the correct shard with bitwise operations. No lookup table. - No central coordinator. Each shard generates IDs independently. A shard going down only affects that shard's ID generation, not the entire system.
- 64-bit vs 128-bit UUIDs. Half the storage per ID. Better B-tree locality. Fits in a machine word for comparisons. At billions of rows, this adds up.
Compare with Twitter Snowflake:
| Aspect | Instagram Sharded IDs | Twitter Snowflake |
|---|---|---|
| Infrastructure | None (runs inside PostgreSQL) | Separate Thrift service cluster |
| Coordination | None | ZooKeeper for worker IDs |
| Sequence per ms | 1,024 per shard | 4,096 per worker |
| Failure mode | Shard down = that shard stops | Snowflake down = no IDs anywhere |
I'd use Instagram's approach in an interview because it requires no new infrastructure. You already have PostgreSQL. The ID generation function lives in the database itself.
Deep Dive 2: Photo Storage and CDN
The photo storage layer is where this system burns money. 140 TB of new photo data per day. 51 PB per year. The CDN is not an optimization -- it is a survival requirement.
Why object storage, not filesystem
Traditional filesystems (POSIX, NFS) store each photo as a separate file. Each read requires 3 disk I/O operations: directory metadata lookup, inode read, file data read. When you have hundreds of billions of photos, the filesystem metadata (inodes, directory entries) cannot fit in RAM. Metadata I/O becomes the bottleneck, not data I/O.
Facebook's Haystack solved this with a fundamentally different approach. Multiple photos are appended sequentially into large "volume" files (~100 GB each). An in-memory index maps (photo_id, size_variant) to (volume_id, offset, data_size). Each index entry is ~40 bytes (vs ~500 bytes for a filesystem inode). Result: 1 disk I/O per photo read instead of 3.
For an interview, use S3 as the blob store. Mention Haystack as evidence that companies eventually build custom storage at extreme scale, but do not design a custom storage system in the interview unless asked.
CDN architecture
User (Singapore) --> Edge Server (Singapore PoP) --> Origin (S3, US-East)
[cache hit: ~10ms] [cache miss: ~200ms]
The CDN is pull-based: the first user requesting an image causes the edge server to fetch it from S3 and cache it. Subsequent requests from users near that edge server get cache hits at ~10ms.
Key numbers:
| Metric | Without CDN | With CDN |
|---|---|---|
| Image load latency (global) | 200-500ms | 10-50ms |
| Origin server load | 100% of requests | ~20% (80% cache hit) |
| Bandwidth cost | Full S3 egress | Edge-served |
| Daily origin egress | ~2.5 PB | ~500 TB |
Photos are immutable (once uploaded, the bytes never change), so CDN caching is trivially correct. Set long TTL (30 days). When a post is deleted, purge the CDN cache entries.
Cost optimization: Move original-resolution copies to S3 Glacier after 30 days (from $0.023/GB/month to $0.004/GB/month). Keep thumbnail and feed variants in S3 Standard since those get the most reads.
Deep Dive 3: The Celebrity Fan-Out Problem in Detail
This is where the interview gets interesting. Let me walk through the math more carefully.
The problem in numbers:
Cristiano Ronaldo posts a photo. 620M followers.
Fan-out on write: 620M Redis ZADD operations.
Best case: 100K writes/sec per Redis node, 10 nodes = 1M writes/sec
620M / 1M = 620 seconds = ~10 minutes
By the time the last follower sees the post, 10 minutes have passed.
But half of Instagram is following Ronaldo.
Those users expect to see his post within seconds.
Pure fan-out on write is impossible for celebrities. But pure fan-out on read has its own problem: if you compute the entire feed at read time for every user, you are back to expensive JOINs at 58,000 reads/sec.
The hybrid approach:
- Set a threshold: users with >100K followers are "celebrities."
- For normal users (<100K followers): fan-out on write. Push post IDs to all followers' Redis feed caches.
- For celebrities: do NOT fan out. Store the post only in the posts table.
- At feed read time: merge the pre-computed feed (from Redis) with recent celebrity posts (queried from the database).
Feed(user) = merge(
Redis.ZREVRANGE("feed:{user_id}", 0, 19), # pre-computed, normal posts
DB.query("SELECT id FROM posts
WHERE user_id IN (:celebrity_follows)
AND created_at > :1_day_ago
ORDER BY created_at DESC
LIMIT 20") # on-demand, celebrity posts
)
# sort merged results by timestamp, take top 20
Why this works: The average user follows maybe 30-50 celebrities. Querying 50 user partitions for recent posts is fast -- each partition lookup is an index scan on (user_id, created_at). The pre-computed feed covers the other 950 normal accounts the user follows. Merging two sorted lists of 20 items each is O(n) and takes microseconds.
Inactive user optimization: Do not maintain pre-computed feeds for users who have not opened the app in the last hour. When they return, compute their feed on demand (which is slower -- maybe 500ms instead of 50ms) and start maintaining the cache again. This saves enormous amounts of Redis memory. If 80% of registered users are inactive on any given day, this cuts feed cache Redis usage by 80%.
Alternative Designs
Alternative 1: Full Fan-Out on Read
Skip Redis feed caches entirely. At read time, fetch the user's follow list, query recent posts from all followed users, merge and sort.
This works if you have a database that handles 58,000 complex queries per second. DynamoDB with a partition key of user_id and sort key of post_id (which is time-sortable) can handle this -- each user's posts are a single partition scan. With enough provisioned capacity, this scales linearly.
Trade-off: higher read latency (100-200ms vs 20-50ms for pre-computed feeds), no pre-computation cost, simpler architecture.
Alternative 2: Change Data Capture for Fan-Out
Instead of the application explicitly enqueuing fan-out tasks, use CDC (Change Data Capture) from the posts table. When a row is inserted into the posts table, a CDC consumer (Debezium, DynamoDB Streams) automatically triggers the fan-out worker.
Trade-off: decouples the post service from the fan-out pipeline (post service does not need to know about fan-out), ensures atomicity (if the post insert succeeds, the CDC event is guaranteed), but adds operational complexity (another system to monitor).
| Aspect | Fan-Out on Write (Redis) | Full Fan-Out on Read | CDC-Triggered Fan-Out |
|---|---|---|---|
| Feed read latency | 20-50ms | 100-200ms | 20-50ms |
| Write amplification | High (N writes per post) | None | High (same as fan-out) |
| Celebrity handling | Needs hybrid approach | Naturally handled | Needs hybrid approach |
| Operational complexity | Medium (Redis cluster) | Low (just a database) | High (CDC pipeline) |
| Best for | Instagram-scale social | Smaller scale, read-heavy | Event-driven architectures |
Scaling Math Verification
Fan-out writes (580K Redis writes/sec for normal users):
- A Redis Cluster with 10 nodes handles 1M+ writes/sec.
- Our 580K writes/sec uses ~58% of cluster capacity.
- Comfortable with room for spikes.
Feed reads (58K reads/sec):
- Redis reads from feed cache: ~50K reads/sec (after some requests go to on-demand celebrity merge).
- Redis handles 100K+ reads/sec per node. A 5-node cluster handles this.
- Memcached for post hydration: 58K reads/sec * 20 posts/read = 1.16M key lookups/sec.
- Memcached handles 1M+ reads/sec per node. 2-3 nodes with hot-key sharding.
Photo upload processing (1,160 uploads/sec):
- Each upload triggers 3 resize operations (thumbnail + feed + full).
- 1,160 * 3 = 3,480 image processing tasks/sec.
- Each resize takes ~200ms of CPU time.
- Workers needed: 3,480 * 0.2 = ~700 concurrent workers.
- With 8-core machines, ~90 machines for the image processing fleet.
CDN egress (2.5 PB/day, 80% cache hit):
- Origin egress: 500 TB/day = ~46 Gbps sustained from S3.
- S3 handles this. CDN handles the rest from edge caches.
- CDN cost at ~$0.02/GB: $50K/day = $18M/year (this is why Instagram negotiates custom CDN contracts).
Failure Analysis
What breaks at 10x (1B DAU, 1B posts/day)?
| Component | Current | At 10x | Breaks? | Fix |
|---|---|---|---|---|
| Post database | ~12 TB/year metadata | ~128 TB/year | Maybe | More Postgres shards, archive old data |
| Photo storage (S3) | 51 PB/year | 510 PB/year | Cost | Haystack-style custom storage |
| Redis feed cache | ~5 node cluster | ~50 node cluster | Cost | Inactive user eviction, cold storage for feeds |
| Fan-out workers | 580K writes/sec | 5.8M writes/sec | Yes | More aggressive celebrity threshold, batching |
| Image processing | 90 machines | 900 machines | Cost | GPU acceleration, smarter encoding |
| CDN | $18M/year | $180M/year | Cost | Build edge infrastructure (like Facebook) |
The first things to break at 10x are the fan-out write rate and cost. At 5.8M Redis writes/sec, you need a very large Redis Cluster. The fix: lower the celebrity threshold (from 100K to 10K followers), which moves more users to fan-out-on-read and reduces write amplification at the cost of slightly higher read latency.
What if Redis goes down?
The feed cache is lost, but no data is lost permanently. Posts live in PostgreSQL. The system falls back to fan-out on read (the slow path) while the Redis cluster recovers. Feed latency increases from 50ms to 200ms. Users notice slower feeds but the app does not break.
For durability of the feed cache, enable Redis AOF (Append Only File) persistence and Redis Sentinel for automatic failover. The feed cache is reconstructable from the database, so losing it is an availability issue, not a durability issue.
What if a celebrity deletes a post?
Lazy deletion: Do not fan out a delete to 500M feed caches. Instead, when hydrating the feed (step 4 in the read path), check each post's status. If it is deleted, skip it. The deleted post ID naturally ages out of feed caches as new posts push it out (the ZREMRANGEBYRANK trim keeps feeds at 1,000 entries max).
This is simpler and cheaper than fanning out 500M delete operations.
What's Expected at Each Level
| Aspect | Mid-Level | Senior | Staff+ |
|---|---|---|---|
| High-level design | Upload + feed + basic components | Separate upload, fan-out, and read paths | Complete data flow with async processing pipeline |
| Feed generation | "Query the database" | Fan-out on write with Redis | Hybrid: fan-out on write + read for celebrities, with threshold justification and math |
| Photo storage | "Store in S3" | S3 + CDN, presigned URL upload | Image processing pipeline (4 variants), Haystack for context, storage tiering (Glacier) |
| ID generation | "Use auto-increment" or UUID | Mention need for distributed unique IDs | Instagram's sharded IDs (41+13+10 bit layout), compare with Snowflake, explain cursor pagination |
| Caching | "Add a cache" | Feed cache (Redis ZSET) + post cache (Memcached) | Two-cache architecture, feed hydration, inactive user eviction |
| Fan-out math | Not mentioned | "Celebrities are a problem" | Calculate: 500M followers / 100K writes per sec = 83 min. Derive the hybrid threshold. |
| Pagination | Offset-based | Cursor-based | Explain why offset breaks with real-time inserts, cursor = post ID because IDs are time-sortable |
| Likes/comments | "UPDATE likes = likes + 1" | Acknowledge contention problem | Redis INCR + async flush, sharded counters for viral posts |
| Scale awareness | "Add more servers" | Identify fan-out as the bottleneck | Full back-of-envelope: posts/sec, fan-out writes/sec, CDN egress, storage per year |
The signal interviewers look for at each level: mid-level candidates should complete the basic design. Senior candidates should identify and solve the celebrity fan-out problem. Staff candidates should prove with math why the hybrid approach is necessary and show they understand the cost implications at scale.
References from Our Courses
- Redis Data Structures and Use Cases — fan-out queues and counters for likes/comments
- Kafka Partitions and Ordering — ordered feed events for timeline generation
- Partition and Clustering Keys — Cassandra data model for time-sorted feed storage
- Redis Interview Patterns — sharded counters for viral post engagement
Red Team This Design
Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.