Design a Photo-Sharing Platform

If the URL shortener is a database problem and Dropbox is a synchronization problem, Instagram is a fan-out problem. The technical crux is not uploading photos or storing them -- that is solved infrastructure. The crux is delivering a personalized feed to 500 million people every day, including the moment a celebrity with 500 million followers posts a photo. That single post needs to appear in half a billion feeds. How fast can you do that?

The System

A photo-sharing platform lets users upload photos, follow other users, and view a feed of photos from people they follow. Users can like and comment on posts. The feed is the core product -- it is where users spend 90% of their time.

Instagram launched in 2010 and hit 30 million users within 18 months. At the time of its $1 billion acquisition by Facebook in April 2012, the entire backend was run by 3 engineers. Three. The stack was Django, PostgreSQL, Redis, S3, CloudFront, Celery, and Memcached -- all running on about 100 Amazon EC2 instances. No DynamoDB. No custom storage. No microservices. Just a well-sharded Postgres cluster and smart use of Redis for the feed cache. The "SQL can't scale" narrative dies on contact with Instagram's actual architecture.

By 2016, Instagram was serving 500 million monthly active users on fundamentally the same architecture, scaled up and absorbed into Facebook's infrastructure. The photo storage migrated to Haystack (Facebook's custom storage system), but the metadata layer remained PostgreSQL-based.

Requirements

Functional

Post photos: Users upload photos with captions (max 8 MB per image)
Follow users: Asymmetric follow model (you can follow someone without them following you back)
View feed: See a reverse-chronological (or ranked) feed of posts from followed users
Like/comment: Interact with posts (secondary, but commonly asked as follow-up)

Non-Functional

Feed latency: Feed loads in under 200ms at the 99th percentile (users are impatient -- Instagram's feed must feel instant)
Upload latency: Photo upload completes (from user perspective) in under 2 seconds, even though background processing continues
Write throughput: 100M new posts per day (~1,160 posts/sec)
Read throughput: 500M DAU, each opening the feed ~10 times/day = 5B feed reads/day (~58,000 reads/sec)
Storage: Store all photos permanently (no TTL)
Availability: 99.99% -- feed downtime means users open a competitor's app

Back-of-Envelope Math

Post volume:
  100M posts/day
  100,000,000 / 86,400 = ~1,160 posts/sec average
  Peak (5x): ~5,800 posts/sec

Feed reads:
  500M DAU * 10 opens/day = 5B reads/day
  5,000,000,000 / 86,400 = ~58,000 reads/sec average
  Peak (3x): ~174,000 reads/sec

Photo storage:
  Each photo: 4 variants (thumbnail 150px ~15KB, feed 640px ~100KB,
              full 1080px ~300KB, original ~1MB)
  Total per photo: ~1.4 MB across all variants
  Daily: 100M * 1.4 MB = 140 TB/day
  Annual: 140 TB * 365 = ~51 PB/year

Metadata storage:
  Each post: post_id (8B) + user_id (8B) + caption (200B) +
             image_url (100B) + timestamps (16B) + counters (16B) = ~350 bytes
  Daily: 100M * 350B = 35 GB/day
  Annual: ~12.8 TB/year
  Much smaller than photo storage. Database is not the problem.

Fan-out math (the hard one):
  Average user has 500 followers
  Average post fans out to 500 feed caches
  1,160 posts/sec * 500 followers = 580,000 Redis writes/sec for fan-out

  Celebrity post (500M followers):
  At 100K Redis writes/sec per worker = 500,000,000 / 100,000 = 5,000 seconds = 83 minutes
  A single celebrity post takes 83 minutes to fan out. Obviously unacceptable.

CDN bandwidth:
  500M DAU * 50 images/session * 100KB average = ~2.5 PB egress/day from origin
  With CDN (80% cache hit): origin serves ~500 TB/day
  Edge serves ~2 PB/day at <50ms latency globally

The Naive Design

┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────┐
│  Client   │────>│  API Server  │────>│  PostgreSQL   │     │   S3     │
│           │<────│              │<────│              │     │  (photos)│
└──────────┘     └──────────────┘     └──────────────┘     └──────────┘

POST /posts (upload photo + caption)
  -> Store photo in S3
  -> INSERT INTO posts (user_id, image_url, caption)

GET /feed
  -> SELECT p.* FROM posts p
     JOIN follows f ON f.following_id = p.user_id
     WHERE f.follower_id = :current_user
     ORDER BY p.created_at DESC
     LIMIT 20

Clean. Simple. Falls apart at scale.

Where Does This Break First?

That feed query is a JOIN between the follows table (billions of rows) and the posts table (tens of billions of rows), filtered by a user's follow list, sorted by time, and limited to 20 results. Even with good indexes, this query touches too many partitions when a user follows 1,000 accounts. At 58,000 feed reads/sec, you are running 58,000 of these expensive JOINs per second. The database dies.

Where It Breaks

The feed query is the bottleneck. A user following 1,000 accounts forces the database to scan recent posts from 1,000 different users, merge-sort them by time, and return the top 20. With 58,000 of these queries per second, no amount of read replicas saves you.

The fundamental insight: computing the feed at read time is too expensive. You must pre-compute it.

This is the fan-out problem. You have two options -- compute the feed when a post is created (fan-out on write), or compute it when the feed is read (fan-out on read). Each has a fatal flaw at extreme scale. Instagram uses both.

The Real Design

┌──────────────────────────────────────────────────────────────────────┐
│                              Upload Path                             │
│                                                                      │
│  ┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────────┐  │
│  │  Client   │───>│ API Server│───>│ Post DB  │    │ S3 (original)│  │
│  │           │    │           │───>│ (Postgres)│    │              │  │
│  │           │───>│ presigned │    └──────────┘    └──────┬───────┘  │
│  │           │    │ URL to S3 │                           │          │
│  └──────────┘    └───────────┘                           v          │
│                                                   ┌──────────────┐  │
│                                                   │ Image Worker  │  │
│                                                   │ (resize to    │  │
│                                                   │  4 variants)  │  │
│                                                   └──────┬───────┘  │
│                                                          v          │
│                                                   ┌──────────────┐  │
│                                                   │ S3 (variants) │  │
│                                                   │ + CDN         │  │
│                                                   └──────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│                             Fan-Out Path                             │
│                                                                      │
│  ┌──────────┐    ┌───────────┐    ┌──────────────────────────────┐   │
│  │ Post DB  │───>│  Celery   │───>│  Redis Feed Cache            │   │
│  │ (trigger)│    │  Workers  │    │  feed:{user_id} = ZSET of    │   │
│  └──────────┘    │ (fan-out) │    │  post_ids scored by timestamp │   │
│                  └───────────┘    └──────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│                              Read Path                               │
│                                                                      │
│  ┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────────┐  │
│  │  Client   │───>│ API Server│───>│  Redis   │───>│ Post Cache   │  │
│  │           │<───│ (feed)    │<───│  Feed    │    │ (Memcached)  │  │
│  │           │    │           │    │  Cache   │    │              │  │
│  └──────────┘    └───────────┘    └──────────┘    └──────────────┘  │
│                                                          │          │
│                                                          │ miss     │
│                                                          v          │
│                                                   ┌──────────────┐  │
│                                                   │  Post DB     │  │
│                                                   │  (Postgres)  │  │
│                                                   └──────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Upload Path

Client sends POST /posts with caption metadata to the API server.
API server writes post metadata to the Posts database (Postgres). The post status is set to processing.
API server returns a presigned S3 URL to the client.
Client uploads the original photo directly to S3 via the presigned URL. The API server never touches photo bytes.
S3 event notification triggers an image processing worker.
Worker generates 4 variants:

Variant	Resolution	Use case	Size
Thumbnail	150x150 px	Profile grid, search	~15 KB
Feed	640x640 px	Main feed on mobile	~100 KB
Full resolution	1080x1080 px	Full-screen view, web	~300 KB
Original	As uploaded	Archival, future processing	Up to 8 MB

Worker stores all variants in S3 and updates the post status to complete.
Once complete, the fan-out process begins.

Why presigned URLs? The API server should never proxy large binary uploads. Presigned URLs let the client upload directly to S3, keeping the API server free for metadata operations. The upload endpoint stays fast (<200ms) even while the user's photo is still transferring.

Fan-Out on Write (for Normal Users)

When a user with fewer than 100K followers posts a photo:

Post is committed to the database.
A Celery worker (or equivalent async task) fetches the poster's follower list.
For each follower, the worker executes ZADD feed:{follower_id} {timestamp} {post_id} in Redis.
The worker also trims the feed to the latest 1,000 posts: ZREMRANGEBYRANK feed:{follower_id} 0 -1001.

A user with 500 followers generates 500 Redis writes. At sub-millisecond per write, this completes in under a second. The follower opens their app and the new post is already in their feed cache. No expensive JOIN at read time.

Fan-Out on Read (for Celebrities)

When a user with 500M followers posts a photo, fanning out to 500M Redis feed caches takes 83 minutes. By the time the last follower gets the update, the post is old news.

The fix: do not fan out celebrity posts at write time. Store them only in the posts table. At read time:

Fetch the user's pre-computed feed from Redis (contains posts from non-celebrity accounts).
Query the posts table for recent posts from the user's celebrity follows: SELECT * FROM posts WHERE user_id IN (celebrity_ids) AND created_at > :cursor ORDER BY created_at DESC LIMIT 20.
Merge the two lists, sort by timestamp, return the top 20.

The celebrity query is fast because it hits a small number of user_id partitions (the user follows maybe 50 celebrities). It is only executed at read time for users who follow celebrities.

How does the system know who is a celebrity? Keep a set of celebrity user IDs in Redis, updated periodically by a background job that checks follower counts. The fan-out worker checks SISMEMBER celebrity_set {user_id} before deciding whether to fan out. This is O(1) and adds negligible latency.

The threshold is configurable. Instagram reportedly uses ~100K followers, but the exact number is not public and likely varies by other factors.

Feed Read Path

Client requests GET /feed?cursor={last_post_id}.
API server reads from Redis: ZREVRANGEBYSCORE feed:{user_id} {cursor} -inf LIMIT 0 20.
This returns 20 post IDs.
API server batch-fetches post metadata from Memcached (fallback to Postgres): post content, image URLs, like counts, comment counts.
This two-step process is called feed hydration: first get the IDs, then fetch the full objects.
API server returns the hydrated feed to the client. Image URLs point to the CDN.
Client renders the feed. Images load from CDN edge servers (~10-50ms globally).

Cursor-based pagination (not offset pagination): The cursor is the last post ID the client saw. Because post IDs are time-sortable (more on this below), the cursor naturally anchors to a fixed point in the timeline. Unlike offset pagination, cursor pagination is immune to the problem of new posts shifting page boundaries.

Deep Dives

Photo Sharing Platform — Photo Sharing High-Level Design

Deep Dive 1: ID Generation -- Instagram's Sharded IDs

Instagram needed IDs that are unique across all database shards, sortable by time, and fit in 64 bits. They rejected UUIDs (128 bits, poor index locality, not time-sortable) and centralized auto-increment (single point of failure). They also evaluated Twitter's Snowflake but rejected it because it required running a separate service -- and Instagram wanted to minimize moving parts.

What they built: a PL/pgSQL function that runs inside each PostgreSQL shard.

64-bit ID layout:

| 41 bits: ms since custom epoch (Aug 24, 2011) | 13 bits: shard ID | 10 bits: sequence |

41 bits for timestamp: ~69 years of IDs (until ~2080). Using a custom epoch instead of Unix epoch (1970) maximizes the usable range.
13 bits for shard ID: Up to 8,192 logical shards. Hardcoded per shard in the PL/pgSQL function.
10 bits for sequence: Up to 1,024 IDs per millisecond per shard. Uses a PostgreSQL sequence that wraps around (modulo 1024).

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Why this matters for the interview:

Sorting by ID = sorting by time. This enables cursor-based pagination where the cursor is just a post ID. No separate created_at index needed.
Shard routing from the ID itself. Given any post ID, extract the shard: shard_id = (id >> 10) & 0x1FFF. Route reads to the correct shard with bitwise operations. No lookup table.
No central coordinator. Each shard generates IDs independently. A shard going down only affects that shard's ID generation, not the entire system.
64-bit vs 128-bit UUIDs. Half the storage per ID. Better B-tree locality. Fits in a machine word for comparisons. At billions of rows, this adds up.

Compare with Twitter Snowflake:

Aspect	Instagram Sharded IDs	Twitter Snowflake
Infrastructure	None (runs inside PostgreSQL)	Separate Thrift service cluster
Coordination	None	ZooKeeper for worker IDs
Sequence per ms	1,024 per shard	4,096 per worker
Failure mode	Shard down = that shard stops	Snowflake down = no IDs anywhere

I'd use Instagram's approach in an interview because it requires no new infrastructure. You already have PostgreSQL. The ID generation function lives in the database itself.

Deep Dive 2: Photo Storage and CDN

The photo storage layer is where this system burns money. 140 TB of new photo data per day. 51 PB per year. The CDN is not an optimization -- it is a survival requirement.

Why object storage, not filesystem

Traditional filesystems (POSIX, NFS) store each photo as a separate file. Each read requires 3 disk I/O operations: directory metadata lookup, inode read, file data read. When you have hundreds of billions of photos, the filesystem metadata (inodes, directory entries) cannot fit in RAM. Metadata I/O becomes the bottleneck, not data I/O.

Facebook's Haystack solved this with a fundamentally different approach. Multiple photos are appended sequentially into large "volume" files (~100 GB each). An in-memory index maps (photo_id, size_variant) to (volume_id, offset, data_size). Each index entry is ~40 bytes (vs ~500 bytes for a filesystem inode). Result: 1 disk I/O per photo read instead of 3.

For an interview, use S3 as the blob store. Mention Haystack as evidence that companies eventually build custom storage at extreme scale, but do not design a custom storage system in the interview unless asked.

CDN architecture

User (Singapore)  -->  Edge Server (Singapore PoP)  -->  Origin (S3, US-East)
                       [cache hit: ~10ms]               [cache miss: ~200ms]

The CDN is pull-based: the first user requesting an image causes the edge server to fetch it from S3 and cache it. Subsequent requests from users near that edge server get cache hits at ~10ms.

Key numbers:

Metric	Without CDN	With CDN
Image load latency (global)	200-500ms	10-50ms
Origin server load	100% of requests	~20% (80% cache hit)
Bandwidth cost	Full S3 egress	Edge-served
Daily origin egress	~2.5 PB	~500 TB

Photos are immutable (once uploaded, the bytes never change), so CDN caching is trivially correct. Set long TTL (30 days). When a post is deleted, purge the CDN cache entries.

Cost optimization: Move original-resolution copies to S3 Glacier after 30 days (from $0.023/GB/month to $0.004/GB/month). Keep thumbnail and feed variants in S3 Standard since those get the most reads.

Deep Dive 3: The Celebrity Fan-Out Problem in Detail

This is where the interview gets interesting. Let me walk through the math more carefully.

The problem in numbers:

Cristiano Ronaldo posts a photo. 620M followers.
Fan-out on write: 620M Redis ZADD operations.

Best case: 100K writes/sec per Redis node, 10 nodes = 1M writes/sec
620M / 1M = 620 seconds = ~10 minutes

By the time the last follower sees the post, 10 minutes have passed.
But half of Instagram is following Ronaldo. 
Those users expect to see his post within seconds.

Pure fan-out on write is impossible for celebrities. But pure fan-out on read has its own problem: if you compute the entire feed at read time for every user, you are back to expensive JOINs at 58,000 reads/sec.

The hybrid approach:

Set a threshold: users with >100K followers are "celebrities."
For normal users (<100K followers): fan-out on write. Push post IDs to all followers' Redis feed caches.
For celebrities: do NOT fan out. Store the post only in the posts table.
At feed read time: merge the pre-computed feed (from Redis) with recent celebrity posts (queried from the database).

Feed(user) = merge(
    Redis.ZREVRANGE("feed:{user_id}", 0, 19),     # pre-computed, normal posts
    DB.query("SELECT id FROM posts 
              WHERE user_id IN (:celebrity_follows) 
              AND created_at > :1_day_ago
              ORDER BY created_at DESC 
              LIMIT 20")                             # on-demand, celebrity posts
)
# sort merged results by timestamp, take top 20

Why this works: The average user follows maybe 30-50 celebrities. Querying 50 user partitions for recent posts is fast -- each partition lookup is an index scan on (user_id, created_at). The pre-computed feed covers the other 950 normal accounts the user follows. Merging two sorted lists of 20 items each is O(n) and takes microseconds.

Inactive user optimization: Do not maintain pre-computed feeds for users who have not opened the app in the last hour. When they return, compute their feed on demand (which is slower -- maybe 500ms instead of 50ms) and start maintaining the cache again. This saves enormous amounts of Redis memory. If 80% of registered users are inactive on any given day, this cuts feed cache Redis usage by 80%.

Alternative Designs

Alternative 1: Full Fan-Out on Read

Skip Redis feed caches entirely. At read time, fetch the user's follow list, query recent posts from all followed users, merge and sort.

This works if you have a database that handles 58,000 complex queries per second. DynamoDB with a partition key of user_id and sort key of post_id (which is time-sortable) can handle this -- each user's posts are a single partition scan. With enough provisioned capacity, this scales linearly.

Trade-off: higher read latency (100-200ms vs 20-50ms for pre-computed feeds), no pre-computation cost, simpler architecture.

Alternative 2: Change Data Capture for Fan-Out

Instead of the application explicitly enqueuing fan-out tasks, use CDC (Change Data Capture) from the posts table. When a row is inserted into the posts table, a CDC consumer (Debezium, DynamoDB Streams) automatically triggers the fan-out worker.

Trade-off: decouples the post service from the fan-out pipeline (post service does not need to know about fan-out), ensures atomicity (if the post insert succeeds, the CDC event is guaranteed), but adds operational complexity (another system to monitor).

Aspect	Fan-Out on Write (Redis)	Full Fan-Out on Read	CDC-Triggered Fan-Out
Feed read latency	20-50ms	100-200ms	20-50ms
Write amplification	High (N writes per post)	None	High (same as fan-out)
Celebrity handling	Needs hybrid approach	Naturally handled	Needs hybrid approach
Operational complexity	Medium (Redis cluster)	Low (just a database)	High (CDC pipeline)
Best for	Instagram-scale social	Smaller scale, read-heavy	Event-driven architectures

Scaling Math Verification

Fan-out writes (580K Redis writes/sec for normal users):

A Redis Cluster with 10 nodes handles 1M+ writes/sec.
Our 580K writes/sec uses ~58% of cluster capacity.
Comfortable with room for spikes.

Feed reads (58K reads/sec):

Redis reads from feed cache: ~50K reads/sec (after some requests go to on-demand celebrity merge).
Redis handles 100K+ reads/sec per node. A 5-node cluster handles this.
Memcached for post hydration: 58K reads/sec * 20 posts/read = 1.16M key lookups/sec.
Memcached handles 1M+ reads/sec per node. 2-3 nodes with hot-key sharding.

Photo upload processing (1,160 uploads/sec):

Each upload triggers 3 resize operations (thumbnail + feed + full).
1,160 * 3 = 3,480 image processing tasks/sec.
Each resize takes ~200ms of CPU time.
Workers needed: 3,480 * 0.2 = ~700 concurrent workers.
With 8-core machines, ~90 machines for the image processing fleet.

CDN egress (2.5 PB/day, 80% cache hit):

Origin egress: 500 TB/day = ~46 Gbps sustained from S3.
S3 handles this. CDN handles the rest from edge caches.
CDN cost at ~$0.02/GB: $50K/day = $18M/year (this is why Instagram negotiates custom CDN contracts).

Failure Analysis

What breaks at 10x (1B DAU, 1B posts/day)?

Component	Current	At 10x	Breaks?	Fix
Post database	~12 TB/year metadata	~128 TB/year	Maybe	More Postgres shards, archive old data
Photo storage (S3)	51 PB/year	510 PB/year	Cost	Haystack-style custom storage
Redis feed cache	~5 node cluster	~50 node cluster	Cost	Inactive user eviction, cold storage for feeds
Fan-out workers	580K writes/sec	5.8M writes/sec	Yes	More aggressive celebrity threshold, batching
Image processing	90 machines	900 machines	Cost	GPU acceleration, smarter encoding
CDN	$18M/year	$180M/year	Cost	Build edge infrastructure (like Facebook)

The first things to break at 10x are the fan-out write rate and cost. At 5.8M Redis writes/sec, you need a very large Redis Cluster. The fix: lower the celebrity threshold (from 100K to 10K followers), which moves more users to fan-out-on-read and reduces write amplification at the cost of slightly higher read latency.

What if Redis goes down?

The feed cache is lost, but no data is lost permanently. Posts live in PostgreSQL. The system falls back to fan-out on read (the slow path) while the Redis cluster recovers. Feed latency increases from 50ms to 200ms. Users notice slower feeds but the app does not break.

For durability of the feed cache, enable Redis AOF (Append Only File) persistence and Redis Sentinel for automatic failover. The feed cache is reconstructable from the database, so losing it is an availability issue, not a durability issue.

What if a celebrity deletes a post?

Lazy deletion: Do not fan out a delete to 500M feed caches. Instead, when hydrating the feed (step 4 in the read path), check each post's status. If it is deleted, skip it. The deleted post ID naturally ages out of feed caches as new posts push it out (the ZREMRANGEBYRANK trim keeps feeds at 1,000 entries max).

This is simpler and cheaper than fanning out 500M delete operations.

What's Expected at Each Level

Aspect	Mid-Level	Senior	Staff+
High-level design	Upload + feed + basic components	Separate upload, fan-out, and read paths	Complete data flow with async processing pipeline
Feed generation	"Query the database"	Fan-out on write with Redis	Hybrid: fan-out on write + read for celebrities, with threshold justification and math
Photo storage	"Store in S3"	S3 + CDN, presigned URL upload	Image processing pipeline (4 variants), Haystack for context, storage tiering (Glacier)
ID generation	"Use auto-increment" or UUID	Mention need for distributed unique IDs	Instagram's sharded IDs (41+13+10 bit layout), compare with Snowflake, explain cursor pagination
Caching	"Add a cache"	Feed cache (Redis ZSET) + post cache (Memcached)	Two-cache architecture, feed hydration, inactive user eviction
Fan-out math	Not mentioned	"Celebrities are a problem"	Calculate: 500M followers / 100K writes per sec = 83 min. Derive the hybrid threshold.
Pagination	Offset-based	Cursor-based	Explain why offset breaks with real-time inserts, cursor = post ID because IDs are time-sortable
Likes/comments	"UPDATE likes = likes + 1"	Acknowledge contention problem	Redis INCR + async flush, sharded counters for viral posts
Scale awareness	"Add more servers"	Identify fan-out as the bottleneck	Full back-of-envelope: posts/sec, fan-out writes/sec, CDN egress, storage per year

The signal interviewers look for at each level: mid-level candidates should complete the basic design. Senior candidates should identify and solve the celebrity fan-out problem. Staff candidates should prove with math why the hybrid approach is necessary and show they understand the cost implications at scale.

References from Our Courses

Redis Data Structures and Use Cases — fan-out queues and counters for likes/comments
Kafka Partitions and Ordering — ordered feed events for timeline generation
Partition and Clustering Keys — Cassandra data model for time-sorted feed storage
Redis Interview Patterns — sharded counters for viral post engagement

Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Photo-Sharing Platform →