Design a Ticket Booking System
TL;DR
Build a system that sells fixed-inventory seats to millions of concurrent users without overselling, double-booking, or melting under load. The core pattern is a two-tier event model: 99% of events (local bands, small conferences) have zero contention and can be booked with simple optimistic locking, while 1% of events (Taylor Swift, World Cup) need a completely different architecture -- virtual waiting rooms with pre-queue randomization, CDN-served static pages during the crush, and seat reservations with TTL to handle payment abandonment. The Taylor Swift Eras Tour presale saw 14 million users hitting Ticketmaster simultaneously, generating 3.5 billion system requests. The system did not handle it well. Your job is to design one that would.
The System
Think Ticketmaster, BookMyShow, or StubHub. A venue has 70,000 seats. An event goes on sale at 10:00 AM. At 9:59 AM, 500,000 users are refreshing the page. At 10:00:00, they all click "Buy Tickets" simultaneously. You have 70,000 seats and 500,000 buyers. 430,000 people must be told "no" quickly and gracefully. The 70,000 who get through must each get exactly one seat, with no double-bookings, and their payment must complete within 10 minutes or the seat is released.
Why is this harder than, say, selling shoes on Amazon? Because of the inventory model. Amazon has elastic inventory -- if a product sells out, they restock. A concert seat is a one-time, non-fungible item. Seat A7 in Row 3 exists exactly once. Two people cannot both buy it. And the entire inventory sells out in minutes, not days. This is a thundering herd problem with hard consistency requirements on a rapidly depleting, finite inventory.
Requirements
Functional Requirements
| Requirement | Details |
|---|---|
| Event creation | Venues create events with seating charts, pricing tiers, sale start times. |
| Seat browsing | Users view available seats on an interactive map. |
| Seat reservation | User selects seats and has a time-limited hold (10 minutes) to complete payment. |
| Payment processing | Integrate with payment gateway. Confirm booking on success, release on failure/timeout. |
| Ticket delivery | Generate QR-coded e-tickets. Send via email and in-app. |
| Waitlist | When sold out, users can join a waitlist for cancellations. |
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Reservation latency | < 500 ms for seat hold |
| Payment completion window | 10 minutes (configurable per event) |
| Consistency | Zero overselling. Every seat sold to exactly one buyer. |
| Availability | 99.9% for normal events. Graceful degradation for high-demand events. |
| Scale | 50M users, ~100K concurrent users inside the booking flow for top-tier events (millions more held in the waiting room), 3.5B requests/day peak |
Back-of-Envelope Math
Normal event (99% of events):
Venue capacity: 500-5,000 seats
Concurrent buyers: 100-1,000
Requests/sec: 50-500
Duration of sale: hours to days
Contention: near zero
High-demand event (1% of events):
Venue capacity: 70,000 seats (stadium)
Concurrent buyers: 500,000-14,000,000 (Eras Tour numbers)
Requests/sec: 500K-3M at peak
Duration of sale: 2-10 minutes for sellout
Contention: extreme
Seat reservation:
Each reservation = (user_id, seat_id, event_id, expires_at, status) = ~100 bytes
Reservations per high-demand event: 70,000 active + 200,000 expired/released = 270K
Total storage per event: 270K * 100 bytes = 27 MB (trivial)
Payment:
Successful payment rate: ~80% (20% abandon during 10-min window)
Seats released/recycled: 70,000 * 0.2 = 14,000 seats recycled
Average recycling time: 10 min TTL + 30 sec processing = 10.5 min
The key number: 14 million concurrent users for one event. No application server fleet can handle 14M simultaneous HTTP connections. You must shed load before it reaches your servers. This is the virtual waiting room problem.
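The storage and recycling figures above can be re-derived in a few lines. This is only an arithmetic check; the variable names are illustrative and every input is an estimate stated in this section.

```python
# Back-of-envelope check using the estimates from this section.
SEATS = 70_000
EXPIRED_RESERVATIONS = 200_000   # expired/released holds (estimate from the text)
BYTES_PER_RESERVATION = 100      # ~100 bytes per reservation record
ABANDON_RATE = 0.20              # 20% abandon during the payment window
TTL_SECONDS = 10 * 60
PROCESSING_SECONDS = 30

total_reservations = SEATS + EXPIRED_RESERVATIONS
storage_mb = total_reservations * BYTES_PER_RESERVATION / 1_000_000
recycled_seats = round(SEATS * ABANDON_RATE)
recycle_minutes = (TTL_SECONDS + PROCESSING_SECONDS) / 60

print(f"reservations per event: {total_reservations:,}")
print(f"storage per event: {storage_mb:.0f} MB")
print(f"seats recycled: {recycled_seats:,}")
print(f"average recycling time: {recycle_minutes} min")
```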
Naive Design
Single PostgreSQL database, REST API.
Schema:
CREATE TABLE events (event_id, venue_id, name, start_time, sale_start);
CREATE TABLE seats (seat_id, event_id, section, row, number, price_tier, status);
-- status: AVAILABLE, RESERVED, SOLD
CREATE TABLE reservations (
reservation_id, user_id, seat_id, event_id,
status, -- PENDING, CONFIRMED, EXPIRED
expires_at,
created_at
);
Reservation flow:
BEGIN;
SELECT * FROM seats WHERE seat_id = :seat AND status = 'AVAILABLE' FOR UPDATE;
-- If available:
UPDATE seats SET status = 'RESERVED' WHERE seat_id = :seat;
INSERT INTO reservations (...) VALUES (...);
COMMIT;
Payment completion:
BEGIN;
UPDATE seats SET status = 'SOLD' WHERE seat_id = :seat;
UPDATE reservations SET status = 'CONFIRMED' WHERE reservation_id = :res;
COMMIT;
TTL expiration (cron job every minute):
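-- Run both statements in one transaction so NOW() is the same for both
BEGIN;
UPDATE seats SET status = 'AVAILABLE'
WHERE status = 'RESERVED'  -- never flip a SOLD seat back
  AND seat_id IN (
    SELECT seat_id FROM reservations
    WHERE status = 'PENDING' AND expires_at < NOW()
  );
UPDATE reservations SET status = 'EXPIRED'
WHERE status = 'PENDING' AND expires_at < NOW();
COMMIT;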
This works for the local band playing to 500 people. For Taylor Swift, this database is dead in the first second.
Where It Breaks
Problem 1: SELECT FOR UPDATE Serializes Seat Contention
When 500 users try to book seat A7 simultaneously, SELECT ... FOR UPDATE acquires a row-level lock. 499 users wait for the lock, one gets the seat, 499 retry on another seat. But they all retry on A8, because the seat map showed A8 as available before A7 was locked. Now A8 is serialized. Every hot seat becomes a bottleneck. Database throughput drops to hundreds of transactions per second.
Problem 2: 14M Users Cannot Even Load the Page
Before any booking logic runs, 14 million users are hitting your web servers to load the seat map page. At 100 KB per page load, that is 1.4 TB of bandwidth in seconds. Your load balancer melts before the application servers even see a request.
Problem 3: Stale Seat Maps
The interactive seat map shows available seats. But between the time the map loads and the user clicks "Reserve," dozens of seats have been taken. The user clicks, gets "seat unavailable," clicks another, "unavailable" again. Terrible UX.
Problem 4: Payment Timeout Cascade
When seats expire after 10 minutes, they all become available simultaneously (everyone reserved at the same time, so TTLs expire at the same time). This creates a second thundering herd: the waitlisted users all rush for the released seats.
Problem 5: No Fairness
First-come-first-served means users with faster internet connections, browser auto-clickers, and bot scripts get tickets. Genuine fans with normal hardware lose out. The Eras Tour debacle was partly caused by bot traffic overwhelming the system.
Real Design

Architecture Overview
┌─────────────────────────────────────────────┐
│              CDN (CloudFront)               │
│  Serves: static pages, JS, images, waiting  │
│  room HTML during high-demand events        │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────┴──────────────────────────┐
│            Virtual Waiting Room             │
│  (Serverless: Lambda@Edge or CloudFlare     │
│  Workers) - assigns queue position,         │
│  rate-limits entry to booking flow          │
└──────────────────┬──────────────────────────┘
                   │ (controlled flow: 500 users/sec)
┌──────────────────┴──────────────────────────┐
│             Booking API Servers             │
│  (Stateless, auto-scaled, behind ALB)       │
└─────────┬───────────────────┬───────────────┘
          │                   │
┌─────────┴─────────┐  ┌──────┴──────────────┐
│  Seat Inventory   │  │  Payment Service    │
│  (Redis cluster)  │  │  (Stripe/Adyen)     │
│  Optimistic lock  │  │                     │
│  with TTL         │  │                     │
└─────────┬─────────┘  └─────────────────────┘
          │
┌─────────┴─────────┐
│  Booking DB       │
│  (PostgreSQL)     │
│  Source of truth  │
└───────────────────┘
Component 1: Virtual Waiting Room
The waiting room is the most important component. It exists to solve one problem: turn 14 million simultaneous users into a controlled flow of 500 users per second.
How it works:
- Pre-queue phase (15 min before sale): Users arrive and are placed in a "lobby." They see a countdown timer. No queue position yet. This is a static CDN-served page. Zero backend load.
- Randomization (at sale start): All users who arrived before the sale start time are shuffled randomly and assigned queue positions. This is the fairness mechanism -- arriving 10 minutes early gives you the same chance as arriving 1 minute early. It defeats bots that try to be "first" by arriving microseconds after the sale opens.
- Controlled release: Users are admitted to the booking flow in queue order, at a rate the booking system can handle (~500/sec). Each admitted user gets a signed JWT token with an expiration time (15 minutes to complete their booking).
- Queue page: Users waiting in the queue see their position and estimated wait time. This page is served from CDN with JavaScript that polls a lightweight status endpoint (Redis-backed, returns position + estimated time, costs ~0.1 ms per request).
Why randomization matters: Without it, the queue is first-come-first-served based on network latency. Users closer to CDN edge nodes get lower queue positions. Users with faster internet win. Bots that open 1,000 connections at T-0 get 1,000 positions near the front. Randomization equalizes everyone who arrived before the sale start.
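A minimal sketch of the pre-queue shuffle, assuming the lobby is simply a list of user IDs collected before the sale opens. The function name `assign_queue_positions` and the explicit `seed` parameter are illustrative and not taken from any real waiting-room product; a production system would likely draw randomness from a secure source rather than a fixed seed.

```python
import random

def assign_queue_positions(lobby_user_ids, seed=None):
    """Shuffle everyone who arrived before sale start and number them from 1.

    Arrival order inside the lobby is deliberately ignored, so arriving
    10 minutes early and arriving 1 minute early give the same odds.
    """
    rng = random.Random(seed)
    shuffled = list(lobby_user_ids)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return {user_id: position for position, user_id in enumerate(shuffled, start=1)}

# Arrival order: u1 arrived first, u5 last.
lobby = ["u1", "u2", "u3", "u4", "u5"]
positions = assign_queue_positions(lobby, seed=42)
for user_id in lobby:
    print(user_id, "-> position", positions[user_id])
```

Users who arrive after the sale has started would simply be appended behind the shuffled group, which is an assumption of this sketch rather than something the shuffle itself enforces.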
Implementation: The waiting room runs on CDN edge compute (CloudFlare Workers, Lambda@Edge). It never touches your application servers. Queue state is stored in a globally distributed key-value store (CloudFlare KV, DynamoDB Global Tables). The admission controller is a simple rate limiter that increments an atomic counter and admits the next N users when capacity is available.
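One way to picture the admission controller, assuming a single process and an in-memory queue instead of edge workers and a distributed key-value store. `AdmissionController`, `issue_token`, `verify_token`, and `SECRET` are hypothetical names, and the HMAC-signed token is a simplified stand-in for a real JWT.

```python
import base64
import hashlib
import hmac
import json
from collections import deque

SECRET = b"demo-secret"        # assumption: a real deployment would use a managed key
TOKEN_TTL_SECONDS = 15 * 60    # admitted users get 15 minutes to finish booking

def issue_token(user_id, now):
    payload = json.dumps({"sub": user_id, "exp": now + TOKEN_TTL_SECONDS}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(payload).decode() + "." + base64.urlsafe_b64encode(sig).decode()

def verify_token(token, now):
    """Return the user id if the signature is valid and the token has not expired."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    return claims["sub"] if claims["exp"] > now else None

class AdmissionController:
    def __init__(self, queue_in_order, rate_per_sec=500):
        self.queue = deque(queue_in_order)
        self.rate_per_sec = rate_per_sec

    def admit_tick(self, now):
        """Called once per second: admit at most rate_per_sec users in queue order."""
        admitted = {}
        for _ in range(min(self.rate_per_sec, len(self.queue))):
            user_id = self.queue.popleft()
            admitted[user_id] = issue_token(user_id, now)
        return admitted

controller = AdmissionController([f"user{i}" for i in range(1200)], rate_per_sec=500)
first = controller.admit_tick(now=0)
second = controller.admit_tick(now=1)
third = controller.admit_tick(now=2)
print(len(first), len(second), len(third))  # 500 500 200
```

The booking API would call something like `verify_token` on every request, so a user who never passed through the queue has no valid token to present.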
Component 2: Seat Reservation with TTL (Redis)
Redis is the seat inventory engine for high-demand events.
Data model in Redis:
# Seat status: available seats stored in a set
Key: event:{event_id}:available_seats
Type: Set
Members: ["A1", "A2", "A3", ..., "Z30"]
# Reservation with TTL
Key: reservation:{event_id}:{seat_id}
Type: String
Value: user_id
TTL: 600 seconds (10 minutes)
Reservation operation (atomic Lua script):
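-- Atomic: check + reserve in one operation
-- KEYS[1] = event:{event_id}:available_seats
-- KEYS[2] = reservation:{event_id}:{seat_id}
-- ARGV[1] = seat_id, ARGV[2] = user_id, ARGV[3] = TTL in seconds
local seat_key = KEYS[1]
local res_key = KEYS[2]
-- Remove the seat from the available set; 0 means it was already taken
local removed = redis.call("SREM", seat_key, ARGV[1])
if removed == 0 then
  return "SEAT_UNAVAILABLE"
end
-- Create reservation with TTL; NX refuses to overwrite a live hold
local ok = redis.call("SET", res_key, ARGV[2], "EX", tonumber(ARGV[3]), "NX")
if not ok then
  return "SEAT_UNAVAILABLE"
end
return "RESERVED"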
Why Lua script? Redis Lua scripts are atomic -- no other command can execute between the SREM and SET. This eliminates the race condition where two users both see a seat as available and both try to reserve it.
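TTL expiration handling: When a reservation TTL expires, Redis deletes the key automatically, but the seat must still be returned to the available set. Use Redis keyspace notifications (enabled with the notify-keyspace-events setting). These are fire-and-forget Pub/Sub messages, and an expired event fires when Redis actually deletes the key rather than at the exact moment the TTL reaches zero, so pair the subscriber with a periodic sweep that reconciles the available set against the source of truth: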
Subscribe to __keyevent@0__:expired
When reservation:{event_id}:{seat_id} expires:
SADD event:{event_id}:available_seats seat_id
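The reserve-then-expire cycle can be modelled without Redis. The sketch below is an in-process stand-in, assuming a single thread and an explicit `now` argument instead of a real clock; `SeatInventory`, `reserve`, `confirm`, and `release_expired` are illustrative names. It mirrors the Lua script (remove from the available set, then record a hold with an expiry) and the expiry handler (return the seat to the set).

```python
class SeatInventory:
    def __init__(self, seat_ids):
        self.available = set(seat_ids)
        self.holds = {}          # seat_id -> (user_id, expires_at)

    def reserve(self, seat_id, user_id, now, ttl=600):
        # Mirrors SREM: only one caller can remove a given seat from the set.
        if seat_id not in self.available:
            return "SEAT_UNAVAILABLE"
        self.available.remove(seat_id)
        self.holds[seat_id] = (user_id, now + ttl)
        return "RESERVED"

    def confirm(self, seat_id, user_id):
        # Payment succeeded: drop the hold so expiry can never release a sold seat.
        hold = self.holds.get(seat_id)
        if hold is None or hold[0] != user_id:
            return False
        del self.holds[seat_id]
        return True

    def release_expired(self, now):
        # Plays the role of the expiry handler: put the seat back in the set.
        expired = [s for s, (_, exp) in self.holds.items() if exp <= now]
        for seat_id in expired:
            del self.holds[seat_id]
            self.available.add(seat_id)
        return sorted(expired)

inv = SeatInventory(["A1", "A2"])
print(inv.reserve("A1", "alice", now=0))    # RESERVED
print(inv.reserve("A1", "bob", now=1))      # SEAT_UNAVAILABLE
print(inv.reserve("A2", "bob", now=1))      # RESERVED
print(inv.confirm("A2", "bob"))             # True
print(inv.release_expired(now=600))         # ['A1']
print(inv.reserve("A1", "carol", now=601))  # RESERVED
```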
Staggered TTLs: To prevent the "all seats released simultaneously" problem, add random jitter to the TTL: TTL = 600 + random(0, 60) seconds. This spreads seat releases over a 1-minute window instead of a single instant.
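A small sketch of the jittered TTL, assuming whole-second TTLs and Python's `random` module; `jittered_ttl` is an illustrative name.

```python
import random

BASE_TTL_SECONDS = 600
MAX_JITTER_SECONDS = 60

def jittered_ttl(rng=random):
    """Base hold time plus a uniform 0-60 s offset, so holds taken in the
    same second do not all expire in the same second."""
    return BASE_TTL_SECONDS + rng.randint(0, MAX_JITTER_SECONDS)

rng = random.Random(7)
ttls = [jittered_ttl(rng) for _ in range(10_000)]
print(min(ttls), max(ttls))
```

The jitter only ever lengthens a hold, so a countdown that shows the base 10 minutes to the user would remain a safe lower bound under this scheme.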
Component 3: Optimistic Locking for Normal Events
For the 99% of events with low contention, you do not need Redis or a waiting room. PostgreSQL with optimistic locking is sufficient and simpler to operate.
Version-based optimistic locking:
-- Each seat has a version column
SELECT seat_id, version FROM seats
WHERE event_id = :event AND seat_id = :seat AND status = 'AVAILABLE';
-- User selects the seat; :version is the value read above (e.g. 1):
UPDATE seats SET status = 'RESERVED', version = version + 1
WHERE seat_id = :seat AND version = :version AND status = 'AVAILABLE';
-- If UPDATE affects 0 rows, someone else got the seat. Retry with a different seat.
Why optimistic over pessimistic? Optimistic locking holds no database lock between reading a seat and updating it; the only lock is the brief row lock taken by the UPDATE itself. Under low contention (which is 99% of events), the first UPDATE succeeds and there is no retry. Under high contention, retry rate increases, but you have already redirected high-contention events to the Redis path. The two-tier model: PostgreSQL optimistic for calm events, Redis atomic for hot events.
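To see the version check reject a stale writer, here is a self-contained sketch that uses Python's built-in sqlite3 in place of PostgreSQL. The table is a cut-down version of the seats schema with an added version column; `read_version` and `try_reserve` are illustrative helpers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats (seat_id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO seats VALUES ('A7', 'AVAILABLE', 1)")
conn.commit()

def read_version(seat_id):
    row = conn.execute(
        "SELECT version FROM seats WHERE seat_id = ? AND status = 'AVAILABLE'", (seat_id,)
    ).fetchone()
    return row[0] if row else None

def try_reserve(seat_id, seen_version):
    """Succeeds only if nobody changed the row since seen_version was read."""
    cur = conn.execute(
        "UPDATE seats SET status = 'RESERVED', version = version + 1 "
        "WHERE seat_id = ? AND version = ? AND status = 'AVAILABLE'",
        (seat_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1

# Two users load the seat map and both see A7 at version 1.
alice_version = read_version("A7")
bob_version = read_version("A7")

print(try_reserve("A7", alice_version))  # True: first writer wins
print(try_reserve("A7", bob_version))    # False: version is now 2, 0 rows updated
```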
Component 4: Payment Failure Handling
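The 10-minute payment window introduces four scenarios, three of them failures:
Scenario 1: Payment succeeds. Happy path. Mark reservation as CONFIRMED, seat as SOLD in the database. Then delete the Redis reservation key explicitly. Do not let it expire: the expiry handler would add the sold seat back to the available set unless it first checks the seat's status in the database.
Scenario 2: Payment fails (card declined). Notify the user and offer them 2 minutes to try a different card while the hold is still active. If the retry also fails or the 2 minutes run out, release the seat immediately by deleting the reservation key and adding the seat back to the available set in Redis.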
Scenario 3: Payment timeout (user abandons). Redis TTL expires. Seat returns to available pool via keyspace notification. User's reservation is marked EXPIRED. No further action needed.
Scenario 4: Payment is ambiguous (network timeout to payment gateway). This is the hard one. You sent the charge to Stripe but got no response. Did the charge go through? You do not know. Solutions:
- Idempotency key: Every payment request includes an idempotency key (reservation_id). If you retry the request with the same key, Stripe returns the same result without double-charging.
- Reconciliation: A background job queries Stripe for the status of pending payments every 60 seconds. If the charge succeeded, confirm the reservation. If it failed, release the seat.
- Default to release: If after 3 retries and 2 minutes you still cannot determine payment status, release the seat, and have reconciliation void or refund the charge if it later turns out to have succeeded. Better to lose a sale than to double-charge a customer or hold a phantom reservation.
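The ambiguous-timeout handling can be sketched with a fake gateway, assuming the gateway deduplicates on an idempotency key as described above. `FakeGateway`, `GatewayTimeout`, `charge_with_retries`, and the simulated lost reply are all hypothetical; a real integration would call the payment provider's own SDK.

```python
class GatewayTimeout(Exception):
    pass

class FakeGateway:
    """Remembers results by idempotency key and drops the first N replies."""
    def __init__(self, timeouts_before_reply=1):
        self.results = {}                 # idempotency_key -> result
        self.charges_made = 0
        self.timeouts_left = timeouts_before_reply

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key not in self.results:
            self.charges_made += 1        # money moves at most once per key
            self.results[idempotency_key] = {"status": "succeeded", "amount": amount_cents}
        if self.timeouts_left > 0:
            self.timeouts_left -= 1
            raise GatewayTimeout()        # the charge happened, but the reply was lost
        return self.results[idempotency_key]

def charge_with_retries(gateway, reservation_id, amount_cents, max_attempts=3):
    """Retry with the reservation id as the idempotency key; give up after max_attempts."""
    for _ in range(max_attempts):
        try:
            result = gateway.charge(reservation_id, amount_cents)
        except GatewayTimeout:
            continue
        return "CONFIRM" if result["status"] == "succeeded" else "RELEASE"
    return "RELEASE"                      # status still unknown: default to release

gateway = FakeGateway(timeouts_before_reply=1)
print(charge_with_retries(gateway, "res-123", 15_000))  # CONFIRM
print(gateway.charges_made)                              # 1
```

When every attempt times out the function returns "RELEASE" even though the fake gateway did record a charge, which is exactly the case the background reconciliation job exists to catch.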
Component 5: CDN-Served Static Pages During Crush
During the initial rush (first 5 minutes of a hot sale), the seat map is useless -- seats are being taken faster than the map can refresh. Instead:
- Replace the interactive seat map with a "best available" flow. User selects quantity (2 tickets) and price tier (Floor, Lower Bowl, Upper Deck). The system assigns the best available seats automatically. This eliminates contention on specific seats and reduces the booking flow to a single atomic operation: "Give me 2 floor seats."
- Serve the waiting room page, the queue status page, and the "best available" selection page from CDN. Only the actual reservation API call hits your servers. Everything else is static HTML + JavaScript.
- Rate limit API calls per admitted user. Each JWT from the waiting room allows exactly 1 reservation attempt per 5 seconds. This prevents users from hammering the API.
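The per-token limit of one reservation attempt every 5 seconds could look like the sketch below, assuming a single API process and an in-memory dict (a shared store would be needed once there are several servers). `ReservationRateLimiter` and `allow` are illustrative names, and timestamps are passed in explicitly to keep the example deterministic.

```python
class ReservationRateLimiter:
    def __init__(self, min_interval_seconds=5.0):
        self.min_interval = min_interval_seconds
        self.last_attempt = {}   # token id -> timestamp of the last allowed attempt

    def allow(self, token_id, now):
        last = self.last_attempt.get(token_id)
        if last is not None and now - last < self.min_interval:
            return False         # too soon: reject without touching the inventory
        self.last_attempt[token_id] = now
        return True

limiter = ReservationRateLimiter()
print(limiter.allow("token-a", now=100.0))  # True
print(limiter.allow("token-a", now=102.0))  # False: only 2 s since the last attempt
print(limiter.allow("token-b", now=102.0))  # True: separate token
print(limiter.allow("token-a", now=105.0))  # True: 5 s have passed
```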
Deep Dives

Deep Dive 1: The Two-Tier Event Model
Not all events are created equal. Treating a 200-person comedy show the same as a Taylor Swift stadium show wastes engineering effort on one end and causes outages on the other.
Tier classification:
Tier 1 (Normal): expected_demand / capacity < 5x
- Use PostgreSQL with optimistic locking
- No waiting room
- Interactive seat map
- Standard auto-scaling
Tier 2 (Hot): expected_demand / capacity >= 5x
- Use Redis seat inventory
- Virtual waiting room with randomization
- "Best available" seat assignment
- Pre-provisioned infrastructure (no auto-scaling -- too slow)
How to predict demand: Pre-registration count. If 500,000 users register interest for a 70,000-seat event, it is Tier 2. This is known days in advance. Ticketmaster uses pre-registration data (their "Verified Fan" program) to estimate demand and pre-provision infrastructure.
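The tier rule reduces to a single ratio check. A minimal sketch, assuming the pre-registration count is used as expected demand; `classify_tier` and `hot_ratio` are illustrative names.

```python
def classify_tier(expected_demand, capacity, hot_ratio=5.0):
    """Return 1 for normal events and 2 for hot events, using the 5x threshold."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 2 if expected_demand / capacity >= hot_ratio else 1

print(classify_tier(expected_demand=500_000, capacity=70_000))  # 2 (about 7.1x)
print(classify_tier(expected_demand=800, capacity=500))         # 1 (1.6x)
```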
Pre-provisioning vs. auto-scaling: For Tier 2 events, you do NOT rely on auto-scaling. Auto-scaling reacts to load, which means the first 30 seconds of the sale hit under-provisioned servers. Instead, pre-provision based on the pre-registration count: spin up N servers, warm up Redis, pre-load the seat inventory, and scale down after the sale ends. Amazon does the same for Prime Day -- they pre-provision capacity weeks in advance.
Unique Resources vs. Fungible Inventory
This Redis lock + TTL pattern works because seats are unique resources -- only one person can hold seat A8. The locking must be per-item because the item itself is the scarce resource. But for quantity-based (fungible) inventory -- e.g., 500 units of a grocery item, or general admission tickets without assigned seats -- distributed locks are overkill. A simple UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity > 0 with the database's native row-level locking is simpler, cheaper, and sufficient. The WHERE quantity > 0 clause prevents overselling, and the single-row update serializes naturally without any external lock coordination. Choose your concurrency control based on whether the resource is unique or fungible -- it is a common mistake to reach for Redis distributed locks when a database atomic decrement is all you need.
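The conditional decrement is easy to try out. The sketch below uses Python's built-in sqlite3 as a stand-in for the production database, with a hypothetical `inventory` table holding two general admission tickets; `buy_one` is an illustrative helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")
conn.execute("INSERT INTO inventory VALUES ('ga-ticket', 2)")
conn.commit()

def buy_one(product_id):
    """Atomic conditional decrement: succeeds only while stock remains."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity > 0",
        (product_id,),
    )
    conn.commit()
    return cur.rowcount == 1

results = [buy_one("ga-ticket") for _ in range(3)]
print(results)    # [True, True, False]
remaining = conn.execute("SELECT quantity FROM inventory").fetchone()[0]
print(remaining)  # 0
```

The third purchase updates zero rows, so the quantity never goes negative and no external lock is involved.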
Deep Dive 2: Preventing Bots and Scalpers
Bots are the plague of ticket sales. They use automation to grab tickets faster than humans and resell at 5-10x markup.
Defense layers:
- CAPTCHA at waiting room entry: Solve a CAPTCHA to enter the pre-queue. This blocks simple scripts. Cost: $3 per 1,000 verifications (reCAPTCHA Enterprise).
- Device fingerprinting: Identify multiple accounts from the same device or browser. Flag and deprioritize. Ticketmaster's "Verified Fan" uses phone number verification + identity checks.
- Purchase limits: Max 4 tickets per account per event. Enforced at the reservation level, not just the UI.
- Behavioral analysis: Bots click faster, have predictable mouse movements, and do not scroll naturally. Track interaction patterns during the queue and flag suspicious sessions.
- Queue position manipulation detection: If someone creates 100 accounts to get 100 queue positions, detect the pattern (same IP, same device fingerprint, accounts created in rapid succession) and invalidate all but one.
The hard truth: No system perfectly stops bots. The goal is to raise the cost of bot operation above the profit margin of resale. If a bot operator needs to solve 100 CAPTCHAs, verify 100 phone numbers, and use 100 different devices to get 100 tickets, the operational cost might exceed the resale profit.
Deep Dive 3: Seat Map Consistency Under Load
The interactive seat map must reflect reality, but at 500 reservations per second, the map is stale before it renders.
Approach 1: Eventual consistency with WebSocket updates.
- Initial seat map is loaded from a Redis snapshot (consistent at load time).
- A WebSocket connection pushes seat status changes in real time.
- Client receives { seat: "A7", status: "reserved" } and grays out A7.
- Latency: 50-200 ms from reservation to client update.
- Problem: at 500 events/sec, WebSocket bandwidth is 500 * 50 bytes = 25 KB/sec per client. With 10,000 clients viewing the map, that is 250 MB/sec of WebSocket traffic. Manageable with a pub/sub layer (Redis Pub/Sub or a WebSocket gateway).
Approach 2: Polling with server-side rendering.
- Client polls every 2 seconds for the current seat map.
- Server returns a compressed bitmap: 1 = available, 0 = taken. For 70,000 seats: 70,000 bits = 8.75 KB per response.
- 10,000 clients * 8.75 KB * 0.5 requests/sec = 44 MB/sec. Very manageable.
- Staler than WebSocket (up to 2 seconds), but simpler and more bandwidth-efficient.
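The bitmap size and polling bandwidth can be checked directly. A minimal sketch, assuming seats are numbered 0 to 69,999 and packed one bit per seat with no further compression; `encode_seat_bitmap` and `is_available` are illustrative names.

```python
def encode_seat_bitmap(total_seats, available_indexes):
    """Pack availability into bytes: bit i is 1 if seat i is available."""
    bitmap = bytearray((total_seats + 7) // 8)
    for i in available_indexes:
        bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def is_available(bitmap, seat_index):
    return bool(bitmap[seat_index // 8] & (1 << (seat_index % 8)))

TOTAL_SEATS = 70_000
bitmap = encode_seat_bitmap(TOTAL_SEATS, available_indexes=[0, 9, 69_999])
print(len(bitmap))               # 8750 bytes, i.e. 8.75 KB
print(is_available(bitmap, 9))   # True
print(is_available(bitmap, 10))  # False

# Bandwidth for 10,000 clients polling every 2 seconds (0.5 requests/sec each).
mb_per_sec = 10_000 * len(bitmap) * 0.5 / 1_000_000
print(mb_per_sec)                # 43.75
```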
Approach 3: "Best available" (no seat map).
For Tier 2 events, skip the seat map entirely. Show sections (Floor, Lower Bowl) with available counts. User picks a section and quantity. System assigns best available seats. This eliminates the seat map consistency problem entirely and reduces the booking flow to one API call. Ticketmaster uses this for most high-demand events.
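A toy version of the seat assignment step, assuming "best" means the frontmost row that has enough adjacent free seats. That ranking rule, the `best_available` name, and the dict-of-rows layout are assumptions made for illustration; real venues would rank seats by their own criteria.

```python
def best_available(rows, quantity):
    """Return the first run of `quantity` adjacent free seats, scanning rows
    front to back. `rows` maps row label -> list of booleans (True = free)."""
    for row_label, seats in rows.items():
        run_start, run_length = 0, 0
        for index, free in enumerate(seats):
            if free:
                if run_length == 0:
                    run_start = index
                run_length += 1
                if run_length == quantity:
                    # Seat numbers are 1-based for display.
                    return [(row_label, run_start + offset + 1) for offset in range(quantity)]
            else:
                run_length = 0
    return None

floor = {
    "A": [False, True, False, True],   # no two adjacent free seats
    "B": [True, False, True, True],    # seats 3 and 4 are adjacent and free
}
print(best_available(floor, 2))  # [('B', 3), ('B', 4)]
print(best_available(floor, 3))  # None
```

In the real flow the chosen seats would then be reserved in one atomic step, so two users can never be handed the same pair.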
Alternative Designs
| Approach | Pros | Cons | When to Use |
|---|---|---|---|
| Redis atomic + waiting room (described above) | Handles 14M users. Zero overselling. Fair randomization. | Complex infrastructure. Redis is a SPOF for seat state. | Tier 2 events: Taylor Swift, World Cup, Super Bowl |
| PostgreSQL optimistic locking | Simple. ACID compliant. Battle-tested. | Cannot handle > 5K concurrent buyers. Row-level contention kills throughput. | Tier 1 events: 99% of all events. |
| Distributed lock (ZooKeeper/etcd) | Strong consistency. Explicit lock/unlock semantics. | ZooKeeper is not designed for high-throughput locking. 10K locks/sec max. Lock management overhead. | Rarely appropriate for ticket booking. |
| Message queue (SQS/Kafka) for booking requests | Absorbs burst. Process in order. Guaranteed delivery. | Users wait in a queue without knowing if they will get tickets. Latency is seconds, not milliseconds. | When fairness matters more than speed. |
| Blockchain-based tickets (NFT) | Prevents counterfeiting. On-chain ownership verification. | Slow (seconds to confirm). Expensive (gas fees). Terrible UX. Does not solve the booking problem. | Marketing buzzword. Not a real solution. |
Scaling Math Verification
Virtual Waiting Room
Users in pre-queue: 500,000 (Tier 2 event)
Queue state per user: ~200 bytes (user_id, position, token, expiry)
Total queue state: 500K * 200 = 100 MB (fits in one Redis instance)
Admission rate: 500 users/sec
Time to clear 500K queue: 500K / 500 = 1,000 sec = ~17 min
Seats available: 70,000
Avg booking time: 3 min (select seats + pay)
Concurrent users in booking: 500/sec * 180 sec = 90,000 max
Booking API servers needed: 90K concurrent users / 1K concurrent users per server = 90 servers (pre-provisioned)
Redis Seat Inventory
Seats per event: 70,000
Redis set memory: 70K * 10 bytes (seat ID) = 700 KB
Reservation keys: 70K * 100 bytes = 7 MB
Total per event: ~8 MB in Redis (trivial)
Operations/sec: 500 reservations/sec * 2 ops each = 1,000 ops/sec
Redis single-node capacity: 100K ops/sec
Conclusion: One Redis instance handles seat inventory easily.
Replicate 3x for durability and failover.
Payment Processing
Reservations/sec: 500 (rate-limited by waiting room)
Payment processing time: 3-8 seconds per transaction (Stripe/Adyen)
Concurrent payment calls: 500 * 5 sec avg = 2,500 concurrent
Payment service threads: 2,500 / 50 threads/server = 50 servers
Payment timeout handling: 500 * 0.2 (failure rate) = 100 releases/sec
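The waiting room and payment estimates above can be recomputed as a sanity check. This is arithmetic only; the constant names are illustrative and every input comes from the estimates in this section.

```python
ADMISSION_RATE = 500              # users admitted per second
QUEUE_SIZE = 500_000
AVG_BOOKING_SECONDS = 3 * 60
USERS_PER_API_SERVER = 1_000      # per-server capacity assumed in the estimate
AVG_PAYMENT_SECONDS = 5
THREADS_PER_PAYMENT_SERVER = 50
ABANDON_RATE = 0.2

queue_clear_minutes = QUEUE_SIZE / ADMISSION_RATE / 60
concurrent_in_booking = ADMISSION_RATE * AVG_BOOKING_SECONDS
api_servers = concurrent_in_booking // USERS_PER_API_SERVER
concurrent_payments = ADMISSION_RATE * AVG_PAYMENT_SECONDS
payment_servers = concurrent_payments // THREADS_PER_PAYMENT_SERVER
releases_per_sec = ADMISSION_RATE * ABANDON_RATE

print(f"time to clear queue: {queue_clear_minutes:.1f} min")
print(f"concurrent users in booking: {concurrent_in_booking:,}")
print(f"booking API servers: {api_servers}")
print(f"concurrent payment calls: {concurrent_payments:,}")
print(f"payment servers: {payment_servers}")
print(f"seat releases per second: {releases_per_sec:.0f}")
```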
Failure Analysis
| Failure | Impact | Mitigation |
|---|---|---|
| Redis seat inventory crashes | Cannot make new reservations. Users admitted from queue see errors. | Redis replication with automatic failover. Rebuild from PostgreSQL source of truth if needed (takes 30 sec for 70K seats). Pause queue admission during recovery. |
| Waiting room goes down | Users cannot enter the queue. Direct traffic hits booking servers, which collapse. | Waiting room runs on CDN edge (CloudFlare Workers, Lambda@Edge). Multi-region. If edge goes down, fall back to a simple rate limiter at the ALB. |
| Payment gateway timeout | Users wait, then get "payment failed." Seat is held until TTL expires. | Retry with idempotency key. Show "processing" status. Background reconciliation. Default to release after 3 retries. |
| Database (PostgreSQL) goes down | Source of truth is unavailable. Cannot confirm bookings. | Synchronous replication to standby. Automatic failover. Bookings are buffered in Redis until DB recovers. |
| TTL expiration storm | All seats from the first wave expire at once, causing a second thundering herd. | Staggered TTLs: base TTL + random jitter. Spread releases over 60 seconds instead of 1 second. |
| Bot traffic bypasses CAPTCHA | Bots grab disproportionate share of tickets. Genuine fans locked out. | Defense in depth: CAPTCHA + device fingerprint + purchase limits + behavioral analysis. No single layer is sufficient. |
| CDN goes down during sale | Users cannot load the queue page or seat selection page. | Multi-CDN strategy (CloudFront + Fastly). DNS-level failover. Static HTML fallback served from origin servers. |
Level Expectations
| Level | What the Interviewer Expects |
|---|---|
| Mid (L4) | Database with seat status column and SELECT FOR UPDATE. Knows it does not scale. Mentions caching. Basic payment flow. Handles the happy path. |
| Senior (L5) | Virtual waiting room concept. Redis for seat inventory with atomic operations. TTL-based reservation expiration. Two-tier event model (normal vs. hot). Payment failure handling with idempotency. Bot prevention as a concern. Quantified math for one high-demand event. |
| Staff+ (L6) | Pre-queue randomization for fairness. CDN-served static pages during crush. "Best available" vs. interactive seat map trade-off analysis. Staggered TTLs to prevent expiration storms. Pre-provisioning vs. auto-scaling decision. Detailed payment ambiguity handling (idempotency keys + reconciliation). Reference to Ticketmaster Eras Tour failure and specific lessons learned. Two-tier contention model with quantified thresholds. |