Skip to content

Interview Cheat Sheet

TL;DR

Every key decision, comparison table, and interview phrase from the entire Data Modelling module — consolidated into one page. Review this the night before your interview.

Database Selection Flowchart

Database selection flowchart: decision tree for choosing between PostgreSQL, NewSQL, Cassandra, Redis, Elasticsearch, and Neo4j

Key phrase: "I'll default to PostgreSQL unless a specific access pattern demands otherwise."

Schema Design Checklist

  1. List core entities (nouns from requirements)
  2. Define primary keys (UUID or Snowflake ID for distributed systems)
  3. Draw relationships (1:N → FK on "many" side, M:N → junction table)
  4. Identify hot read paths → these drive indexing and denormalization
  5. Normalize to 3NF, then denormalize strategically for hot paths

Key phrase: "I'll start with a normalized schema and denormalize only where the primary access pattern demands it."

Normalization vs Denormalization

Normalize Denormalize
Optimize for Writes, consistency Reads, speed
Data redundancy None Intentional duplication
Update cost Change once Change in multiple places
Best when Data changes frequently Data is read-heavy and rarely changes

Indexing Decision Tree

Indexing decision tree: check EXPLAIN, add indexes for sequential scans, use composite indexes, covering indexes, or remove unnecessary indexes

Key phrase: "I'll add an index on (user_id, created_at) since the primary query is fetching a user's recent posts."

The Scaling Ladder

Before sharding, exhaust every simpler option:

# Step What it solves
1 Indexing Solves 80% of performance problems
2 Read Replicas Handle read-heavy traffic
3 Connection Pooling PgBouncer for database protection
4 Caching (Redis) Hot data and computed results
5 Partitioning Split large tables within one database
6 Sharding Split across multiple databases (last resort)
7 NewSQL CockroachDB / Spanner (automatic sharding)

Go top-to-bottom. Most apps never need to go past step 4.

Key phrase: "I'd exhaust simpler options first — indexes, replicas, caching — before introducing the complexity of sharding."

Sharding Quick Reference

Strategy How Best For Watch Out
Hash-based shard = hash(key) % N Even distribution Resharding moves most keys (use consistent hashing)
Range-based Key ranges per shard Time-series, multi-tenant Hotspots on active ranges
Directory-based Lookup table Maximum flexibility SPOF, lookup latency

Good shard keys: user_id, tenant_id, order_id (high cardinality, even distribution, query-aligned)

Bad shard keys: boolean, timestamp (hot partition), low-cardinality enum

Key phrase: "I'd shard by user_id using hash-based sharding with consistent hashing to minimize data movement during resharding."

Replication Models

Model Writes Consistency Best For
Leader-follower Leader only Strong (leader), eventual (followers) Most applications — the default
Multi-leader Any leader Eventual (conflict resolution) Multi-region writes, offline-first apps
Leaderless (quorum) Any node (W nodes) Tunable (W+R>N for strong) High write availability, no SPOF

Quorum formula: W + R > N — ensures at least one overlapping node between read and write sets.

Key phrase: "I'd use leader-follower replication with async followers for read scaling, and route the user's own reads to the leader for read-after-write consistency."

Concurrency Control

Technique When How
SELECT FOR UPDATE High contention (last ticket) Lock rows before reading — second transaction waits
Optimistic locking (version) Low contention (profile edits) Check version at write time, retry on conflict
MVCC / Snapshot isolation Long-running reads Multiple row versions, readers don't block writers
Serializable isolation Must prevent all race conditions Strongest but slowest — use for financial transactions

Key phrase: "For the double-booking problem, I'd use SELECT FOR UPDATE to lock the row. For low-contention updates, optimistic locking with a version column."

Consistency Models

Model Guarantee Use When Example
Strong All reads see latest write Money, tickets, inventory Spanner, single-leader PostgreSQL
Causal Related events in order Comments after posts Some CRDTs
Read-your-writes You see your own updates Profile changes Read from primary after write
Eventual All replicas converge over time Social feeds, DNS, CDN Cassandra, DynamoDB (default)

Key phrase: "Would inconsistent data be catastrophic? If yes, strong consistency. If no, eventual consistency with read-your-writes for the user's own data."

Distributed Transaction Patterns

Pattern How When Cost
2PC Coordinator: prepare → commit Single database, multi-statement Blocking, slow
Saga Sequence of local transactions + compensating actions Microservices, cross-service operations Complexity, eventual consistency
Outbox Write event to outbox table in same transaction Reliable event publishing Extra table, CDC/polling

Key phrase: "I'd use the Saga pattern with compensating transactions — if payment fails, reverse the inventory reservation."

Query Performance Patterns

Problem Symptom Fix
N+1 queries N+1 DB round trips per page JOIN or batch with WHERE IN
Offset pagination Slow at high page numbers Cursor-based (WHERE id < last_seen)
COUNT(*) at scale Slow aggregation on millions of rows Pre-computed counter on the entity
Connection exhaustion "too many connections" PgBouncer + HikariCP

Real-World Reference

Company Database Key Pattern Scale
Facebook TAO Sharded MySQL + cache Graph model on relational storage Billions reads/sec
Twitter Redis timelines Hybrid fan-out (push < 10K, pull for celebrities) 105TB Redis, 39M QPS
Discord ScyllaDB + Rust data services Request coalescing for hot partitions Trillions of messages
Slack MySQL + Vitess Transparent sharding by workspace 2.3M QPS
Uber Schemaless (MySQL) Append-only 3D hash map Mission-critical trip data
Stripe DocDB (custom) 6-step reversible migration 5M QPS, 99.999% uptime
Shopify MySQL pods Tenant isolation, Ghostferry migration Black Friday scale

Capacity Estimation — Byte Sizes

Back-of-envelope math is expected in interviews. Memorize these:

Data Type Size Example
Boolean 1 byte is_active
Integer (32-bit) 4 bytes user_id (up to ~2B)
BigInt (64-bit) 8 bytes Snowflake ID, timestamps
UUID 16 bytes 550e8400-e29b-...
VARCHAR(255) 1-255 bytes username, email
Text (avg tweet) ~300 bytes 280 chars UTF-8
JSON document (small) ~1 KB API response, user profile
Typical DB row ~0.5-2 KB user row with 8-10 columns
Image thumbnail ~20 KB 200x200 JPEG
Image (full) ~2 MB 1080p photo
1 minute of video ~10 MB 720p compressed

Quick estimation formulas:

Question Formula
Storage for 1B users 1B x 1KB/row = 1 TB
Storage for 500M tweets/day x 1 year 500M x 365 x 300B = ~55 TB
QPS from 100M DAU 100M users x ~10 requests/day / 86,400 sec = ~12K QPS
Peak QPS Average QPS x 3 = ~36K QPS

Key phrase: "Let me do a quick capacity estimate — at 1KB per row and 100M users, we're looking at ~100GB of user data, which fits comfortably on a single PostgreSQL instance."

Red Flags — What Never to Say

Interview Anti-Patterns

  1. "Let's shard it" as your first scaling answer — exhaust indexes, replicas, pooling, and caching first
  2. "I'll use MongoDB because it's more flexible" — without a specific access pattern justifying it, this signals cargo-culting
  3. "We need strong consistency everywhere" — shows you don't understand the cost; map consistency per feature
  4. "I'll use a graph database for the social network" — even Facebook uses sharded MySQL; a follows table with indexes is the right answer
  5. "UUIDv4 for all primary keys" — random UUIDs destroy B-tree performance; mention UUIDv7 or Snowflake IDs instead
  6. "Let me add an index on every column" — each index slows writes; index based on query patterns, not schema

Baseline Architecture — The Safe Default

When you blank on where to start, draw this:

Baseline architecture: client to API server, with Redis cache, PostgreSQL primary with read replicas, and S3 with CDN

This covers 90% of interview scenarios. Add Elasticsearch for search, Kafka for async events, or sharding only when the interviewer pushes you there.

The Five Things to Say in Every Data Modelling Discussion

  1. "The core entities are X, Y, Z — each maps to a table." (Shows structured thinking)
  2. "I'll default to PostgreSQL unless a specific access pattern demands otherwise." (Shows pragmatism)
  3. "I'll index on (col_a, col_b) since the primary query filters on both." (Shows performance awareness)
  4. "I'll denormalize X on the Y table to avoid a join on the hot path." (Shows trade-off awareness)
  5. "Before sharding, I'd try indexes, replicas, and caching first." (Shows maturity)