Partition Keys, Clustering Keys, and Data Locality
TL;DR
The partition key determines which node stores your data, the clustering key determines how it's sorted on that node, and getting this right is the entire game in Cassandra — mess it up and you're fighting the database instead of using it.
What It Is

Cassandra is not a relational database. It's a distributed wide-column store designed for massive write throughput and predictable read latency. It achieves this by forcing you to think about data access patterns before you write a single row.
The core concept: data modeling in Cassandra is query-driven, not entity-driven. You don't normalize tables and then figure out queries. You start with your queries and design tables to serve them. One query, one table. If you need the same data for three different queries, you store it three times in three tables.
This sounds wasteful. It is. But disk is cheap. Cross-partition reads at scale are not. Apple runs one of the largest Cassandra deployments in the world — over 150,000 nodes serving iCloud, Maps, and Siri. They store massive amounts of duplicate data. The trade-off is worth it.
When to Use Cassandra vs Alternatives
| Requirement | Best Choice | Why |
|---|---|---|
| Massive write throughput (100K+ writes/sec) | Cassandra | LSM tree storage, no single leader bottleneck |
| Single-digit ms reads with known keys | DynamoDB or Cassandra | Both optimized for key-value-style access |
| Complex JOINs and ad-hoc queries | PostgreSQL | Cassandra has no JOINs. Period. |
| Full-text search | Elasticsearch | Cassandra search is limited |
| Strong consistency required | PostgreSQL or CockroachDB | Cassandra's strong consistency mode is fragile |
| Multi-region, multi-cloud | Cassandra | No vendor lock-in, ring topology works across DCs |
| Zero operational overhead | DynamoDB | Fully managed, no nodes to babysit |
Here's a take that will annoy some people: Cassandra is overused. Teams adopt it because "we need to scale" when they have 50GB of data and 500 requests per second. PostgreSQL handles that in its sleep. Cassandra makes sense when you genuinely have write-heavy workloads at massive scale — think IoT sensors, time-series metrics, messaging platforms. If your data fits on one beefy PostgreSQL instance, keep it simple.
Internals — Partition Keys
How Partitioning Works
Every table in Cassandra has a primary key. The first part of the primary key is the partition key. It determines which node in the cluster stores the row.
CREATE TABLE messages (
chat_id UUID,
sent_at TIMESTAMP,
sender_id UUID,
body TEXT,
PRIMARY KEY ((chat_id), sent_at)
);
-- ^^^^^^^^ ^^^^^^
-- partition clustering
-- key key
When you write a row, Cassandra hashes the partition key:
The token is a 64-bit integer. The cluster's nodes each own a range of tokens on the consistent hashing ring:
Token: 0
│
┌───────┤
Node A │ Node B
[range: -2^63 │ [range: 0
to -1] │ to 2^62]
│ │ │
└───────┼───────┘
│
Node D │ Node C
[range: -2^62 │ [range: 2^62
to -2^63] │ to 2^63]
│ │ │
└───────┼───────┘
│
Token: 2^63
The node that owns the token range containing hash(chat_id) is the primary owner. With replication factor 3, the next two nodes on the ring also store copies.
This is why partition key choice matters so much. All rows with the same partition key land on the same node and are stored physically together on disk. This is data locality — the single most important concept in Cassandra data modeling.
Compound Partition Keys
Sometimes a single column doesn't distribute data well enough. You can use multiple columns as the partition key:
CREATE TABLE metrics (
sensor_id TEXT,
day DATE,
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((sensor_id, day), timestamp)
);
-- ^^^^^^^^^^^^^^^^^^
-- compound partition key
Now hash(sensor_id, day) determines the node. Each sensor's data for a single day lives in one partition. This prevents partitions from growing indefinitely — tomorrow's data starts a new partition.
This is a common pattern for time-series data. Without the day component, all data for a sensor across months or years would pile up in one partition. That partition would grow to gigabytes and become a hot spot.
Partition Key Anti-Patterns
Too few distinct values:
-- BAD: Only a handful of distinct countries
PRIMARY KEY ((country), user_id)
-- All US users on one node. That's 300 million rows in one partition.
This creates hot partitions. One node handles a disproportionate share of traffic while others idle.
Too many distinct values with no grouping:
-- BAD: Every row has a unique partition key
PRIMARY KEY ((event_id))
-- No way to range-scan events. Every read is a single-row lookup.
-- To get "all events for user X," you'd scan the entire cluster.
This scatters data so widely that any query needing multiple rows becomes a full cluster scan.
The right balance:
Each partition key value should group 100-10,000 rows together. Enough for efficient range scans within a partition, but not so many that the partition becomes a hot spot.
Internals — Clustering Keys
Sort Order Within a Partition
The clustering key determines how rows are sorted within a partition. Cassandra stores rows with the same partition key together on disk, sorted by the clustering key. This makes range scans within a partition blazingly fast — it's a sequential disk read.
CREATE TABLE messages (
chat_id UUID,
sent_at TIMESTAMP,
sender_id UUID,
body TEXT,
PRIMARY KEY ((chat_id), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);
All messages for chat_id = abc123 are stored together, sorted by sent_at in descending order. Getting the latest 50 messages is a sequential read — no sorting needed.
-- Fast: reads a contiguous slice of one partition
SELECT * FROM messages
WHERE chat_id = 'abc123'
ORDER BY sent_at DESC
LIMIT 50;
-- Also fast: range within one partition
SELECT * FROM messages
WHERE chat_id = 'abc123'
AND sent_at >= '2024-04-01'
AND sent_at <= '2024-04-18';
Multiple Clustering Keys
You can have multiple clustering columns. They define a nested sort order:
CREATE TABLE user_activity (
user_id UUID,
activity_date DATE,
activity_time TIMESTAMP,
action TEXT,
PRIMARY KEY ((user_id), activity_date, activity_time)
) WITH CLUSTERING ORDER BY (activity_date DESC, activity_time DESC);
Data layout on disk:
Partition: user_id = alice
├── activity_date = 2024-04-18
│ ├── activity_time = 14:30:00 → action = "login"
│ ├── activity_time = 14:25:00 → action = "purchase"
│ └── activity_time = 14:20:00 → action = "browse"
├── activity_date = 2024-04-17
│ ├── activity_time = 22:00:00 → action = "logout"
│ └── activity_time = 09:00:00 → action = "login"
└── ...
Key rule: You can filter on clustering columns from left to right, but you cannot skip one. This query works:
-- OK: filter on first clustering column
SELECT * FROM user_activity
WHERE user_id = 'alice'
AND activity_date = '2024-04-18';
-- OK: filter on first AND second clustering columns
SELECT * FROM user_activity
WHERE user_id = 'alice'
AND activity_date = '2024-04-18'
AND activity_time > '14:00:00';
This query does NOT work:
-- FAILS: skipping activity_date, filtering only on activity_time
SELECT * FROM user_activity
WHERE user_id = 'alice'
AND activity_time > '14:00:00';
You can't skip activity_date and jump to activity_time. The data is sorted by activity_date first, then by activity_time within each date. Without specifying the date, Cassandra would have to scan every date to find matching times — which defeats the purpose of the sorted storage.
Gotcha
This left-to-right rule on clustering columns is the most common source of Cassandra query failures for teams coming from SQL databases. In PostgreSQL, you can filter on any indexed column in any order. In Cassandra, the clustering key ordering is the law.
Data Locality — Why It All Matters
Data locality is the payoff for all this careful key design. When rows with the same partition key are physically stored together on disk, reading them is a sequential I/O operation — the fastest thing a disk can do.
Compare two designs for a messaging app:
Design A — Bad locality:
PRIMARY KEY ((message_id))
-- Every message in its own partition
-- "Get messages for chat X" = scatter-gather across all nodes
-- Each message is a random read on a random node
Design B — Good locality:
PRIMARY KEY ((chat_id), sent_at)
-- All messages for a chat in one partition
-- "Get messages for chat X" = sequential read on one node
-- Sorted by time — latest messages read first
Design B is orders of magnitude faster for the most common query. One network hop to one node, one sequential disk read. Design A hits every node in the cluster.
Instagram originally used Cassandra for their feed storage with this kind of partition-per-entity design. Each user's feed was a single partition, sorted by timestamp. Scrolling through the feed was a sequential read within one partition.
The Query-First Modeling Approach
In Cassandra, you design tables backward compared to relational databases:
- List your queries
- Design a table for each query
- Accept data duplication
Query: "Get the 50 most recent messages in chat X"
→ Table: messages_by_chat
PRIMARY KEY ((chat_id), sent_at DESC)
Query: "Get all messages sent by user Y in the last 24 hours"
→ Table: messages_by_sender
PRIMARY KEY ((sender_id, day), sent_at DESC)
Query: "Get unread message count for user Z"
→ Table: unread_counts
PRIMARY KEY ((user_id))
Columns: count COUNTER
Three tables. Overlapping data. The same message might exist in messages_by_chat and messages_by_sender. This feels wrong to anyone trained on relational normalization. It is the correct approach for Cassandra.
Partition Size Limits
Cassandra has practical limits on partition size. Exceeding them doesn't crash the database, but performance degrades badly.
| Metric | Recommended Limit | What Happens When Exceeded |
|---|---|---|
| Partition size | < 100 MB | Compaction slows down, GC pressure increases |
| Row count per partition | < 100,000 rows | Read latency spikes, heap pressure |
| Single row size | < 10 MB | Network and memory overhead per read |
Monitoring partition sizes:
This shows the distribution of partition sizes. If your p99 partition size is over 100MB, you need to rethink your partition key. Adding a time component (like day or month) is the standard fix — it caps partition growth by bucketizing across time periods.
Discord ran into this limit with their messages table. Popular servers like "Midjourney" had millions of messages in a single partition. Read latency spiked, compaction lagged, and garbage collection pauses caused visible outages. They eventually migrated to ScyllaDB (a Cassandra-compatible database written in C++) and redesigned their partition strategy.
Patterns for System Design Interviews
Pattern 1: "Design a Chat Storage System"
-- Store messages for retrieval by chat
CREATE TABLE messages_by_chat (
chat_id UUID,
bucket INT, -- time bucket (e.g., week number)
sent_at TIMESTAMP,
message_id UUID,
sender_id UUID,
body TEXT,
PRIMARY KEY ((chat_id, bucket), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);
The bucket prevents unbounded partition growth. Each chat+week combo is a separate partition. To load the latest messages, query the current bucket. To load older messages, query previous buckets.
Pattern 2: "Design a Time-Series Metrics Store"
CREATE TABLE sensor_readings (
sensor_id TEXT,
day DATE,
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((sensor_id, day), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
One partition per sensor per day. A sensor sending data every second generates 86,400 rows per day — well within the 100K limit. Range queries within a day are fast sequential reads.
Pattern 3: "Design a User Activity Feed"
Two tables — one for writing, one for reading:
-- Write path: append activity events
CREATE TABLE user_events (
user_id UUID,
month TEXT,
event_time TIMESTAMP,
event_type TEXT,
details TEXT,
PRIMARY KEY ((user_id, month), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
-- Read path: latest activity across all users (for admin dashboard)
CREATE TABLE recent_events (
day DATE,
event_time TIMESTAMP,
user_id UUID,
event_type TEXT,
PRIMARY KEY ((day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Write to both tables on every event. Read from whichever table matches the query pattern.
Trade-offs Table
| Decision | Trade-off |
|---|---|
| Wide partitions (many rows per partition key) | Fast range scans but risk exceeding size limits |
| Narrow partitions (few rows per partition key) | Small partitions but more scatter-gather for multi-row queries |
| Compound partition key with time bucket | Caps partition growth but complicates queries across buckets |
| Single-column partition key | Simpler queries but risk of unbounded partition growth |
| Denormalized tables (one per query) | Fast reads but write amplification and consistency complexity |
| Normalized tables (fewer, shared) | Less duplication but cross-partition queries kill performance |
| Descending clustering order | Optimized for "latest first" but ascending scans are slower |
| More clustering columns | Flexible nested sorting but stricter left-to-right query rules |

Interview Gotchas
"Can you query by a non-partition-key column?"
Not efficiently. Without a partition key, Cassandra must scan every node. You can use ALLOW FILTERING to force it, but that's a full cluster scan — never acceptable in production. If you need to query by a non-key column, create a second table with that column as the partition key, or use a secondary index (limited to low-cardinality columns only).
"What's the difference between a partition key and a primary key?"
The primary key = partition key + clustering key(s). The partition key determines which node. The clustering key determines sort order within that node. The full primary key uniquely identifies a row.
"Why can't you change the sort order after table creation?"
Because the data is physically stored in clustering key order on disk. Changing the sort order would require rewriting every SSTable. If you need both ascending and descending order, create two tables with different CLUSTERING ORDER BY.
"What happens if a partition gets too large?"
Read latency increases because Cassandra must load more data into memory. Compaction takes longer. Garbage collection pressure increases because large partitions create large heap allocations. The fix is time-bucketing: add a date or time component to the partition key to cap growth.
"How is this different from a relational database's composite primary key?"
In PostgreSQL, a composite primary key is just a uniqueness constraint and B-tree index. Any column in the key can be used for filtering independently. In Cassandra, the partition key columns determine physical data placement (which node), and clustering columns must be queried left-to-right. The physical implications are fundamentally different.
"Can two rows have the same partition key and clustering key?"
No. If you insert a row with the same full primary key (partition key + clustering key), Cassandra overwrites the existing row. There's no duplicate check and no error — it's an upsert by default. This catches people off guard. If you need to store multiple events with the same timestamp, add a UUID as an additional clustering column.
Key Takeaways
| Concept | What to Remember |
|---|---|
| Partition key | Determines which node stores the data. hash(partition_key) → token → node. |
| Clustering key | Determines sort order within a partition. Enables fast range scans. |
| Data locality | Rows with the same partition key are stored together on disk. Sequential reads are fast. |
| Query-first modeling | Design tables around queries, not entities. One query, one table. |
| Partition size limits | Keep under 100MB / 100K rows. Time-bucketing is the standard fix. |
| Left-to-right rule | Clustering columns must be filtered in order. Can't skip one. |
| Upsert by default | Same primary key = overwrite. No duplicate detection. |