Partition Keys, Clustering Keys, and Data Locality

TL;DR

The partition key determines which node stores your data, the clustering key determines how it's sorted on that node, and getting this right is the entire game in Cassandra — mess it up and you're fighting the database instead of using it.

What It Is

Consistent Hashing

Cassandra is not a relational database. It's a distributed wide-column store designed for massive write throughput and predictable read latency. It achieves this by forcing you to think about data access patterns before you write a single row.

The core concept: data modeling in Cassandra is query-driven, not entity-driven. You don't normalize tables and then figure out queries. You start with your queries and design tables to serve them. One query, one table. If you need the same data for three different queries, you store it three times in three tables.

This sounds wasteful. It is. But disk is cheap. Cross-partition reads at scale are not. Apple runs one of the largest Cassandra deployments in the world — over 150,000 nodes serving iCloud, Maps, and Siri. They store massive amounts of duplicate data. The trade-off is worth it.

When to Use Cassandra vs Alternatives

Requirement	Best Choice	Why
Massive write throughput (100K+ writes/sec)	Cassandra	LSM tree storage, no single leader bottleneck
Single-digit ms reads with known keys	DynamoDB or Cassandra	Both optimized for key-value-style access
Complex JOINs and ad-hoc queries	PostgreSQL	Cassandra has no JOINs. Period.
Full-text search	Elasticsearch	Cassandra search is limited
Strong consistency required	PostgreSQL or CockroachDB	Cassandra's strong consistency mode is fragile
Multi-region, multi-cloud	Cassandra	No vendor lock-in, ring topology works across DCs
Zero operational overhead	DynamoDB	Fully managed, no nodes to babysit

Here's a take that will annoy some people: Cassandra is overused. Teams adopt it because "we need to scale" when they have 50GB of data and 500 requests per second. PostgreSQL handles that in its sleep. Cassandra makes sense when you genuinely have write-heavy workloads at massive scale — think IoT sensors, time-series metrics, messaging platforms. If your data fits on one beefy PostgreSQL instance, keep it simple.

Internals — Partition Keys

How Partitioning Works

Every table in Cassandra has a primary key. The first part of the primary key is the partition key. It determines which node in the cluster stores the row.

CREATE TABLE messages (
    chat_id    UUID,
    sent_at    TIMESTAMP,
    sender_id  UUID,
    body       TEXT,
    PRIMARY KEY ((chat_id), sent_at)
);
--              ^^^^^^^^    ^^^^^^
--              partition    clustering
--              key          key

When you write a row, Cassandra hashes the partition key:

token = hash(chat_id)    -- using Murmur3 hash

The token is a 64-bit integer. The cluster's nodes each own a range of tokens on the consistent hashing ring:

                    Token: 0
                      │
              ┌───────┤
         Node A       │         Node B
    [range: -2^63     │    [range: 0
     to -1]           │     to 2^62]
              │       │       │
              └───────┼───────┘
                      │
         Node D       │         Node C
    [range: -2^62     │    [range: 2^62
     to -2^63]        │     to 2^63]
              │       │       │
              └───────┼───────┘
                      │
                  Token: 2^63

The node that owns the token range containing hash(chat_id) is the primary owner. With replication factor 3, the next two nodes on the ring also store copies.

This is why partition key choice matters so much. All rows with the same partition key land on the same node and are stored physically together on disk. This is data locality — the single most important concept in Cassandra data modeling.

Compound Partition Keys

Sometimes a single column doesn't distribute data well enough. You can use multiple columns as the partition key:

CREATE TABLE metrics (
    sensor_id   TEXT,
    day         DATE,
    timestamp   TIMESTAMP,
    value       DOUBLE,
    PRIMARY KEY ((sensor_id, day), timestamp)
);
--              ^^^^^^^^^^^^^^^^^^
--              compound partition key

Now hash(sensor_id, day) determines the node. Each sensor's data for a single day lives in one partition. This prevents partitions from growing indefinitely — tomorrow's data starts a new partition.

This is a common pattern for time-series data. Without the day component, all data for a sensor across months or years would pile up in one partition. That partition would grow to gigabytes and become a hot spot.

Partition Key Anti-Patterns

Too few distinct values:

-- BAD: Only a handful of distinct countries
PRIMARY KEY ((country), user_id)
-- All US users on one node. That's 300 million rows in one partition.

This creates hot partitions. One node handles a disproportionate share of traffic while others idle.

Too many distinct values with no grouping:

-- BAD: Every row has a unique partition key
PRIMARY KEY ((event_id))
-- No way to range-scan events. Every read is a single-row lookup.
-- To get "all events for user X," you'd scan the entire cluster.

This scatters data so widely that any query needing multiple rows becomes a full cluster scan.

The right balance:

Each partition key value should group 100-10,000 rows together. Enough for efficient range scans within a partition, but not so many that the partition becomes a hot spot.

Internals — Clustering Keys

Sort Order Within a Partition

The clustering key determines how rows are sorted within a partition. Cassandra stores rows with the same partition key together on disk, sorted by the clustering key. This makes range scans within a partition blazingly fast — it's a sequential disk read.

CREATE TABLE messages (
    chat_id    UUID,
    sent_at    TIMESTAMP,
    sender_id  UUID,
    body       TEXT,
    PRIMARY KEY ((chat_id), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);

All messages for chat_id = abc123 are stored together, sorted by sent_at in descending order. Getting the latest 50 messages is a sequential read — no sorting needed.

-- Fast: reads a contiguous slice of one partition
SELECT * FROM messages
WHERE chat_id = 'abc123'
ORDER BY sent_at DESC
LIMIT 50;

-- Also fast: range within one partition
SELECT * FROM messages
WHERE chat_id = 'abc123'
AND sent_at >= '2024-04-01'
AND sent_at <= '2024-04-18';

Multiple Clustering Keys

You can have multiple clustering columns. They define a nested sort order:

CREATE TABLE user_activity (
    user_id       UUID,
    activity_date DATE,
    activity_time TIMESTAMP,
    action        TEXT,
    PRIMARY KEY ((user_id), activity_date, activity_time)
) WITH CLUSTERING ORDER BY (activity_date DESC, activity_time DESC);

Data layout on disk:

Partition: user_id = alice
├── activity_date = 2024-04-18
│   ├── activity_time = 14:30:00 → action = "login"
│   ├── activity_time = 14:25:00 → action = "purchase"
│   └── activity_time = 14:20:00 → action = "browse"
├── activity_date = 2024-04-17
│   ├── activity_time = 22:00:00 → action = "logout"
│   └── activity_time = 09:00:00 → action = "login"
└── ...

Key rule: You can filter on clustering columns from left to right, but you cannot skip one. This query works:

-- OK: filter on first clustering column
SELECT * FROM user_activity
WHERE user_id = 'alice'
AND activity_date = '2024-04-18';

-- OK: filter on first AND second clustering columns
SELECT * FROM user_activity
WHERE user_id = 'alice'
AND activity_date = '2024-04-18'
AND activity_time > '14:00:00';

This query does NOT work:

-- FAILS: skipping activity_date, filtering only on activity_time
SELECT * FROM user_activity
WHERE user_id = 'alice'
AND activity_time > '14:00:00';

You can't skip activity_date and jump to activity_time. The data is sorted by activity_date first, then by activity_time within each date. Without specifying the date, Cassandra would have to scan every date to find matching times — which defeats the purpose of the sorted storage.

Gotcha

This left-to-right rule on clustering columns is the most common source of Cassandra query failures for teams coming from SQL databases. In PostgreSQL, you can filter on any indexed column in any order. In Cassandra, the clustering key ordering is the law.

Data Locality — Why It All Matters

Data locality is the payoff for all this careful key design. When rows with the same partition key are physically stored together on disk, reading them is a sequential I/O operation — the fastest thing a disk can do.

Compare two designs for a messaging app:

Design A — Bad locality:

PRIMARY KEY ((message_id))
-- Every message in its own partition
-- "Get messages for chat X" = scatter-gather across all nodes
-- Each message is a random read on a random node

Design B — Good locality:

PRIMARY KEY ((chat_id), sent_at)
-- All messages for a chat in one partition
-- "Get messages for chat X" = sequential read on one node
-- Sorted by time — latest messages read first

Design B is orders of magnitude faster for the most common query. One network hop to one node, one sequential disk read. Design A hits every node in the cluster.

Instagram originally used Cassandra for their feed storage with this kind of partition-per-entity design. Each user's feed was a single partition, sorted by timestamp. Scrolling through the feed was a sequential read within one partition.

The Query-First Modeling Approach

In Cassandra, you design tables backward compared to relational databases:

List your queries
Design a table for each query
Accept data duplication

Query: "Get the 50 most recent messages in chat X"
→ Table: messages_by_chat
  PRIMARY KEY ((chat_id), sent_at DESC)

Query: "Get all messages sent by user Y in the last 24 hours"
→ Table: messages_by_sender
  PRIMARY KEY ((sender_id, day), sent_at DESC)

Query: "Get unread message count for user Z"
→ Table: unread_counts
  PRIMARY KEY ((user_id))
  Columns: count COUNTER

Three tables. Overlapping data. The same message might exist in messages_by_chat and messages_by_sender. This feels wrong to anyone trained on relational normalization. It is the correct approach for Cassandra.

Partition Size Limits

Cassandra has practical limits on partition size. Exceeding them doesn't crash the database, but performance degrades badly.

Metric	Recommended Limit	What Happens When Exceeded
Partition size	< 100 MB	Compaction slows down, GC pressure increases
Row count per partition	< 100,000 rows	Read latency spikes, heap pressure
Single row size	< 10 MB	Network and memory overhead per read

Monitoring partition sizes:

nodetool tablehistograms keyspace.table

This shows the distribution of partition sizes. If your p99 partition size is over 100MB, you need to rethink your partition key. Adding a time component (like day or month) is the standard fix — it caps partition growth by bucketizing across time periods.

Discord ran into this limit with their messages table. Popular servers like "Midjourney" had millions of messages in a single partition. Read latency spiked, compaction lagged, and garbage collection pauses caused visible outages. They eventually migrated to ScyllaDB (a Cassandra-compatible database written in C++) and redesigned their partition strategy.

Patterns for System Design Interviews

Pattern 1: "Design a Chat Storage System"

-- Store messages for retrieval by chat
CREATE TABLE messages_by_chat (
    chat_id    UUID,
    bucket     INT,         -- time bucket (e.g., week number)
    sent_at    TIMESTAMP,
    message_id UUID,
    sender_id  UUID,
    body       TEXT,
    PRIMARY KEY ((chat_id, bucket), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);

The bucket prevents unbounded partition growth. Each chat+week combo is a separate partition. To load the latest messages, query the current bucket. To load older messages, query previous buckets.

Pattern 2: "Design a Time-Series Metrics Store"

CREATE TABLE sensor_readings (
    sensor_id  TEXT,
    day        DATE,
    timestamp  TIMESTAMP,
    value      DOUBLE,
    PRIMARY KEY ((sensor_id, day), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

One partition per sensor per day. A sensor sending data every second generates 86,400 rows per day — well within the 100K limit. Range queries within a day are fast sequential reads.

Pattern 3: "Design a User Activity Feed"

Two tables — one for writing, one for reading:

-- Write path: append activity events
CREATE TABLE user_events (
    user_id    UUID,
    month      TEXT,
    event_time TIMESTAMP,
    event_type TEXT,
    details    TEXT,
    PRIMARY KEY ((user_id, month), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Read path: latest activity across all users (for admin dashboard)
CREATE TABLE recent_events (
    day        DATE,
    event_time TIMESTAMP,
    user_id    UUID,
    event_type TEXT,
    PRIMARY KEY ((day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

Write to both tables on every event. Read from whichever table matches the query pattern.

Trade-offs Table

Decision	Trade-off
Wide partitions (many rows per partition key)	Fast range scans but risk exceeding size limits
Narrow partitions (few rows per partition key)	Small partitions but more scatter-gather for multi-row queries
Compound partition key with time bucket	Caps partition growth but complicates queries across buckets
Single-column partition key	Simpler queries but risk of unbounded partition growth
Denormalized tables (one per query)	Fast reads but write amplification and consistency complexity
Normalized tables (fewer, shared)	Less duplication but cross-partition queries kill performance
Descending clustering order	Optimized for "latest first" but ascending scans are slower
More clustering columns	Flexible nested sorting but stricter left-to-right query rules

Partition Clustering

Interview Gotchas

"Can you query by a non-partition-key column?"

Not efficiently. Without a partition key, Cassandra must scan every node. You can use ALLOW FILTERING to force it, but that's a full cluster scan — never acceptable in production. If you need to query by a non-key column, create a second table with that column as the partition key, or use a secondary index (limited to low-cardinality columns only).

"What's the difference between a partition key and a primary key?"

The primary key = partition key + clustering key(s). The partition key determines which node. The clustering key determines sort order within that node. The full primary key uniquely identifies a row.

"Why can't you change the sort order after table creation?"

Because the data is physically stored in clustering key order on disk. Changing the sort order would require rewriting every SSTable. If you need both ascending and descending order, create two tables with different CLUSTERING ORDER BY.

"What happens if a partition gets too large?"

Read latency increases because Cassandra must load more data into memory. Compaction takes longer. Garbage collection pressure increases because large partitions create large heap allocations. The fix is time-bucketing: add a date or time component to the partition key to cap growth.

"How is this different from a relational database's composite primary key?"

In PostgreSQL, a composite primary key is just a uniqueness constraint and B-tree index. Any column in the key can be used for filtering independently. In Cassandra, the partition key columns determine physical data placement (which node), and clustering columns must be queried left-to-right. The physical implications are fundamentally different.

"Can two rows have the same partition key and clustering key?"

No. If you insert a row with the same full primary key (partition key + clustering key), Cassandra overwrites the existing row. There's no duplicate check and no error — it's an upsert by default. This catches people off guard. If you need to store multiple events with the same timestamp, add a UUID as an additional clustering column.

Key Takeaways

Concept	What to Remember
Partition key	Determines which node stores the data. hash(partition_key) → token → node.
Clustering key	Determines sort order within a partition. Enables fast range scans.
Data locality	Rows with the same partition key are stored together on disk. Sequential reads are fast.
Query-first modeling	Design tables around queries, not entities. One query, one table.
Partition size limits	Keep under 100MB / 100K rows. Time-bucketing is the standard fix.
Left-to-right rule	Clustering columns must be filtered in order. Can't skip one.
Upsert by default	Same primary key = overwrite. No duplicate detection.