Cassandra vs DynamoDB — Trade-offs and When to Choose
TL;DR
Cassandra gives you control and portability at the cost of operational pain; DynamoDB gives you zero-ops convenience at the cost of vendor lock-in and unpredictable bills — and for most teams, the right answer is whichever one their infrastructure team can actually run.
What It Is

Cassandra and DynamoDB solve the same fundamental problem: store massive amounts of data with predictable, low-latency reads and writes, distributed across multiple nodes. Both use partition keys for data distribution. Both sacrifice relational features (JOINs, complex transactions) for horizontal scalability.
But they make opposite bets on the build-vs-buy spectrum.
Cassandra says: "Here's the engine. You maintain it." Open-source, run it anywhere, tune every knob. Full control, full responsibility.
DynamoDB says: "Here's the service. We maintain it." Fully managed, AWS-only, fewer knobs. Zero operational overhead, total vendor lock-in.
Amazon built Dynamo, DynamoDB's internal predecessor, after experiencing severe outages with their previous database systems during peak shopping events. The 2007 Dynamo paper described that architecture. DynamoDB launched in 2012 as a managed service based on those ideas. Today it powers Amazon.com's shopping cart, Alexa, and Twitch.
Cassandra was open-sourced by Facebook in 2008 after they built it for inbox search. It's now an Apache project used by Apple (150K+ nodes), Netflix, Discord (before their ScyllaDB migration), and Uber.
Cassandra — The Self-Managed Option
Architecture
Cassandra uses a peer-to-peer ring topology. No master node. Every node is equal. Any node can serve any request.
          Node A
         /      \
        /        \
   Node F        Node B
     |             |
   Node E        Node C
        \        /
         \      /
          Node D
Every node talks to every other via gossip protocol.
No single point of failure.
Gossip protocol — every second, each node picks 1-3 random nodes and exchanges state information (who's alive, who's dead, who owns which token ranges). Within seconds, the entire cluster knows about topology changes. No ZooKeeper. No coordination service. Just gossip.
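To make the "within seconds" claim concrete, here is a toy simulation of how fast one piece of state spreads when every informed node tells three random peers per round. It is a simplified push-only model, not Cassandra's actual gossip exchange; the function name `gossip_rounds` and the `fanout` parameter are illustrative assumptions.

```python
import random


def gossip_rounds(num_nodes: int, fanout: int = 3, seed: int = 0) -> int:
    """Return how many rounds one update needs to reach every node.

    Toy push-only model: each round, every node that already knows the
    update tells up to `fanout` randomly chosen peers.
    """
    rng = random.Random(seed)
    informed = {0}  # node 0 is the first to learn of a topology change
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        newly_informed = set()
        for node in informed:
            peers = [n for n in range(num_nodes) if n != node]
            newly_informed.update(rng.sample(peers, k=min(fanout, len(peers))))
        informed |= newly_informed
    return rounds


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, "nodes ->", gossip_rounds(size), "rounds")
```

Because the informed set can roughly quadruple each round, the round count grows with the logarithm of the cluster size, which is why large clusters still converge quickly.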
Virtual nodes (vnodes) — instead of each node owning one contiguous range on the token ring, each node owns many small ranges (256 by default before Cassandra 4.0, 16 since). This makes data distribution more even and rebalancing smoother when nodes are added or removed.
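A rough sketch of why many small ranges even out the load: build a token ring, assign keys to the owner of the next token clockwise, and compare the busiest node with the ideal share. MD5 stands in for Cassandra's partitioner here purely to get a stable hash, and the helper names (`build_ring`, `owner`, `load_imbalance`) are illustrative, not Cassandra APIs.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    """Map a string onto the ring (MD5 is only a stand-in for a stable hash)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes_per_node):
    """Return parallel lists (tokens, owners), sorted by token."""
    pairs = sorted((token(f"{node}#{i}"), node)
                   for node in nodes for i in range(vnodes_per_node))
    return [t for t, _ in pairs], [n for _, n in pairs]


def owner(ring, partition_key: str) -> str:
    """Walk clockwise to the first vnode token >= the key's token."""
    tokens, owners = ring
    idx = bisect.bisect_left(tokens, token(partition_key)) % len(tokens)
    return owners[idx]


def load_imbalance(nodes, vnodes_per_node, num_keys=20_000) -> float:
    """Busiest node's key count divided by the ideal (perfectly even) count."""
    ring = build_ring(nodes, vnodes_per_node)
    counts = Counter(owner(ring, f"key-{i}") for i in range(num_keys))
    return max(counts.values()) / (num_keys / len(nodes))


if __name__ == "__main__":
    nodes = [f"node-{c}" for c in "ABCDEF"]
    for vnodes in (1, 16, 256):
        print(vnodes, "vnodes/node -> imbalance", round(load_imbalance(nodes, vnodes), 2))
```

A value of 1.0 means a perfectly even spread; the more vnodes per node, the closer the busiest node tends to sit to that ideal.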
Cassandra Query Language (CQL)
CQL looks like SQL but has strict constraints:
-- Table definition
CREATE TABLE orders (
customer_id UUID,
order_date DATE,
order_id UUID,
total DECIMAL,
items LIST<TEXT>,
PRIMARY KEY ((customer_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);
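-- UUID literals are written without quotes; quoting them is a type error
-- This works (partition key + clustering key prefix)
SELECT * FROM orders
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
AND order_date >= '2024-01-01';
-- This FAILS (no partition key)
SELECT * FROM orders
WHERE order_date >= '2024-01-01';
-- This FAILS (skipping the first clustering key, order_date)
SELECT * FROM orders
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
AND order_id = 9f1c2d3e-4b5a-4c6d-8e7f-0a1b2c3d4e5f;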
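CQL enforces the data model. Ad-hoc queries are effectively off the table: every efficient query must include the full partition key, and escape hatches such as ALLOW FILTERING or secondary indexes read far more data than they return. If your query doesn't fit your table's primary key, you need a different table.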
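The restriction rules behind those three queries can be sketched as a small checker. This is a simplified model of the rules described above, not Cassandra's actual validation logic: it ignores ALLOW FILTERING, secondary indexes, IN clauses and multi-column restrictions, and the function name `check_query` is made up for illustration.

```python
def check_query(partition_keys, clustering_keys, restrictions):
    """Return (ok, reason) for a WHERE clause.

    `restrictions` maps a column name to its operator, e.g. {"order_date": ">="}.
    """
    # 1. Every partition key column needs an equality restriction.
    for col in partition_keys:
        if restrictions.get(col) != "=":
            return False, f"partition key column '{col}' must be restricted with '='"

    # 2. Only primary key columns may be restricted.
    allowed = set(partition_keys) | set(clustering_keys)
    for col in restrictions:
        if col not in allowed:
            return False, f"'{col}' is not part of the primary key"

    # 3. Clustering columns must be restricted left to right with no gaps,
    #    and nothing may follow a range restriction.
    prefix_closed = False
    for col in clustering_keys:
        op = restrictions.get(col)
        if op is None:
            prefix_closed = True
        elif prefix_closed:
            return False, f"cannot restrict '{col}' after a skipped or range-restricted column"
        elif op != "=":
            prefix_closed = True
    return True, "ok"


if __name__ == "__main__":
    pk, ck = ["customer_id"], ["order_date", "order_id"]
    print(check_query(pk, ck, {"customer_id": "=", "order_date": ">="}))
    print(check_query(pk, ck, {"order_date": ">="}))
    print(check_query(pk, ck, {"customer_id": "=", "order_id": "="}))
```

Running it against the `orders` table's key layout accepts the first query and rejects the other two, for the same reasons given in the CQL comments.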
Operational Considerations
Running Cassandra means operating a distributed system. Here's what that actually involves:
| Task | Frequency | Pain Level |
|---|---|---|
| Monitoring cluster health | Continuous | Low (Prometheus + Grafana) |
| Running repairs | Weekly | Medium (can impact performance) |
| Adding/removing nodes | As needed | Medium (rebalancing takes time) |
| JVM tuning (GC pauses) | Initially + after incidents | High (requires deep Java knowledge) |
| Compaction tuning | Periodically | Medium (wrong strategy = cascading failures) |
| Schema migrations | Per release | Low (CQL ALTER is simple) |
| Upgrading Cassandra versions | Quarterly-ish | High (rolling upgrades, version compatibility) |
| Handling data center failures | Rarely | Very High (manual intervention required) |
The JVM is the elephant in the room. Cassandra runs on Java. Garbage collection pauses can spike read latency from 5ms to 500ms. Tuning the GC — choosing between G1GC and ZGC, setting heap sizes, adjusting young generation ratios — is a dark art that requires specialized knowledge. This is one of the main reasons Discord left Cassandra for ScyllaDB (written in C++, no GC pauses).
DynamoDB — The Managed Option
Architecture
DynamoDB's internals are proprietary, but the public information reveals:
              ┌──────────────────┐
              │  Request Router  │
              │ (managed by AWS) │
              └────────┬─────────┘
                       │
      ┌────────────────┼────────────────┐
      ▼                ▼                ▼
┌───────────┐    ┌───────────┐    ┌───────────┐
│  Storage  │    │  Storage  │    │  Storage  │
│  Node A   │    │  Node B   │    │  Node C   │
│ (3 copies │    │ (3 copies │    │ (3 copies │
│ per item) │    │ per item) │    │ per item) │
└───────────┘    └───────────┘    └───────────┘
You never see the nodes. You never configure replication. You never run repairs. AWS does all of it. Your interface is an API endpoint and a billing dashboard.
Data Model
DynamoDB uses a partition key (hash key) and an optional sort key (range key). This closely mirrors Cassandra's partition key and clustering key, although Cassandra allows several columns in each role while DynamoDB allows a single attribute per role.
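import boto3
from decimal import Decimal
from boto3.dynamodb.conditions import Key
# Region and credentials are taken from the environment
dynamodb = boto3.resource('dynamodb')
# Create table
table = dynamodb.create_table(
    TableName='orders',
    KeySchema=[
        {'AttributeName': 'customer_id', 'KeyType': 'HASH'},   # partition key
        {'AttributeName': 'order_date', 'KeyType': 'RANGE'},   # sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'customer_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_date', 'AttributeType': 'S'},
    ],
    BillingMode='PAY_PER_REQUEST'
)
# Table creation is asynchronous; wait before writing
table.wait_until_exists()
# Write
# Note: (customer_id, order_date) is the whole key here, so a second order
# by the same customer on the same date would overwrite this item
table.put_item(Item={
    'customer_id': 'abc',
    'order_date': '2024-04-18',
    'order_id': 'xyz',
    'total': Decimal('99.99'),   # boto3 requires Decimal, not float
})
# Query (must include partition key)
response = table.query(
    KeyConditionExpression=Key('customer_id').eq('abc')
    & Key('order_date').gte('2024-01-01')
)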
DynamoDB-Specific Features
Global Secondary Index (GSI): Creates a copy of the table with a different partition key and sort key. Queries on the GSI hit the copy, not the base table. Eventually consistent only; strongly consistent reads are not supported on a GSI.
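# GSI: query orders by product
# Assumes items carry a product_id attribute. Key attributes of a new index
# must be declared in AttributeDefinitions, or the call is rejected.
table.meta.client.update_table(
    TableName='orders',
    AttributeDefinitions=[
        {'AttributeName': 'product_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_date', 'AttributeType': 'S'},
    ],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'product-index',
            'KeySchema': [
                {'AttributeName': 'product_id', 'KeyType': 'HASH'},
                {'AttributeName': 'order_date', 'KeyType': 'RANGE'},
            ],
            'Projection': {'ProjectionType': 'ALL'},
        }
    }]
)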
Local Secondary Index (LSI): Same partition key as the base table but a different sort key. Must be created at table creation time. Strongly consistent reads available.
DynamoDB Streams: Change data capture (CDC). Every write to the table emits an event to a stream. Attach a Lambda function to react to changes — like updating a search index, sending notifications, or replicating to another region.
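A minimal sketch of what such a Lambda function might look like for the `orders` table. It only translates stream records into abstract "upsert"/"delete" actions; the action tuples and the document id format are assumptions for illustration, and `NewImage` is only present when the stream's view type includes new images.

```python
def handler(event, context=None):
    """Lambda-style entry point: turn stream records into index actions."""
    actions = []
    for record in event.get("Records", []):
        event_name = record["eventName"]        # INSERT, MODIFY or REMOVE
        keys = record["dynamodb"]["Keys"]       # attribute values are typed, e.g. {"S": "abc"}
        doc_id = f'{keys["customer_id"]["S"]}#{keys["order_date"]["S"]}'
        if event_name in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"].get("NewImage", {})
            actions.append(("upsert", doc_id, new_image))
        elif event_name == "REMOVE":
            actions.append(("delete", doc_id, None))
    # A real handler would send these to a search index or queue here.
    return actions


if __name__ == "__main__":
    sample = {"Records": [{
        "eventName": "INSERT",
        "dynamodb": {
            "Keys": {"customer_id": {"S": "abc"}, "order_date": {"S": "2024-04-18"}},
            "NewImage": {"total": {"N": "99.99"}},
        },
    }]}
    print(handler(sample))
```

Stream records can be delivered more than once, so whatever consumes these actions should be idempotent.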
DynamoDB Accelerator (DAX): In-memory cache that sits in front of DynamoDB. Reduces read latency from single-digit milliseconds to microseconds. Useful for read-heavy workloads. Fully managed — no cache invalidation headaches. Well, fewer headaches. DAX has a 5-minute default TTL and doesn't handle complex query patterns.
Transactions (TransactWriteItems / TransactGetItems): ACID transactions across up to 100 items in one or more tables. Double the cost of regular operations (2 WCUs per transacted write instead of 1). Cassandra has lightweight transactions (LWT) using Paxos, but they're 4-5x slower than regular writes and not recommended at scale.
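As a hedged illustration, the sketch below builds the `TransactItems` payload for an all-or-nothing "place order and decrement stock" operation. The `inventory` table, its `sku` key and `stock` attribute, and the function name are hypothetical; the function only constructs the request and does not call AWS.

```python
def build_order_transaction(customer_id: str, order_date: str, sku: str, quantity: int):
    """Build a TransactItems list: insert the order and decrement stock atomically."""
    return [
        {"Put": {
            "TableName": "orders",
            "Item": {
                "customer_id": {"S": customer_id},
                "order_date": {"S": order_date},
                "sku": {"S": sku},
                "quantity": {"N": str(quantity)},   # numbers travel as strings
            },
            # Fail the whole transaction if this order already exists
            "ConditionExpression": "attribute_not_exists(customer_id)",
        }},
        {"Update": {
            "TableName": "inventory",               # hypothetical second table
            "Key": {"sku": {"S": sku}},
            "UpdateExpression": "SET stock = stock - :q",
            # Fail the whole transaction if there is not enough stock
            "ConditionExpression": "stock >= :q",
            "ExpressionAttributeValues": {":q": {"N": str(quantity)}},
        }},
    ]


if __name__ == "__main__":
    for item in build_order_transaction("abc", "2024-04-18", "sku-1", 2):
        print(item)
```

With credentials configured, the list would be passed as `boto3.client("dynamodb").transact_write_items(TransactItems=items)`. If either condition fails, neither write is applied.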
The Comparison Table
| Factor | Cassandra | DynamoDB |
|---|---|---|
| Deployment | Self-managed (or DataStax Astra) | Fully managed (AWS) |
| Cloud | Any cloud, on-premises, multi-cloud | AWS only |
| Consistency | Tunable per query (ONE to ALL) | Eventual or strong (per read) |
| Transactions | LWT (Paxos, slow) | TransactWriteItems (double cost) |
| Secondary indexes | Limited, materialized views | GSI (flexible, but eventually consistent) |
| Pricing | Infrastructure cost (nodes, disk, network) | Per-request or provisioned capacity |
| Latency | 1-10ms typical (depends on tuning) | Single-digit ms typical (the SLA covers availability, not latency) |
| Max item size | 2GB per cell value in theory; keep partitions well under ~100MB in practice | 400KB per item |
| Throughput scaling | Add nodes (manual or auto) | Auto-scaling or on-demand |
| Query language | CQL (SQL-like) | API calls (SDK/boto3) |
| Backup | Manual (nodetool snapshot) | On-demand backups and point-in-time recovery (managed, opt-in) |
| Multi-region | Manual setup (NetworkTopologyStrategy) | Global Tables (one checkbox) |
| Vendor lock-in | None | Complete |
| Open source | Yes (Apache 2.0) | No |
DynamoDB Pricing — The Hidden Trap
DynamoDB's pricing is the number one source of surprise bills on AWS. Understanding it prevents architectural regret.
Provisioned Capacity
You pre-allocate read and write capacity units:
Write Capacity Unit (WCU): 1 write/sec for items up to 1KB
Read Capacity Unit (RCU): 1 read/sec for items up to 4KB (strongly consistent)
2 reads/sec for items up to 4KB (eventually consistent)
Pricing (us-east-1):
WCU: $0.00065 per hour
RCU: $0.00013 per hour
Example: 1,000 writes/sec + 5,000 reads/sec (eventually consistent):
- 1,000 WCUs × $0.00065 × 730 hours = $474.50/month
- 2,500 RCUs × $0.00013 × 730 hours = $237.25/month
- Total: $711.75/month + storage
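The capacity-unit arithmetic can be sketched as a few small functions. The unit definitions and hourly prices are the ones quoted above (they will drift over time); the function names are illustrative, and transactional reads are left out for brevity.

```python
import math


def write_capacity_units(item_kb: float, writes_per_sec: float,
                         transactional: bool = False) -> int:
    """WCUs needed: 1 per write per started 1 KB, doubled for transactional writes."""
    units_per_write = math.ceil(item_kb / 1)
    if transactional:
        units_per_write *= 2
    return math.ceil(units_per_write * writes_per_sec)


def read_capacity_units(item_kb: float, reads_per_sec: float,
                        eventually_consistent: bool = True) -> int:
    """RCUs needed: 1 per strongly consistent read per started 4 KB, half if eventual."""
    units = math.ceil(item_kb / 4) * reads_per_sec
    if eventually_consistent:
        units /= 2
    return math.ceil(units)


def provisioned_monthly_cost(wcu: int, rcu: int, wcu_hour: float = 0.00065,
                             rcu_hour: float = 0.00013, hours: int = 730) -> float:
    """Monthly throughput cost in USD, excluding storage."""
    return (wcu * wcu_hour + rcu * rcu_hour) * hours


if __name__ == "__main__":
    wcu = write_capacity_units(item_kb=1, writes_per_sec=1000)
    rcu = read_capacity_units(item_kb=4, reads_per_sec=5000)
    print(wcu, rcu, round(provisioned_monthly_cost(wcu, rcu), 2))
```

Note the rounding: a 1.1KB item costs 2 WCUs per write, so trimming items to fit under the 1KB and 4KB boundaries can halve the bill.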
On-Demand Capacity
Pay per request. No capacity planning. Sounds great until you see the per-request prices:
Same example: 1,000 writes/sec + 5,000 reads/sec (eventually consistent, so each read is billed as half a read request unit):
- 1,000 × 86,400 × 30 = 2.59B writes × $1.25/M = $3,240/month
- 5,000 × 86,400 × 30 = 12.96B reads × 0.5 = 6.48B read request units × $0.25/M = $1,620/month
- Total: $4,860/month + storage
On-demand is ~7x more expensive than provisioned for steady workloads. The ratio is the same with strongly consistent reads ($6,480 vs $949). Use it for spiky, unpredictable traffic. Use provisioned + auto-scaling for predictable patterns.
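The multiplier only means something when both modes are priced for the same read consistency, since an eventually consistent read is billed at half a unit in either mode. A sketch of the comparison, assuming the prices quoted in this section, items of at most 1KB for writes and 4KB for reads, and fully utilised provisioned capacity (function names are illustrative):

```python
SECONDS_PER_MONTH = 86_400 * 30
HOURS_PER_MONTH = 730


def on_demand_monthly_cost(writes_per_sec: float, reads_per_sec: float,
                           eventually_consistent: bool = True,
                           write_price_per_million: float = 1.25,
                           read_price_per_million: float = 0.25) -> float:
    """Monthly on-demand throughput cost in USD, excluding storage."""
    write_units = writes_per_sec * SECONDS_PER_MONTH
    read_units = reads_per_sec * SECONDS_PER_MONTH
    if eventually_consistent:
        read_units *= 0.5  # billed as half a read request unit
    return (write_units * write_price_per_million
            + read_units * read_price_per_million) / 1_000_000


def provisioned_monthly_cost(writes_per_sec: float, reads_per_sec: float,
                             eventually_consistent: bool = True,
                             wcu_hour: float = 0.00065,
                             rcu_hour: float = 0.00013) -> float:
    """Monthly provisioned throughput cost in USD at 100% utilisation."""
    rcus = reads_per_sec * (0.5 if eventually_consistent else 1.0)
    return (writes_per_sec * wcu_hour + rcus * rcu_hour) * HOURS_PER_MONTH


if __name__ == "__main__":
    for eventual in (True, False):
        od = on_demand_monthly_cost(1000, 5000, eventual)
        pv = provisioned_monthly_cost(1000, 5000, eventual)
        print(f"eventual={eventual}: on-demand ${od:,.2f}, "
              f"provisioned ${pv:,.2f}, ratio {od / pv:.1f}x")
```

Real provisioned tables rarely run at 100% utilisation, so the practical gap is narrower than the headline ratio; the fair comparison is against the capacity you would actually have to provision for peaks.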
Gotcha
DynamoDB charges for reads even if the item doesn't exist. A scan of a million items costs the same whether you filter down to 10 results or 10,000. Design your queries to avoid scans. Use Query (with partition key), not Scan.
Cassandra Cost Comparison
Cassandra's cost is infrastructure: EC2 instances + EBS volumes + network transfer.
Example cluster: 6 × i3.2xlarge nodes (8 vCPU, 61GB RAM, 1.9TB NVMe):
- On-demand: 6 × $0.624/hour × 730 = $2,733/month
- Reserved (1-year): 6 × $0.394/hour × 730 = $1,726/month
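At the example's 1,000 writes/sec, DynamoDB provisioned capacity ($711.75/month) is still cheaper than this cluster. The cluster's cost is fixed, though, and six i3.2xlarge nodes can typically absorb far more traffic than that, so for steady, high-throughput workloads, self-managed Cassandra on reserved instances is often cheaper than DynamoDB provisioned capacity. But factor in the engineer-hours to operate it. One senior SRE's salary dwarfs the infrastructure savings.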
ScyllaDB — The Third Option
ScyllaDB is Cassandra-compatible but written in C++ instead of Java. It uses a shard-per-core architecture — each CPU core handles its own share of data independently, with no shared mutable state and no garbage collector.
Discord's migration from Cassandra to ScyllaDB is the most cited case study:
| Metric | Cassandra | ScyllaDB |
|---|---|---|
| p99 read latency | 40-125ms | 15ms |
| p99 write latency | 5-70ms | 5ms |
| Nodes | 177 | 72 |
| Storage per node | 4TB (average) | 9TB |
| GC pauses | Frequent, unpredictable | None (no GC in C++) |
The key improvements:
No garbage collector. Java's GC causes unpredictable latency spikes. C++ allocates and frees memory deterministically. Discord's p99 latency improved because the worst-case outliers disappeared.
Shard-per-core. Each CPU core processes its own queue of requests. No thread contention, no lock contention. Context switching is minimized. The model comes from Seastar, ScyllaDB's async I/O framework, which can also use DPDK for networking.
Cassandra-compatible. Uses the same CQL query language, same drivers, same data model. Migration from Cassandra to ScyllaDB typically requires few or no application code changes, though the data itself still has to be moved. You swap the database and keep everything else.
The catch? ScyllaDB is maintained by a single company (ScyllaDB Inc.). If they go out of business, you're maintaining a C++ database yourself. Cassandra has the Apache Software Foundation and a massive community behind it. Longevity risk is real.
When to Choose What
Choose DynamoDB When
- You're all-in on AWS. DynamoDB integrates with Lambda, API Gateway, CloudWatch, IAM. The ecosystem multiplier is real.
- You don't want to operate a database. Zero nodes, zero repairs, zero JVM tuning. The ops team doesn't need to wake up at 3 AM for a compaction storm.
- Your workload is bursty. On-demand pricing handles Black Friday spikes without capacity planning.
- You need transactions. TransactWriteItems is limited (100 items, 2x cost) but it works. Cassandra's LWT is slower and more fragile.
- Your item size is under 400KB. DynamoDB's hard limit. If your items are larger, it's not the right tool.
Choose Cassandra When
- Multi-cloud or on-premises is a requirement. Cassandra runs identically on AWS, GCP, Azure, or bare metal. DynamoDB is AWS-only.
- You need tunable consistency. Per-query consistency levels (ONE, QUORUM, ALL) give you flexibility DynamoDB doesn't offer.
- Write throughput is extreme. Cassandra's masterless ring has no single write bottleneck (any node can coordinate a request), and a large enough cluster handles millions of writes per second. DynamoDB can too, but the cost scales linearly with throughput.
- You want to avoid vendor lock-in. Moving off DynamoDB means rewriting every database call. Moving off Cassandra means deploying Cassandra somewhere else.
- Your team can operate it. This is the real filter. If you don't have engineers who can tune JVM GC, manage compaction strategies, and run repairs — don't choose Cassandra.
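Tunable consistency comes down to simple arithmetic: a read is guaranteed to see the latest acknowledged write when the replicas contacted for the read and for the write must overlap, i.e. R + W > RF. A small sketch of that rule for a single data center (function names are illustrative):

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    for w, r in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE"), ("ONE", "ALL")]:
        print(f"write={w:6} read={r:6} RF=3 ->", reads_see_latest_write(w, r, 3))
```

This is the knob DynamoDB hides: QUORUM/QUORUM gives strong reads while tolerating one replica down at RF=3, whereas ONE/ONE trades that guarantee for lower latency.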
Choose ScyllaDB When
- You'd choose Cassandra but latency matters. ScyllaDB's p99 is consistently better because of no GC pauses.
- You want Cassandra compatibility without the Java overhead. Same CQL, same drivers, better performance.
- Your team is comfortable with a smaller community. Fewer Stack Overflow answers, fewer blog posts, fewer consultants.
Choose Neither When
- You need JOINs. Use PostgreSQL or CockroachDB.
- You need full-text search. Use Elasticsearch or PostgreSQL full-text.
- You need ad-hoc queries. Wide-column stores are query-driven. If you don't know your queries upfront, you'll fight the database.
- Your data is under 100GB. PostgreSQL handles this trivially. Don't add distributed database complexity for small data.
Patterns for System Design Interviews
Pattern 1: "Design a Session Store"
Both DynamoDB and Cassandra work. The comparison:
| Approach | DynamoDB | Cassandra |
|---|---|---|
| Schema | session_id (PK), user_data, TTL | PRIMARY KEY (session_id), TTL |
| TTL | Built-in, automatic deletion | Built-in via USING TTL |
| Latency | Single-digit ms | 1-5ms with CL=ONE |
| Ops effort | None | Cluster management |
| Cost | Pay per request | Fixed infrastructure |
For an interview, DynamoDB is the simpler answer. Mention Cassandra as an alternative for multi-cloud.
Pattern 2: "Design a Chat System Backend"
This is where Cassandra shines:
-- Messages by chat (primary read pattern)
CREATE TABLE messages_by_chat (
chat_id UUID, bucket INT, sent_at TIMESTAMP, message_id UUID,
sender_id UUID, body TEXT,
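-- message_id in the key keeps two messages with the same timestamp from overwriting each other
PRIMARY KEY ((chat_id, bucket), sent_at, message_id)
) WITH CLUSTERING ORDER BY (sent_at DESC, message_id DESC);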
-- Messages by user (secondary read pattern)
CREATE TABLE messages_by_sender (
sender_id UUID, day DATE, sent_at TIMESTAMP, message_id UUID,
chat_id UUID, body TEXT,
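-- message_id in the key keeps two messages with the same timestamp from overwriting each other
PRIMARY KEY ((sender_id, day), sent_at, message_id)
) WITH CLUSTERING ORDER BY (sent_at DESC, message_id DESC);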
Write to both tables on every message. Read from whichever matches the query. Time-bucketing prevents partition blowup.
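The application has to compute the `bucket` value on every write and know which buckets to read for a time range. A minimal sketch, assuming 10-day buckets counted from the Unix epoch (the bucket width and function names are assumptions, not something the schema dictates):

```python
from datetime import datetime, timezone

BUCKET_DAYS = 10  # assumed bucket width; tune so partitions stay comfortably small
BUCKET_SECONDS = BUCKET_DAYS * 86_400


def bucket_for(sent_at: datetime) -> int:
    """Bucket number for a timezone-aware timestamp."""
    return int(sent_at.timestamp()) // BUCKET_SECONDS


def buckets_between(start: datetime, end: datetime) -> list:
    """Buckets to query for [start, end], newest first to match the DESC ordering."""
    return list(range(bucket_for(end), bucket_for(start) - 1, -1))


if __name__ == "__main__":
    start = datetime(2024, 4, 1, tzinfo=timezone.utc)
    end = datetime(2024, 4, 18, tzinfo=timezone.utc)
    # One query per (chat_id, bucket) partition, stopping once enough rows are found
    print(buckets_between(start, end))
```

Reading "the latest 50 messages" then means querying the newest bucket first and walking backwards only if it returns fewer than 50 rows.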
DynamoDB would work too, but the 400KB item limit means you can't store message attachments inline, and GSIs for the secondary query pattern add cost and latency.
Pattern 3: "Your Boss Says Migrate from Cassandra to DynamoDB"
Red flags to raise in an interview:
- DynamoDB has a 400KB item size limit. If any Cassandra rows exceed this, they need restructuring.
- CQL queries don't map 1:1 to DynamoDB API calls. The migration isn't just swapping drivers.
- DynamoDB's cost model is different. Provisioned vs on-demand. Run the numbers before committing.
- GSIs are eventually consistent. If any Cassandra reads use LOCAL_QUORUM, the DynamoDB equivalent (strongly consistent read) doesn't work on GSIs.
- No tunable consistency. DynamoDB is either eventual or strong. No QUORUM, no ONE, no ALL.
Trade-offs Table
| Decision | Cassandra | DynamoDB |
|---|---|---|
| Operational burden | High — nodes, repairs, compaction, JVM | Zero — fully managed |
| Cost at high throughput | Lower (fixed infrastructure) | Higher (per-request pricing) |
| Cost at low throughput | Higher (minimum cluster size) | Lower (pay per request) |
| Vendor lock-in | None | Complete (AWS) |
| Latency predictability | Variable (GC pauses) | Consistent in practice (the SLA covers availability, not latency) |
| Multi-region | Manual but flexible | Global Tables (one click) |
| Consistency options | ONE, QUORUM, ALL, etc. | Eventual or Strong (binary) |
| Item size limit | 2GB per value in theory, far less in practice | 400KB hard limit |
| Secondary indexes | Limited, slow | GSI (flexible, but extra cost) |
| Ecosystem integration | Generic (any cloud) | Deep AWS integration |

Interview Gotchas
"When would you pick DynamoDB over Cassandra?"
When the team doesn't have Cassandra operations expertise and the application is AWS-native. DynamoDB's value is zero ops, not superior technology. The database that works because nobody has to babysit it beats the database that's theoretically better but crashes when the SRE is on vacation.
"What's the DynamoDB 400KB limit and how do you work around it?"
Each item (row) in DynamoDB cannot exceed 400KB. For larger payloads, store the metadata in DynamoDB and the payload in S3. Link them with the S3 key stored as an attribute. This is the standard pattern for documents, images, or any large blob.
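A sketch of that pointer pattern, using plain dictionaries as stand-ins for the DynamoDB table and the S3 bucket so it runs without AWS. The inline threshold, key prefix and function names are assumptions; a real version would call `put_item` and `put_object` where the comments indicate.

```python
MAX_ITEM_BYTES = 400 * 1024        # DynamoDB's per-item limit
INLINE_THRESHOLD = 300 * 1024      # assumed headroom for attribute names and metadata


def store_document(table: dict, object_store: dict, doc_id: str, payload: bytes) -> dict:
    """Store small payloads inline; offload large ones and keep only a pointer."""
    item = {"doc_id": doc_id, "size_bytes": len(payload)}
    if len(payload) <= INLINE_THRESHOLD:
        item["body"] = payload
    else:
        s3_key = f"documents/{doc_id}"
        object_store[s3_key] = payload   # stand-in for an S3 put_object call
        item["s3_key"] = s3_key
    table[doc_id] = item                 # stand-in for a DynamoDB put_item call
    return item


def load_document(table: dict, object_store: dict, doc_id: str) -> bytes:
    """Return the payload, following the S3 pointer when there is one."""
    item = table[doc_id]
    if "body" in item:
        return item["body"]
    return object_store[item["s3_key"]]


if __name__ == "__main__":
    table, bucket = {}, {}
    store_document(table, bucket, "small", b"x" * 1024)
    store_document(table, bucket, "large", b"x" * (2 * 1024 * 1024))
    print(sorted(table["small"]), sorted(table["large"]), list(bucket))
```

Writing the object before the item means a crash in between leaves an orphaned object rather than a dangling pointer, which is usually the easier failure to clean up.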
"Can you do JOINs in either?"
No. Neither Cassandra nor DynamoDB supports JOINs. The workaround is denormalization — store the joined data together in one table. If you need JOINs, use a relational database. Trying to fake JOINs with multiple queries and application-side merging is a performance anti-pattern at scale.
"What's the biggest risk of migrating off DynamoDB?"
Lock-in isn't just the API. It's DynamoDB Streams feeding Lambdas, DAX caching, IAM-based access control, CloudWatch alarms, and point-in-time recovery. Moving to Cassandra means rebuilding all of that infrastructure. The database swap is 20% of the work. The ecosystem swap is 80%.
"How does DynamoDB handle hot partitions?"
DynamoDB has adaptive capacity — it automatically shifts throughput capacity from idle partitions to hot ones. But there's a limit. A single partition can handle up to 3,000 RCUs and 1,000 WCUs per second. Beyond that, requests get throttled even if the table has spare total capacity. The fix is distributing writes more evenly across partition keys. For example, appending a random suffix to the partition key: product#1234#shard3.
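A minimal sketch of that suffix scheme, assuming 10 shards per logical key (the shard count and function names are illustrative). Writes pick a shard at random; reads have to fan out across every shard and merge the results.

```python
import random

NUM_SHARDS = 10  # assumed; size it so each shard stays under the per-partition limits


def sharded_key(base_key: str, rng=random) -> str:
    """Partition key for a write: the logical key plus a random shard suffix."""
    return f"{base_key}#shard{rng.randrange(NUM_SHARDS)}"


def all_shard_keys(base_key: str) -> list:
    """Every partition key a reader must query to see all writes for base_key."""
    return [f"{base_key}#shard{i}" for i in range(NUM_SHARDS)]


if __name__ == "__main__":
    print(sharded_key("product#1234"))
    print(all_shard_keys("product#1234"))
```

The trade-off is read amplification: one logical read becomes `NUM_SHARDS` queries, so this fits write-heavy keys such as counters or event logs better than keys that are read on every request.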
"What is ScyllaDB and when would you suggest it?"
ScyllaDB is a Cassandra-compatible database written in C++ instead of Java. Same CQL, same drivers, same data model. No garbage collection pauses. Discord migrated from Cassandra to ScyllaDB and saw p99 latency drop from 40-125ms to 15ms while using 60% fewer nodes. Suggest it when the team would choose Cassandra but can't tolerate Java GC latency spikes.
Key Takeaways
| Concept | What to Remember |
|---|---|
| Cassandra = control + ops burden | Run anywhere, tune everything, manage everything |
| DynamoDB = convenience + lock-in | Zero ops, AWS only, pricing traps if you don't model costs |
| On-demand DynamoDB costs several times more at steady load | Use provisioned + auto-scaling for steady workloads |
| 400KB item limit | DynamoDB's hard constraint — offload large items to S3 |
| ScyllaDB | Cassandra-compatible, C++, no GC pauses. Discord's migration case study. |
| Choose based on team capability | The best database is the one your team can operate reliably |
| Neither does JOINs | If you need relational queries, use a relational database |