Cassandra vs DynamoDB — Trade-offs and When to Choose

TL;DR

Cassandra gives you control and portability at the cost of operational pain; DynamoDB gives you zero-ops convenience at the cost of vendor lock-in and unpredictable bills — and for most teams, the right answer is whichever one their infrastructure team can actually run.

What It Is

Cassandra and DynamoDB solve the same fundamental problem: store massive amounts of data with predictable, low-latency reads and writes, distributed across multiple nodes. Both use partition keys for data distribution. Both sacrifice relational features (JOINs, complex transactions) for horizontal scalability.

But they make opposite bets on the build-vs-buy spectrum.

Cassandra says: "Here's the engine. You maintain it." Open-source, run it anywhere, tune every knob. Full control, full responsibility.

DynamoDB says: "Here's the service. We maintain it." Fully managed, AWS-only, fewer knobs. Zero operational overhead, total vendor lock-in.

Amazon built Dynamo, the internal key-value store that preceded DynamoDB, after experiencing severe outages with its previous database systems during peak shopping events. The 2007 Dynamo paper described the architecture. DynamoDB launched in 2012 as a managed service based on those ideas. Today it powers Amazon.com's shopping cart, Alexa, and Twitch.

Cassandra was open-sourced by Facebook in 2008 after they built it for inbox search. It's now an Apache project used by Apple (150K+ nodes), Netflix, Discord (before their ScyllaDB migration), and Uber.

Cassandra — The Self-Managed Option

Architecture

Cassandra uses a peer-to-peer ring topology. No master node. Every node is equal. Any node can serve any request.

         Node A
        /      \
      /          \
   Node F      Node B
     |            |
   Node E      Node C
      \          /
        \      /
         Node D

Every node talks to every other via gossip protocol.
No single point of failure.

Gossip protocol — every second, each node picks 1-3 random nodes and exchanges state information (who's alive, who's dead, who owns which token ranges). Within seconds, the entire cluster knows about topology changes. No ZooKeeper. No coordination service. Just gossip.
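To make the "within seconds" claim concrete, here is a minimal simulation sketch. It assumes a simplified push-only model in which every informed node tells up to three random peers per round; real Cassandra gossip exchanges versioned state digests in both directions, so treat the function name `gossip_rounds` and its parameters as illustrative only.

```python
import random

def gossip_rounds(num_nodes: int, fanout: int = 3, seed: int = 0) -> int:
    """Count rounds until every node has heard a rumor that starts at node 0.

    Simplified push-only model (an assumption of this sketch): each round,
    every informed node pushes the rumor to `fanout` random peers.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        newly_informed = set()
        for node in informed:
            peers = [n for n in range(num_nodes) if n != node]
            newly_informed.update(rng.sample(peers, min(fanout, len(peers))))
        informed |= newly_informed
    return rounds

if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, "nodes ->", gossip_rounds(size), "rounds")
```

Because the informed set can grow by up to 4x per round, the round count grows roughly with the logarithm of the cluster size, which is why one-second gossip intervals are enough for large rings.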

Virtual nodes (vnodes): instead of each node owning one contiguous range on the token ring, each node owns many small ranges (256 by default before Cassandra 4.0, 16 since). This makes data distribution more even and rebalancing smoother when nodes are added or removed.
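The effect of vnodes on balance can be sketched with a toy token ring. This is an illustration under stated assumptions, not Cassandra's implementation: tokens come from MD5 rather than Cassandra's default Murmur3 partitioner, vnode tokens are derived from made-up names, and the helpers (`build_ring`, `owner`, `load_spread`) are hypothetical.

```python
import bisect
import hashlib
from collections import Counter

def token(value: str) -> int:
    """Hash a string onto a 64-bit ring (MD5 for illustration only)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def build_ring(nodes, vnodes_per_node):
    """Return parallel lists (tokens, owners), sorted by token."""
    pairs = sorted((token(f"{node}/vnode-{i}"), node)
                   for node in nodes for i in range(vnodes_per_node))
    return [t for t, _ in pairs], [n for _, n in pairs]

def owner(tokens, owners, key: str) -> str:
    """The first vnode token at or after the key's token owns the key (wrapping)."""
    idx = bisect.bisect_left(tokens, token(key)) % len(tokens)
    return owners[idx]

def load_spread(nodes, vnodes_per_node, num_keys=20_000) -> float:
    """Busiest node's key count divided by the idlest node's (1.0 = perfectly even)."""
    tokens, owners = build_ring(nodes, vnodes_per_node)
    counts = Counter({node: 0 for node in nodes})
    counts.update(owner(tokens, owners, f"key-{k}") for k in range(num_keys))
    return max(counts.values()) / max(min(counts.values()), 1)

if __name__ == "__main__":
    nodes = ["A", "B", "C", "D", "E", "F"]
    for vnodes in (1, 16, 256):
        print(f"{vnodes:>3} vnodes/node -> spread {load_spread(nodes, vnodes):.2f}")
```

With one token per node the busiest node typically holds several times the data of the idlest; with many vnodes per node the ratio moves close to 1.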

Cassandra Query Language (CQL)

CQL looks like SQL but has strict constraints:

-- Table definition
CREATE TABLE orders (
    customer_id UUID,
    order_date  DATE,
    order_id    UUID,
    total       DECIMAL,
    items       LIST<TEXT>,
    PRIMARY KEY ((customer_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);

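-- This works (partition key + clustering key prefix)
-- UUID literals are written without quotes
SELECT * FROM orders
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
AND order_date >= '2024-01-01';

-- This FAILS without ALLOW FILTERING (no partition key)
SELECT * FROM orders
WHERE order_date >= '2024-01-01';

-- This FAILS without ALLOW FILTERING (order_id is restricted,
-- but the preceding clustering column order_date is skipped)
SELECT * FROM orders
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
AND order_id = 123e4567-e89b-12d3-a456-426614174001;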

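CQL enforces the data model. You can't run efficient ad-hoc queries: every query must include the full partition key, and anything else needs ALLOW FILTERING (which scans across the cluster) or a secondary index. If your query doesn't fit your table's primary key, you need a different table.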
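The restriction rules can be approximated in a few lines. The sketch below is a simplified model, not Cassandra's actual planner: it ignores IN clauses, secondary indexes, and ALLOW FILTERING, and the function name `why_rejected` is invented for illustration.

```python
def why_rejected(partition_key, clustering, eq, ranged=()):
    """Return None if the WHERE clause fits the primary key, else a reason string.

    partition_key / clustering: column names in primary-key order.
    eq: columns restricted with '='.  ranged: columns restricted with <, >, <=, >=.
    """
    eq, ranged = set(eq), set(ranged)
    unknown = (eq | ranged) - set(partition_key) - set(clustering)
    if unknown:
        return f"non-key column(s) restricted: {sorted(unknown)}"
    missing = [c for c in partition_key if c not in eq]
    if missing:
        return f"partition key column(s) need '=': {missing}"
    prefix_open = True
    for col in clustering:
        if (col in eq or col in ranged) and not prefix_open:
            return f"clustering column {col!r} follows a skipped or ranged column"
        if col not in eq:
            prefix_open = False  # a gap or a range ends the usable prefix
    return None

if __name__ == "__main__":
    pk, ck = ["customer_id"], ["order_date", "order_id"]
    print(why_rejected(pk, ck, eq={"customer_id"}, ranged={"order_date"}))  # None
    print(why_rejected(pk, ck, eq=set(), ranged={"order_date"}))
    print(why_rejected(pk, ck, eq={"customer_id", "order_id"}))
```

The three calls mirror the three SELECT statements on the orders table: the first is accepted, the second lacks the partition key, and the third skips order_date.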

Operational Considerations

Running Cassandra means operating a distributed system. Here's what that actually involves:

Task	Frequency	Pain Level
Monitoring cluster health	Continuous	Low (Prometheus + Grafana)
Running repairs	Weekly	Medium (can impact performance)
Adding/removing nodes	As needed	Medium (rebalancing takes time)
JVM tuning (GC pauses)	Initially + after incidents	High (requires deep Java knowledge)
Compaction tuning	Periodically	Medium (wrong strategy = cascading failures)
Schema migrations	Per release	Low (CQL ALTER is simple)
Upgrading Cassandra versions	Quarterly-ish	High (rolling upgrades, version compatibility)
Handling data center failures	Rarely	Very High (manual intervention required)

The JVM is the elephant in the room. Cassandra runs on Java. Garbage collection pauses can spike read latency from 5ms to 500ms. Tuning the GC — choosing between G1GC and ZGC, setting heap sizes, adjusting young generation ratios — is a dark art that requires specialized knowledge. This is one of the main reasons Discord left Cassandra for ScyllaDB (written in C++, no GC pauses).

DynamoDB — The Managed Option

Architecture

DynamoDB's internals are proprietary, but the public information reveals:

                      ┌──────────────────┐
                      │   Request Router  │
                      │   (managed by AWS)│
                      └────────┬─────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
        ┌──────────┐    ┌──────────┐    ┌──────────┐
        │ Storage   │    │ Storage   │    │ Storage   │
        │ Node A    │    │ Node B    │    │ Node C    │
        │ (3 copies │    │ (3 copies │    │ (3 copies │
        │  per item)│    │  per item)│    │  per item)│
        └──────────┘    └──────────┘    └──────────┘

You never see the nodes. You never configure replication. You never run repairs. AWS does all of it. Your interface is an API endpoint and a billing dashboard.

Data Model

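DynamoDB uses a partition key (hash key) and an optional sort key (range key). This is functionally similar to Cassandra's partition key and clustering key, with one difference: DynamoDB allows exactly one attribute for each, while Cassandra allows composite partition keys and multiple clustering columns.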

import boto3
from decimal import Decimal
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')

# Create table
table = dynamodb.create_table(
    TableName='orders',
    KeySchema=[
        {'AttributeName': 'customer_id', 'KeyType': 'HASH'},    # partition key
        {'AttributeName': 'order_date',  'KeyType': 'RANGE'},    # sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'customer_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_date',  'AttributeType': 'S'},
    ],
    BillingMode='PAY_PER_REQUEST'
)
table.wait_until_exists()  # create_table returns before the table is ACTIVE

# Write (numbers must be Decimal, not float)
table.put_item(Item={
    'customer_id': 'abc',
    'order_date': '2024-04-18',
    'order_id': 'xyz',
    'total': Decimal('99.99'),
})

# Query (must include partition key)
response = table.query(
    KeyConditionExpression=Key('customer_id').eq('abc')
                          & Key('order_date').gte('2024-01-01')
)
items = response['Items']

DynamoDB-Specific Features

Global Secondary Index (GSI): Creates a copy of the table with a different partition key and sort key. Queries on the GSI hit the copy, not the base table. GSI reads are always eventually consistent; strongly consistent reads are not available on a GSI.

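# GSI: query orders by product
table.meta.client.update_table(
    TableName='orders',
    # Key attributes of the new index must be declared here
    AttributeDefinitions=[
        {'AttributeName': 'product_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_date', 'AttributeType': 'S'},
    ],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'product-index',
            'KeySchema': [
                {'AttributeName': 'product_id', 'KeyType': 'HASH'},
                {'AttributeName': 'order_date', 'KeyType': 'RANGE'},
            ],
            'Projection': {'ProjectionType': 'ALL'},
        }
    }]
)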

Local Secondary Index (LSI): Same partition key as the base table but a different sort key. Must be created at table creation time. Strongly consistent reads available.

DynamoDB Streams: Change data capture (CDC). Every write to the table emits an event to a stream. Attach a Lambda function to react to changes — like updating a search index, sending notifications, or replicating to another region.
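As a rough sketch of the Lambda side, the handler below groups changed primary keys by event type. It assumes the standard stream record shape (`eventName`, and `dynamodb.Keys` in attribute-value format); what you do with the keys afterwards (search indexing, notifications) is left out, and the handler name is arbitrary.

```python
def handler(event, context):
    """Collect the primary keys of changed items, grouped by change type."""
    changes = {"INSERT": [], "MODIFY": [], "REMOVE": []}
    for record in event.get("Records", []):
        # Keys arrive in attribute-value format, e.g. {"customer_id": {"S": "abc"}}
        typed_keys = record["dynamodb"]["Keys"]
        plain_keys = {name: next(iter(value.values()))
                      for name, value in typed_keys.items()}
        changes[record["eventName"]].append(plain_keys)
    return changes

if __name__ == "__main__":
    sample_event = {"Records": [{
        "eventName": "INSERT",
        "dynamodb": {"Keys": {"customer_id": {"S": "abc"},
                              "order_date": {"S": "2024-04-18"}}},
    }]}
    print(handler(sample_event, None))
```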

DynamoDB Accelerator (DAX): In-memory cache that sits in front of DynamoDB. Reduces read latency from single-digit milliseconds to microseconds. Useful for read-heavy workloads. Fully managed — no cache invalidation headaches. Well, fewer headaches. DAX has a 5-minute default TTL and doesn't handle complex query patterns.

Transactions (TransactWriteItems / TransactGetItems): ACID transactions across up to 100 items in one or more tables. Double the cost of regular operations (2 WCUs per transacted write instead of 1). Cassandra has lightweight transactions (LWT) using Paxos, but they're 4-5x slower than regular writes and not recommended at scale.
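A transaction request might look like the sketch below, which builds the `TransactItems` payload for the low-level client. The `inventory` table, its `stock` attribute, and the function name are assumptions made for illustration; only the `orders` table comes from the earlier example.

```python
def build_order_transaction(customer_id, order_date, order_id, product_id):
    """Build TransactItems: insert an order and decrement stock, all or nothing."""
    return [
        {"Put": {
            "TableName": "orders",
            "Item": {
                "customer_id": {"S": customer_id},
                "order_date": {"S": order_date},
                "order_id": {"S": order_id},
            },
            # Fail the whole transaction if this order already exists
            "ConditionExpression": "attribute_not_exists(customer_id)",
        }},
        {"Update": {
            "TableName": "inventory",  # hypothetical table
            "Key": {"product_id": {"S": product_id}},
            "UpdateExpression": "SET stock = stock - :one",
            # Fail the whole transaction if the product is out of stock
            "ConditionExpression": "stock >= :one",
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }},
    ]

if __name__ == "__main__":
    items = build_order_transaction("abc", "2024-04-18", "xyz", "p-1")
    print(len(items), "operations in one transaction")
```

The payload would be sent with `boto3.client('dynamodb').transact_write_items(TransactItems=items)`; if either condition fails, neither write is applied.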

The Comparison Table

Factor	Cassandra	DynamoDB
Deployment	Self-managed (or DataStax Astra)	Fully managed (AWS)
Cloud	Any cloud, on-premises, multi-cloud	AWS only
Consistency	Tunable per query (ONE to ALL)	Eventual or strong (per read)
Transactions	LWT (Paxos, slow)	TransactWriteItems (double cost)
Secondary indexes	Limited, materialized views	GSI (flexible, but eventually consistent)
Pricing	Infrastructure cost (nodes, disk, network)	Per-request or provisioned capacity
Latency	1-10ms typical (depends on tuning)	Single-digit ms typical (the SLA covers availability, not latency)
Max item size	2GB per column value (hard limit; keep values and partitions far smaller in practice)	400KB per item
Throughput scaling	Add nodes (manual or auto)	Auto-scaling or on-demand
Query language	CQL (SQL-like)	API calls (SDK/boto3)
Backup	Manual (nodetool snapshot)	Point-in-time recovery (continuous once enabled)
Multi-region	Manual setup (NetworkTopologyStrategy)	Global Tables (one checkbox)
Vendor lock-in	None	Complete
Open source	Yes (Apache 2.0)	No

DynamoDB Pricing — The Hidden Trap

DynamoDB's pricing is the number one source of surprise bills on AWS. Understanding it prevents architectural regret.

Provisioned Capacity

You pre-allocate read and write capacity units:

Write Capacity Unit (WCU): 1 write/sec for items up to 1KB
Read Capacity Unit (RCU):  1 read/sec for items up to 4KB (strongly consistent)
                           2 reads/sec for items up to 4KB (eventually consistent)

Pricing (us-east-1):
  WCU: $0.00065 per hour
  RCU: $0.00013 per hour

Example: 1,000 writes/sec + 5,000 reads/sec (eventually consistent):
- 1,000 WCUs × $0.00065 × 730 hours = $474.50/month
- 2,500 RCUs × $0.00013 × 730 hours = $237.25/month
- Total: $711.75/month + storage
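The same arithmetic as a small calculator, so other request rates and item sizes can be plugged in. It uses the us-east-1 hourly rates quoted above and rounds item sizes up to the next 1KB (writes) or 4KB (reads); the function names are illustrative, and storage is not included.

```python
import math

HOURS_PER_MONTH = 730
WCU_PER_HOUR = 0.00065  # us-east-1 rate quoted in the article
RCU_PER_HOUR = 0.00013

def provisioned_capacity(writes_per_sec, reads_per_sec, item_kb=1.0,
                         strongly_consistent=False):
    """Return (WCUs, RCUs) needed to sustain the given request rates."""
    wcus = writes_per_sec * math.ceil(item_kb / 1)
    # An eventually consistent read costs half a unit per 4KB step
    units_per_read = math.ceil(item_kb / 4) * (1.0 if strongly_consistent else 0.5)
    rcus = math.ceil(reads_per_sec * units_per_read)
    return wcus, rcus

def provisioned_monthly_cost(wcus, rcus):
    """Monthly throughput cost in dollars, excluding storage."""
    return (wcus * WCU_PER_HOUR + rcus * RCU_PER_HOUR) * HOURS_PER_MONTH

if __name__ == "__main__":
    wcus, rcus = provisioned_capacity(1_000, 5_000)
    print(wcus, rcus, f"${provisioned_monthly_cost(wcus, rcus):,.2f}/month")
```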

On-Demand Capacity

Pay per request. No capacity planning. Sounds great until you see the per-request prices:

On-demand pricing:
  Write: $1.25 per million writes
  Read:  $0.25 per million reads

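Same example: 1,000 writes/sec + 5,000 reads/sec, billing each read as one full read request unit:
- 1,000 × 86,400 × 30 = 2.59B writes × $1.25/M = $3,240/month
- 5,000 × 86,400 × 30 = 12.96B reads × $0.25/M = $3,240/month
- Total: $6,480/month + storage

Eventually consistent reads bill at half a read request unit on-demand as well, so the like-for-like figure for reads is $1,620/month and the total is $4,860/month.

On-demand is roughly 7-9x more expensive than provisioned for steady workloads ($4,860 to $6,480 versus $711.75 here). Use it for spiky, unpredictable traffic. Use provisioned + auto-scaling for predictable patterns.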
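A short sketch of the on-demand arithmetic, using the per-million prices quoted above and the $711.75 provisioned figure from the earlier example. The `read_units_per_read` parameter is an assumption of this sketch: 1.0 bills each read as a full unit, 0.5 models eventually consistent reads of items up to 4KB.

```python
SECONDS_PER_MONTH = 86_400 * 30
WRITE_PRICE_PER_MILLION = 1.25  # on-demand rates quoted in the article
READ_PRICE_PER_MILLION = 0.25
PROVISIONED_MONTHLY = 711.75    # provisioned total from the earlier example

def on_demand_monthly_cost(writes_per_sec, reads_per_sec, read_units_per_read=1.0):
    """Monthly on-demand throughput cost in dollars, excluding storage."""
    write_units = writes_per_sec * SECONDS_PER_MONTH
    read_units = reads_per_sec * SECONDS_PER_MONTH * read_units_per_read
    return (write_units / 1e6 * WRITE_PRICE_PER_MILLION
            + read_units / 1e6 * READ_PRICE_PER_MILLION)

if __name__ == "__main__":
    for units in (1.0, 0.5):
        cost = on_demand_monthly_cost(1_000, 5_000, read_units_per_read=units)
        print(f"{units} units/read: ${cost:,.0f}/month, "
              f"{cost / PROVISIONED_MONTHLY:.1f}x provisioned")
```

The ratio only holds when provisioned capacity is fully used. A table that sits idle most of the day pays for unused provisioned units, which is where on-demand wins.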

Gotcha

DynamoDB charges for reads even if the item doesn't exist. A scan of a million items costs the same whether you filter down to 10 results or 10,000. Design your queries to avoid scans. Use Query (with partition key), not Scan.
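To see why filtering does not help, here is an approximate read-unit estimate. It assumes consumption is driven by the data read (in 4KB steps, halved for eventually consistent reads) and ignores the per-page rounding that real Scan and Query calls apply at 1MB boundaries; `read_units` is an illustrative helper.

```python
import math

def read_units(kb_read: float, strongly_consistent: bool = False) -> float:
    """Approximate RCUs consumed when a request reads `kb_read` of data."""
    units = math.ceil(kb_read / 4)
    return units if strongly_consistent else units / 2

if __name__ == "__main__":
    # Scan: one million 1KB items are read no matter how many pass the filter
    print("Scan :", read_units(1_000_000), "RCUs")
    # Query: only the ten matching 1KB items in one partition are read
    print("Query:", read_units(10), "RCUs")
```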

Cassandra Cost Comparison

Cassandra's cost is infrastructure: EC2 instances + EBS volumes + network transfer.

Example cluster: 6 × i3.2xlarge nodes (8 vCPU, 61GB RAM, 1.9TB NVMe):
- On-demand: 6 × $0.624/hour × 730 = $2,733/month
- Reserved (1-year): 6 × $0.394/hour × 730 = $1,726/month

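For steady, high-throughput workloads, self-managed Cassandra on reserved instances is often cheaper than DynamoDB provisioned capacity. At the 1,000 writes/sec example above, DynamoDB provisioned ($711.75/month) still undercuts this cluster; the balance tips toward Cassandra as sustained throughput grows, because the cluster's cost is fixed while DynamoDB's grows with every capacity unit. But factor in the engineer-hours to operate it. One senior SRE's salary dwarfs the infrastructure savings.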

ScyllaDB — The Third Option

ScyllaDB is Cassandra-compatible but written in C++ instead of Java. It uses a shard-per-core architecture — each CPU core handles its own share of data independently, with no shared mutable state and no garbage collector.

Discord's migration from Cassandra to ScyllaDB is the most cited case study:

Metric	Cassandra	ScyllaDB
p99 read latency	40-125ms	15ms
p99 write latency	5-70ms	5ms
Nodes	177	72
Storage per node	4TB (average)	9TB
GC pauses	Frequent, unpredictable	None (no GC in C++)

The key improvements:

No garbage collector. Java's GC causes unpredictable latency spikes. C++ allocates and frees memory deterministically. Discord's p99 latency improved because the worst-case outliers disappeared.

Shard-per-core. Each CPU core processes its own queue of requests. No thread contention, no lock contention. Context switching is minimized. This model comes from Seastar, the asynchronous C++ framework ScyllaDB is built on.

Cassandra-compatible. Uses the same CQL query language, same drivers, same data model. Migration from Cassandra to ScyllaDB typically requires few or no application code changes. You swap the database and keep everything else.

The catch? ScyllaDB is maintained by a single company (ScyllaDB Inc.). If they go out of business, you're maintaining a C++ database yourself. Cassandra has the Apache Software Foundation and a massive community behind it. Longevity risk is real.

When to Choose What

Choose DynamoDB When

You're all-in on AWS. DynamoDB integrates with Lambda, API Gateway, CloudWatch, IAM. The ecosystem multiplier is real.
You don't want to operate a database. Zero nodes, zero repairs, zero JVM tuning. The ops team doesn't need to wake up at 3 AM for a compaction storm.
Your workload is bursty. On-demand pricing handles Black Friday spikes without capacity planning.
You need transactions. TransactWriteItems is limited (100 items, 2x cost) but it works. Cassandra's LWT is slower and more fragile.
Your item size is under 400KB. DynamoDB's hard limit. If your items are larger, it's not the right tool.

Choose Cassandra When

Multi-cloud or on-premises is a requirement. Cassandra runs identically on AWS, GCP, Azure, or bare metal. DynamoDB is AWS-only.
You need tunable consistency. Per-query consistency levels (ONE, QUORUM, ALL) give you flexibility DynamoDB doesn't offer.
Write throughput is extreme. Cassandra's ring topology has no master node to bottleneck writes (any node can coordinate a request), so it handles millions of writes per second. DynamoDB can too, but the cost scales linearly with throughput.
You want to avoid vendor lock-in. Moving off DynamoDB means rewriting every database call. Moving off Cassandra means deploying Cassandra somewhere else.
Your team can operate it. This is the real filter. If you don't have engineers who can tune JVM GC, manage compaction strategies, and run repairs — don't choose Cassandra.
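The tunable-consistency point from the list above comes down to simple arithmetic: a read is guaranteed to overlap the latest acknowledged write when the replicas contacted for the read plus the replicas contacted for the write exceed the replication factor. A minimal sketch for a single data center (the function and table names are illustrative):

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a strict majority of replicas."""
    return replication_factor // 2 + 1

# Replicas that must acknowledge at each consistency level
REPLICAS_REQUIRED = {
    "ONE": lambda rf: 1,
    "TWO": lambda rf: 2,
    "QUORUM": quorum,
    "ALL": lambda rf: rf,
}

def reads_see_latest_write(write_cl: str, read_cl: str, rf: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    return REPLICAS_REQUIRED[write_cl](rf) + REPLICAS_REQUIRED[read_cl](rf) > rf

if __name__ == "__main__":
    for write_cl, read_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        print(write_cl, read_cl, reads_see_latest_write(write_cl, read_cl, rf=3))
```

With RF=3, QUORUM writes plus QUORUM reads (2 + 2 > 3) give read-your-writes behavior, while ONE plus ONE trades that guarantee for lower latency.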

Choose ScyllaDB When

You'd choose Cassandra but latency matters. ScyllaDB's p99 is consistently better because of no GC pauses.
You want Cassandra compatibility without the Java overhead. Same CQL, same drivers, better performance.
Your team is comfortable with a smaller community. Fewer Stack Overflow answers, fewer blog posts, fewer consultants.

Choose Neither When

You need JOINs. Use PostgreSQL or CockroachDB.
You need full-text search. Use Elasticsearch or PostgreSQL full-text.
You need ad-hoc queries. Wide-column stores are query-driven. If you don't know your queries upfront, you'll fight the database.
Your data is under 100GB. PostgreSQL handles this trivially. Don't add distributed database complexity for small data.

Patterns for System Design Interviews

Pattern 1: "Design a Session Store"

Both DynamoDB and Cassandra work. The comparison:

Approach	DynamoDB	Cassandra
Schema	`session_id` (PK), `user_data`, TTL	`PRIMARY KEY (session_id)`, TTL
TTL	Built-in, automatic deletion	Built-in via `USING TTL`
Latency	Single-digit ms	1-5ms with CL=ONE
Ops effort	None	Cluster management
Cost	Pay per request	Fixed infrastructure

For an interview, DynamoDB is the simpler answer. Mention Cassandra as an alternative for multi-cloud.

Pattern 2: "Design a Chat System Backend"

This is where Cassandra shines:

-- Messages by chat (primary read pattern)
-- message_id is a clustering column so two messages with the same
-- sent_at do not overwrite each other
CREATE TABLE messages_by_chat (
    chat_id UUID, bucket INT, sent_at TIMESTAMP, message_id UUID,
    sender_id UUID, body TEXT,
    PRIMARY KEY ((chat_id, bucket), sent_at, message_id)
) WITH CLUSTERING ORDER BY (sent_at DESC, message_id DESC);

-- Messages by user (secondary read pattern)
CREATE TABLE messages_by_sender (
    sender_id UUID, day DATE, sent_at TIMESTAMP, message_id UUID,
    chat_id UUID, body TEXT,
    PRIMARY KEY ((sender_id, day), sent_at, message_id)
) WITH CLUSTERING ORDER BY (sent_at DESC, message_id DESC);

Write to both tables on every message. Read from whichever matches the query. Time-bucketing prevents partition blowup.
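One way to derive the `bucket` column is to count fixed-width time windows from a chosen epoch. The 10-day width and the 2015 epoch below are assumptions picked for illustration; the right width depends on how many messages a busy chat produces per window.

```python
from datetime import datetime, timedelta, timezone

BUCKET_DAYS = 10                                   # assumed window width
EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)  # assumed starting point

def bucket_for(sent_at: datetime) -> int:
    """Bucket number for a message timestamp (timezone-aware)."""
    return (sent_at - EPOCH).days // BUCKET_DAYS

def buckets_between(start: datetime, end: datetime) -> list:
    """Buckets to query for a time range, newest first (matches sent_at DESC)."""
    return list(range(bucket_for(end), bucket_for(start) - 1, -1))

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("write to bucket", bucket_for(now))
    print("last 30 days:", buckets_between(now - timedelta(days=30), now))
```

A reader paging backwards through a chat queries the newest bucket first and moves to older buckets only when the current one is exhausted.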

DynamoDB would work too, but the 400KB item limit means you can't store message attachments inline, and GSIs for the secondary query pattern add cost and latency.

Pattern 3: "Your Boss Says Migrate from Cassandra to DynamoDB"

Red flags to raise in an interview:

DynamoDB has a 400KB item size limit. If any Cassandra rows exceed this, they need restructuring.
CQL queries don't map 1:1 to DynamoDB API calls. The migration isn't just swapping drivers.
DynamoDB's cost model is different. Provisioned vs on-demand. Run the numbers before committing.
GSIs are eventually consistent. If any Cassandra reads use LOCAL_QUORUM, the DynamoDB equivalent (strongly consistent read) doesn't work on GSIs.
No tunable consistency. DynamoDB is either eventual or strong. No QUORUM, no ONE, no ALL.

Trade-offs Table

Decision	Cassandra	DynamoDB
Operational burden	High — nodes, repairs, compaction, JVM	Zero — fully managed
Cost at high throughput	Lower (fixed infrastructure)	Higher (per-request pricing)
Cost at low throughput	Higher (minimum cluster size)	Lower (pay per request)
Vendor lock-in	None	Complete (AWS)
Latency predictability	Variable (GC pauses)	More consistent (though the SLA covers availability, not latency)
Multi-region	Manual but flexible	Global Tables (one click)
Consistency options	ONE, QUORUM, ALL, etc.	Eventual or Strong (binary)
Item size limit	2GB per column value (far less in practice)	400KB hard limit
Secondary indexes	Limited, slow	GSI (flexible, but extra cost)
Ecosystem integration	Generic (any cloud)	Deep AWS integration

Interview Gotchas

"When would you pick DynamoDB over Cassandra?"

When the team doesn't have Cassandra operations expertise and the application is AWS-native. DynamoDB's value is zero ops, not superior technology. The database that works because nobody has to babysit it beats the database that's theoretically better but crashes when the SRE is on vacation.

"What's the DynamoDB 400KB limit and how do you work around it?"

Each item (row) in DynamoDB cannot exceed 400KB. For larger payloads, store the metadata in DynamoDB and the payload in S3. Link them with the S3 key stored as an attribute. This is the standard pattern for documents, images, or any large blob.
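A minimal sketch of the pointer pattern, kept free of AWS calls so the decision logic is easy to see. The bucket name, key prefix, attribute names, and the choice to inline anything under half the item limit are all assumptions of this example.

```python
MAX_ITEM_BYTES = 400 * 1024

def build_item(doc_id: str, payload: bytes,
               inline_limit: int = MAX_ITEM_BYTES // 2,
               bucket: str = "example-bucket"):
    """Return (dynamodb_item, s3_upload), where s3_upload is None for small payloads.

    inline_limit leaves headroom because attribute names and other
    attributes also count toward the 400KB item size.
    """
    if len(payload) <= inline_limit:
        return {"doc_id": doc_id, "payload": payload}, None
    s3_key = f"documents/{doc_id}"
    item = {"doc_id": doc_id, "s3_bucket": bucket, "s3_key": s3_key,
            "size_bytes": len(payload)}
    return item, (bucket, s3_key, payload)

if __name__ == "__main__":
    item, upload = build_item("doc-1", b"x" * (300 * 1024))
    print(item, "upload needed:", upload is not None)
```

When `upload` is not None, write the object first (`s3.put_object(Bucket=..., Key=..., Body=...)`) and then the item (`table.put_item(Item=item)`), so a reader never finds a pointer to a missing object.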

"Can you do JOINs in either?"

No. Neither Cassandra nor DynamoDB supports JOINs. The workaround is denormalization — store the joined data together in one table. If you need JOINs, use a relational database. Trying to fake JOINs with multiple queries and application-side merging is a performance anti-pattern at scale.

"What's the biggest risk of migrating off DynamoDB?"

Lock-in isn't just the API. It's DynamoDB Streams feeding Lambdas, DAX caching, IAM-based access control, CloudWatch alarms, and point-in-time recovery. Moving to Cassandra means rebuilding all of that infrastructure. The database swap is 20% of the work. The ecosystem swap is 80%.

"How does DynamoDB handle hot partitions?"

DynamoDB has adaptive capacity — it automatically shifts throughput capacity from idle partitions to hot ones. But there's a limit. A single partition can handle up to 3,000 RCUs and 1,000 WCUs per second. Beyond that, requests get throttled even if the table has spare total capacity. The fix is distributing writes more evenly across partition keys. For example, appending a random suffix to the partition key: product#1234#shard3.
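The suffix trick can be sketched as two helpers: one picks a shard for each write, the other lists every shard so a reader can query them all and merge the results. The shard count of 10 and the helper names are assumptions for illustration.

```python
import random

NUM_SHARDS = 10  # assumed; size it to the write rate the hot key must absorb

def sharded_key(base_key: str, rng=random) -> str:
    """Partition key for a write: the base key plus a random shard suffix."""
    return f"{base_key}#shard{rng.randrange(NUM_SHARDS)}"

def all_shard_keys(base_key: str) -> list:
    """Every partition key a reader must query to see all writes for base_key."""
    return [f"{base_key}#shard{i}" for i in range(NUM_SHARDS)]

if __name__ == "__main__":
    print(sharded_key("product#1234"))
    print(all_shard_keys("product#1234"))
```

The cost moves to the read path: a read for one logical key becomes NUM_SHARDS queries, so this suits write-heavy keys that are read comparatively rarely.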

"What is ScyllaDB and when would you suggest it?"

ScyllaDB is a Cassandra-compatible database written in C++ instead of Java. Same CQL, same drivers, same data model. No garbage collection pauses. Discord migrated from Cassandra to ScyllaDB and saw p99 latency drop from 40-125ms to 15ms while using 60% fewer nodes. Suggest it when the team would choose Cassandra but can't tolerate Java GC latency spikes.

Key Takeaways

Concept	What to Remember
Cassandra = control + ops burden	Run anywhere, tune everything, manage everything
DynamoDB = convenience + lock-in	Zero ops, AWS only, pricing traps if you don't model costs
On-demand DynamoDB is ~9x pricier	Use provisioned + auto-scaling for steady workloads
400KB item limit	DynamoDB's hard constraint — offload large items to S3
ScyllaDB	Cassandra-compatible, C++, no GC pauses. Discord's migration case study.
Choose based on team capability	The best database is the one your team can operate reliably
Neither does JOINs	If you need relational queries, use a relational database