Data Structures and When to Use Each

TL;DR

Redis isn't a key-value store — it's a data structure server that happens to keep everything in memory. Picking the right data structure is the difference between a clean O(1) lookup and a hacky mess that falls apart at scale.

Overview of Redis data types and their relationships

What Redis Actually Is

Most people call Redis a cache. That undersells it badly. Redis is an in-memory data structure server. The "data structure" part matters more than the "in-memory" part.

Think of it this way: your application already uses hashmaps, sorted trees, and queues. Redis gives you those same data structures, but shared across every server in your fleet, surviving process restarts, and operating at microsecond latency. That's the real value proposition.

Twitter uses Redis to serve its timeline. Snapchat uses it for rate limiting. GitHub used it to back their job queue for years. The common thread? Each one picked the right Redis data structure for their problem. Not strings for everything — the right one.

Strings — The Deceptive Simple One

Strings are the default. When people say "key-value store," they're thinking of strings. A key maps to a blob of bytes — text, JSON, a serialized object, a number. Simple.

But strings have a trick that makes them far more useful than they look: atomic operations.

# Basic set/get
SET user:1001:name "Alice"
GET user:1001:name
# => "Alice"

# Set with expiration — gone after 3600 seconds
SET session:abc123 "{\"user_id\":1001}" EX 3600

# Atomic counter — no race conditions, ever
INCR page:views:/home
INCR page:views:/home
GET page:views:/home
# => "2"

# Increment by arbitrary amount
INCRBY inventory:sku:9912 -1   # decrement stock

# Distributed lock — the single most important Redis one-liner
SET lock:order:5543 "worker-7" NX EX 30
# NX = only set if key doesn't exist
# EX = auto-expire after 30 seconds
# Returns OK if lock acquired, nil if someone else holds it

That SET ... NX EX pattern? You'll use it in almost every system design interview that involves concurrency. It's an atomic test-and-set with a built-in lease timeout. No separate "check then set" — one round trip, no race condition.

When to use strings:

Session tokens (set with TTL, fast lookup)
Atomic counters (page views, inventory, rate limit windows)
Distributed locks (SET NX EX)
Caching serialized objects (JSON blobs, protobuf bytes)
Feature flags (simple on/off, GET returns "1" or nil)

When NOT to use strings:

If you need to update a single field inside a JSON blob, you're doing a full GET, deserialize, modify, serialize, SET cycle. That's what hashes are for.
If you're storing a collection of items, you're just reinventing a list or set badly.

Hashes — Object Storage Done Right

A hash is a mini key-value store inside a key. Think of it as a row in a database table, where each field is a column.

# Store a user profile
HSET user:1001 name "Alice" email "alice@example.com" plan "premium" login_count 42

# Get one field — NOT the whole object
HGET user:1001 plan
# => "premium"

# Get multiple fields
HMGET user:1001 name email
# => ["Alice", "alice@example.com"]

# Update just one field — no read-modify-write cycle
HSET user:1001 login_count 43

# Increment a numeric field atomically
HINCRBY user:1001 login_count 1

# Get everything
HGETALL user:1001
# => {"name": "Alice", "email": "alice@example.com", "plan": "premium", "login_count": "43"}

Why not just store the user as a JSON string? Two reasons.

Partial updates. With a string, updating login_count means: GET the JSON, parse it, change one field, serialize it, SET it back. With a hash, it's a single HINCRBY command. No race condition if two servers increment simultaneously.

Memory efficiency. Redis uses a compact encoding called ziplist (renamed listpack in Redis 7) for small hashes. A hash with fewer than 128 fields and values under 64 bytes each uses dramatically less memory than the equivalent JSON string. Instagram's engineering blog demonstrated significant memory savings by packing small key-value pairs into hashes to exploit Redis's ziplist encoding — avoiding per-key overhead.

When to use hashes:

User profiles, product details, any "entity with fields"
Counters grouped by object (HINCRBY for individual fields)
Configuration objects where you read/write individual settings
Session data with multiple attributes

Gotcha

Don't use hashes as a general-purpose secondary index. HGET is O(1), but scanning all hashes to find "users where plan = premium" requires iterating every key. That's what sets and sorted sets are for.

Lists — Queues and Activity Feeds

A list is a doubly-linked list of strings. Push to either end in O(1), pop from either end in O(1). Random access by index is O(N) — avoid it.

# Message queue pattern: producer pushes left, consumer pops right
LPUSH queue:emails "msg:1001"
LPUSH queue:emails "msg:1002"
RPOP queue:emails
# => "msg:1001"  (FIFO order)

# Blocking pop — consumer waits until something appears
BRPOP queue:emails 30
# Blocks for up to 30 seconds. Returns immediately if an item exists.

# Recent activity feed — keep last 100 items
LPUSH feed:user:1001 "liked post 5543"
LPUSH feed:user:1001 "commented on post 7721"
LTRIM feed:user:1001 0 99
# LTRIM chops the list to elements 0-99, discarding everything older

# Read the feed
LRANGE feed:user:1001 0 9
# => latest 10 activities

The LPUSH + LTRIM combo is a pattern you'll see everywhere. It creates a capped list — a fixed-size window of recent items that never grows unboundedly. GitHub uses this pattern for recent push events on repositories.

When to use lists:

Simple job queues (LPUSH + BRPOP)
Recent activity feeds with a fixed window
Chat message history (latest N messages)
Undo stacks, LIFO buffers

When to skip lists:

If you need priority ordering, use sorted sets instead.
If you need consumer groups (multiple consumers, at-least-once delivery), use streams. Lists have no acknowledgment mechanism — once you RPOP, the item is gone. If your worker crashes before processing it, it's lost.
If the list grows unboundedly and you forget LTRIM, you have a memory leak.

Sets — Uniqueness Without Thinking About It

A set is an unordered collection of unique strings. Adding the same element twice is a no-op. O(1) add, remove, and membership check.

# Track unique visitors
SADD page:visitors:/pricing "user:1001"
SADD page:visitors:/pricing "user:1002"
SADD page:visitors:/pricing "user:1001"   # duplicate, ignored
SCARD page:visitors:/pricing
# => 2

# Tag system
SADD tags:post:5543 "redis" "database" "caching"
SADD tags:post:7721 "redis" "performance"

# Find posts tagged with both "redis" AND "database"
SINTER tags:post:5543 tags:post:7721
# => ["redis"]

# Social features: mutual friends
SADD friends:alice "bob" "charlie" "dave"
SADD friends:bob "alice" "charlie" "eve"
SINTER friends:alice friends:bob
# => ["charlie"]  — mutual friends

# Union: all unique friends across both users
SUNION friends:alice friends:bob
# => ["bob", "charlie", "dave", "alice", "eve"]

# "Has the user already liked this post?"
SISMEMBER post:5543:likes "user:1001"
# => 1 (true) or 0 (false), O(1)

The killer feature of sets is set operations: SINTER (intersection), SUNION (union), SDIFF (difference). These run server-side in Redis, avoiding round trips. When LinkedIn shows "You and Alice have 12 mutual connections," that's conceptually an SINTER.

When to use sets:

Unique visitor tracking (daily, hourly)
Tag systems with intersection queries
Social graph operations (mutual friends, followers in common)
Deduplication (has this event been processed?)
Online/presence tracking (SADD when user connects, SREM when they disconnect)

When sets are the wrong choice:

If you need ranked results, use sorted sets.
If you need approximate counts of massive sets (millions of members), use HyperLogLog — it uses 12KB instead of megabytes.

Sorted Sets — The Interview Favorite

Sorted sets (zsets) are sets where every member has a floating-point score. Members are unique; scores don't have to be. Redis keeps them sorted by score automatically, using a skip list internally.

This is the data structure that shows up in system design interviews more than any other.

# Leaderboard
ZADD leaderboard 1500 "player:alice"
ZADD leaderboard 2300 "player:bob"
ZADD leaderboard 1800 "player:charlie"
ZADD leaderboard 2100 "player:dave"

# Top 3 players (highest scores first)
ZREVRANGE leaderboard 0 2 WITHSCORES
# => ["player:bob", "2300", "player:dave", "2100", "player:charlie", "1800"]

# "What rank am I?" — O(log N)
ZREVRANK leaderboard "player:charlie"
# => 2  (0-indexed, so 3rd place)

# Update a score atomically
ZINCRBY leaderboard 500 "player:alice"
# Alice is now at 2000

# Sliding window rate limiter
# Key: rate:user:1001, Score: timestamp, Member: unique request ID
ZADD rate:user:1001 1713400000 "req:aaa"
ZADD rate:user:1001 1713400001 "req:bbb"
ZADD rate:user:1001 1713400002 "req:ccc"

# Remove entries older than 60 seconds
ZREMRANGEBYSCORE rate:user:1001 0 1713399940

# Count remaining entries — if > threshold, reject
ZCARD rate:user:1001
# => 3 (within the 60-second window)

The sliding window rate limiter pattern is worth memorizing. It's cleaner than the fixed-window INCR approach because it doesn't have the "boundary burst" problem where a user can fire 2x the limit by timing requests around the window edge.

When to use sorted sets:

Leaderboards and rankings (gaming, sales, anything competitive)
Priority queues (score = priority, ZPOPMIN to dequeue)
Rate limiting with sliding windows
Time-series indexes (score = timestamp, range queries with ZRANGEBYSCORE)
Autocomplete (score = frequency, ZRANGEBYLEX for prefix matching)

Interview Tip

If an interviewer asks "how would you build a leaderboard for 10 million players," start with ZADD/ZREVRANK. Mention that ZREVRANK is O(log N) — so even with 10M members, it takes ~23 comparisons. Then discuss sharding if the interviewer pushes further.

Streams — Kafka's Little Brother

Streams landed in Redis 5.0 and they're criminally underused. A stream is an append-only log with consumer groups — basically Kafka in miniature.

# Producer appends events
XADD orders * user_id 1001 item "laptop" amount 999.99
# => "1713400000000-0"  (auto-generated ID: timestamp-sequence)

XADD orders * user_id 1002 item "mouse" amount 29.99
# => "1713400000001-0"

# Simple read — last 10 entries
XRANGE orders - + COUNT 10

# Consumer group — multiple consumers, at-least-once delivery
XGROUP CREATE orders order-processors $ MKSTREAM

# Consumer "worker-1" reads new messages
XREADGROUP GROUP order-processors worker-1 COUNT 5 BLOCK 2000 STREAMS orders >

# After processing, acknowledge
XACK orders order-processors "1713400000000-0"

# Check unacknowledged messages (pending entries)
XPENDING orders order-processors

I'd pick Redis Streams over a full Kafka deployment when you have low-to-moderate throughput (under 50K messages/sec) and don't want to operate a separate Kafka cluster. Streams give you consumer groups, acknowledgment, and replay — the three things you actually need from a message broker. For a startup processing 1,000 orders per minute, running Kafka is like driving a semi truck to the grocery store.

When to use streams:

Event sourcing at small-to-medium scale
Task queues that need acknowledgment and retry
Activity logs with consumer groups
When you already have Redis and don't want another infrastructure dependency

When to use Kafka instead:

Throughput above 100K messages/sec
Need multi-day retention
Need exactly-once semantics (Kafka has it, Redis Streams doesn't)
Cross-datacenter replication

HyperLogLog — Count Billions in 12KB

HyperLogLog (HLL) is a probabilistic data structure that estimates the number of unique elements in a set. It uses a fixed 12KB of memory regardless of how many elements you add — whether it's 1,000 or 1 billion.

The trade-off? It's approximate. The standard error is 0.81%.

# Count unique visitors across an entire day
PFADD visitors:2024-04-18 "user:1001"
PFADD visitors:2024-04-18 "user:1002"
PFADD visitors:2024-04-18 "user:1001"  # duplicate, doesn't affect count

PFCOUNT visitors:2024-04-18
# => 2

# Merge multiple days for weekly count
PFMERGE visitors:week:16 visitors:2024-04-14 visitors:2024-04-15 visitors:2024-04-16 visitors:2024-04-17 visitors:2024-04-18

PFCOUNT visitors:week:16
# => approximate unique visitors across the week (deduplicated!)

To put the memory savings in perspective: if you tracked 10 million unique visitors using a set (SADD), you'd need roughly 400-600MB. With HyperLogLog? 12KB. Same answer, give or take 0.81%.

Google originally published the HyperLogLog paper, and they use variants of it internally for query cardinality estimation in BigQuery and Dremel.

When to use HyperLogLog:

Unique visitor counts (daily, weekly, monthly)
Distinct IP address counting for DDoS detection
Cardinality estimation for analytics dashboards
Any "how many unique X" question where you can tolerate ~1% error

When NOT to use HyperLogLog:

When you need exact counts (financial transactions, billing)
When you need to know which elements are in the set (HLL only gives you the count, not the members)

Bitmaps — One Bit Per User

Bitmaps aren't a separate data type — they're string operations that treat the string as a bit array. Each bit position maps to a user ID or entity, and you flip bits on or off.

# Track daily active users: bit position = user ID
SETBIT dau:2024-04-18 1001 1   # user 1001 was active
SETBIT dau:2024-04-18 1002 1   # user 1002 was active
SETBIT dau:2024-04-18 1003 1   # user 1003 was active

# Was user 1002 active today?
GETBIT dau:2024-04-18 1002
# => 1

# How many users were active today?
BITCOUNT dau:2024-04-18
# => 3

# Users active on BOTH Monday and Tuesday
BITOP AND active-both dau:2024-04-15 dau:2024-04-16
BITCOUNT active-both

# Feature flags: bit 0 = dark mode, bit 1 = beta, bit 2 = new-checkout
SETBIT features:user:1001 0 1   # enable dark mode
SETBIT features:user:1001 2 1   # enable new checkout
GETBIT features:user:1001 1     # is beta enabled?
# => 0

Memory math: 10 million users = 10 million bits = 1.25 MB. Storing the same data as a set of user IDs would cost 80+ MB. Bitmaps win when your user IDs are dense integers.

When to use bitmaps:

Daily/weekly/monthly active user tracking
Feature flag matrices (users x features)
Attendance or login streak tracking
Bloom-filter-style membership testing (with some false positive tolerance)

When bitmaps fail:

Sparse user IDs. If your user IDs range from 1 to 1 billion but you only have 10,000 users, you'd allocate 125 MB for a nearly empty bit array. Use a set instead.

The Decision Table

Decision tree for choosing the right Redis data structure

Here's the cheat sheet. In a system design interview, scan the problem for these signals:

Problem Signal	Redis Data Structure	Key Command	Time Complexity
"Cache this object"	String	SET/GET EX	O(1)
"Atomic counter"	String	INCR/INCRBY	O(1)
"Distributed lock"	String	SET NX EX	O(1)
"Store entity with fields"	Hash	HSET/HGET	O(1)
"Partial field update"	Hash	HSET single field	O(1)
"Job queue"	List or Stream	LPUSH/BRPOP or XADD/XREADGROUP	O(1)
"Recent N items"	List	LPUSH + LTRIM	O(1) push, O(N) trim
"Unique members"	Set	SADD/SISMEMBER	O(1)
"Mutual friends / tags in common"	Set	SINTER	O(N*M)
"Leaderboard / ranking"	Sorted Set	ZADD/ZREVRANK	O(log N)
"Sliding window rate limit"	Sorted Set	ZADD + ZREMRANGEBYSCORE + ZCARD	O(log N)
"Priority queue"	Sorted Set	ZADD + ZPOPMIN	O(log N)
"Event log with consumers"	Stream	XADD/XREADGROUP/XACK	O(1) append
"Approximate unique count"	HyperLogLog	PFADD/PFCOUNT	O(1)
"Daily active users"	Bitmap	SETBIT/BITCOUNT	O(1) set, O(N) count
"Feature flags per user"	Bitmap	SETBIT/GETBIT	O(1)

Trade-offs at a Glance

Data Structure	Memory Efficiency	Query Flexibility	Best For	Avoid When
String	Medium	Low (key lookup only)	Simple cache, counters, locks	Need partial updates
Hash	High (ziplist encoding)	Medium (field-level access)	Entity storage, profiles	Need secondary indexes
List	Medium	Low (head/tail ops only)	Queues, recent items	Need random access, deduplication
Set	Medium-High	High (set operations)	Uniqueness, social graphs	Ordered results needed
Sorted Set	Low (skip list overhead)	High (range + rank queries)	Leaderboards, rate limiters	Simple membership tests
Stream	Medium	High (consumer groups, ranges)	Event logs, task queues	Need >100K msg/sec
HyperLogLog	Minimal (12 KB fixed)	Minimal (count only)	Cardinality estimation	Need exact counts or members
Bitmap	Excellent (1 bit per entity)	Low (bit operations)	Dense boolean arrays	Sparse ID spaces

Interview Gotchas

"Why not just use strings for everything?" Because you lose atomic partial updates. Updating one field of a JSON string requires GET-deserialize-modify-serialize-SET — not atomic, not efficient. Hashes give you field-level atomicity.

"Sorted sets use skip lists, not balanced trees. Why?" Antirez (Redis creator) chose skip lists because they're simpler to implement concurrently, have similar O(log N) performance, and range queries are trivial — just follow forward pointers. B-trees would be faster for on-disk storage, but Redis is in-memory, so skip lists win on simplicity.

"Can Redis Streams replace Kafka?" For small-to-medium workloads, absolutely. For anything above ~50K messages/sec, or when you need multi-day retention and cross-datacenter replication, Kafka is the right tool. Redis Streams also don't support exactly-once delivery — it's at-least-once only.

"How much memory does a sorted set with 1 million members use?" Roughly 80-120 MB, depending on member size. Each entry in a skip list has multiple forward pointers plus the score (8 bytes). If memory is tight, consider whether you really need all 1M members in memory.

Key Takeaways

Concept	What to Remember
Redis is a data structure server	Not just a cache — pick the right structure for the job
Strings + NX + EX	The atomic distributed lock primitive — memorize this
Hashes save memory	Ziplist encoding for small hashes is dramatically more compact than JSON strings
Lists need LTRIM	Without LTRIM, lists grow forever. Always cap them.
Sets for uniqueness	SINTER/SUNION run server-side. Great for social features.
Sorted sets for ranking	O(log N) rank queries. The go-to for leaderboards and rate limiters.
Streams over Lists for queues	Consumer groups, acknowledgment, and replay. Lists have none of that.
HyperLogLog for massive cardinality	12KB for billions of unique elements. ~0.81% error.
Bitmaps for dense booleans	1 bit per user. But only when IDs are dense integers.

Interview Tip

When the interviewer says "design a rate limiter" or "build a leaderboard," don't jump straight to the architecture diagram. Start with: "I'd use a Redis sorted set because..." and explain the data structure choice first. Showing you know why sorted sets work for ranking (O(log N) rank lookup, automatic ordering, range queries) signals deeper understanding than just naming the tool.