etcd, Consul, and Modern Alternatives

TL;DR

etcd is ZooKeeper reimagined in Go with Raft and a flat key-value model — and it won the Kubernetes era because it's simpler to operate, easier to understand, and speaks gRPC natively.

What etcd Actually Is

etcd is a distributed key-value store designed for shared configuration and service discovery. CoreOS built it in 2013 specifically for their Container Linux project, and Kubernetes adopted it as the single source of truth for all cluster state.

That Kubernetes connection isn't incidental. It's the reason etcd exists in most infrastructure today. Every pod, service, deployment, and config map in a Kubernetes cluster lives in etcd. When you run kubectl get pods, the API server answers from state stored in etcd. When you create a deployment, the API server writes it to etcd.

But etcd is not just "the Kubernetes database." It's a general-purpose coordination service. You can use it for leader election, distributed locks, configuration management — the same things you'd use ZooKeeper for, but with a cleaner API and a protocol (Raft) that more engineers actually understand.

etcd's Architecture

etcd clusters run 3 or 5 nodes, just like ZooKeeper. But the internals differ in important ways.

Client Request (gRPC)
        │
        ▼
    ┌────────┐
    │ Leader │  ← Writes go through leader (Raft)
    └────┬───┘
         │ AppendEntries
    ┌────┼────────────┐
    ▼    ▼            ▼
┌──────┐ ┌──────┐ ┌──────┐
│ F-1  │ │ F-2  │ │ F-3  │
└──────┘ └──────┘ └──────┘
    Each node:
    ├── Raft log (WAL)
    ├── boltdb (key-value storage)
    └── MVCC revision index

Key architectural decisions:

Raft consensus: every write is replicated through Raft. The leader proposes, followers acknowledge, leader commits. This is easier to reason about than ZAB because Raft was designed for understandability.
boltdb storage: each node stores data in boltdb (now bbolt), an embedded B+ tree key-value store. This means etcd data survives restarts.
MVCC (Multi-Version Concurrency Control): every modification bumps a store-wide, monotonically increasing revision number. Old versions are kept until the history is compacted. This enables watches and time-travel queries.
gRPC API: strongly typed, efficient binary protocol. No custom wire format to learn.
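
To make the "leader proposes, followers acknowledge, leader commits" step concrete, here is a minimal Go sketch of how a Raft leader could derive its commit index from per-node replication progress. It is a toy model, not etcd's code: the name commitIndex and the sample indexes are made up for illustration, and the current-term check that real Raft performs before committing is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index stored on a majority of nodes.
// matchIndex holds, for every node including the leader, the last log index
// known to be replicated there. Simplification: real Raft also requires the
// entry to belong to the leader's current term.
func commitIndex(matchIndex []int) int {
	if len(matchIndex) == 0 {
		return 0
	}
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// A majority of n nodes is n/2+1, so the majority-th largest value
	// sits at position n/2 in descending order.
	return sorted[len(sorted)/2]
}

func main() {
	// The leader has appended up to index 7; followers lag behind.
	fmt.Println(commitIndex([]int{7, 5, 3}))       // 3-node cluster
	fmt.Println(commitIndex([]int{7, 7, 4, 3, 2})) // 5-node cluster
}
```

In the 3-node case one slow follower does not hold back commits, and in the 5-node case two can lag, which is the practical reason clusters are sized at 3 or 5.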

The Key-Value Model

Unlike ZooKeeper's tree structure, etcd uses a flat key-value namespace. But keys are sorted lexicographically, so you can simulate hierarchy with prefixes.

# Set a key
etcdctl put /config/database/url "postgres://prod:5432"
etcdctl put /config/database/pool_size "20"
etcdctl put /config/cache/ttl "300"

# Get a key
etcdctl get /config/database/url
# /config/database/url
# postgres://prod:5432

# Get all keys with a prefix (simulates "directory listing")
etcdctl get /config/database/ --prefix
# /config/database/pool_size
# 20
# /config/database/url
# postgres://prod:5432

# Get all keys in the entire store
etcdctl get "" --prefix

The flat namespace with prefix queries is a design choice. It avoids the complexity of parent-child relationships and recursive watches. You don't need to create parent "directories" before creating a key. Just write /a/b/c/d directly.
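
A prefix query over a sorted flat keyspace is just a range scan. The Go sketch below shows one way to compute the range end for a prefix and scan a sorted slice with it. The helper names prefixEnd and scanPrefix and the extra /configs/other key are illustrative assumptions, and the empty-string "no upper bound" convention belongs to this sketch only. etcd's Go client ships a comparable helper, clientv3.GetPrefixRangeEnd, which may differ in detail.

```go
package main

import (
	"fmt"
	"sort"
)

// prefixEnd returns the smallest key that sorts after every key starting
// with prefix: bump the last byte that is not 0xff and drop what follows.
// An empty result means "no upper bound" in this sketch.
func prefixEnd(prefix string) string {
	end := []byte(prefix)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] < 0xff {
			end[i]++
			return string(end[:i+1])
		}
	}
	return ""
}

// scanPrefix returns the keys in [prefix, prefixEnd(prefix)) from a slice
// that is already sorted lexicographically.
func scanPrefix(sortedKeys []string, prefix string) []string {
	lo := sort.SearchStrings(sortedKeys, prefix)
	hi := len(sortedKeys)
	if end := prefixEnd(prefix); end != "" {
		hi = sort.SearchStrings(sortedKeys, end)
	}
	return sortedKeys[lo:hi]
}

func main() {
	keys := []string{
		"/config/cache/ttl",
		"/config/database/pool_size",
		"/config/database/url",
		"/configs/other", // hypothetical neighbour that must not match
	}
	fmt.Println(prefixEnd("/config/database/"))
	fmt.Println(scanPrefix(keys, "/config/database/"))
}
```

Keeping the trailing slash in the prefix matters: scanning for /config without it would also return /configs/other.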

Leases: etcd's Ephemeral Nodes

ZooKeeper has ephemeral nodes tied to sessions. etcd has leases — a more explicit mechanism.

A lease has a TTL. You grant a lease, attach keys to it, and keep refreshing it. If you stop refreshing, the lease expires, and all attached keys are deleted.

# Grant a 30-second lease
etcdctl lease grant 30
# lease 694d7550e9f21b2a granted with TTL(30s)

# Put a key with the lease attached
etcdctl put /services/api/server-1 "10.0.1.5:8080" --lease=694d7550e9f21b2a

# Keep the lease alive (sends heartbeats)
etcdctl lease keep-alive 694d7550e9f21b2a

# If keep-alive stops for 30 seconds, the key is deleted automatically

In application code (Go):

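// Grant a lease with a 30-second TTL
lease, err := client.Grant(ctx, 30)
if err != nil {
    log.Fatal(err)
}
// Put with the lease attached
if _, err := client.Put(ctx, "/services/api/server-1", "10.0.1.5:8080",
    clientv3.WithLease(lease.ID)); err != nil {
    log.Fatal(err)
}
// Keep alive: returns a channel and sends heartbeats in the background
ch, err := client.KeepAlive(ctx, lease.ID)
if err != nil {
    log.Fatal(err)
}
// Drain responses; the channel closes when the lease is lost or ctx ends
for range ch {
}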

One lease, many keys. You can attach multiple keys to a single lease. When the lease expires, all attached keys vanish. This is useful for service registration: one heartbeat keeps all of a server's registrations alive.
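
The "one lease, many keys" behaviour can be modelled in a few lines. The following Go sketch is a toy, single-process model and not how etcd implements leases: the type and method names, the logical clock in seconds, and the second registration under /services/metrics/ are all assumptions made for illustration.

```go
package main

import "fmt"

// lease tracks a TTL, the deadline it currently extends to, and its keys.
type lease struct {
	ttl      int64 // seconds
	deadline int64 // logical time in seconds at which the lease expires
	keys     []string
}

// store is a toy model; time is passed in explicitly so runs are deterministic.
type store struct {
	data   map[string]string
	leases map[int64]*lease
	nextID int64
}

func newStore() *store {
	return &store{data: map[string]string{}, leases: map[int64]*lease{}}
}

// grant creates a lease that expires ttl seconds after now.
func (s *store) grant(now, ttl int64) int64 {
	s.nextID++
	s.leases[s.nextID] = &lease{ttl: ttl, deadline: now + ttl}
	return s.nextID
}

// put writes a key and attaches it to an existing lease.
func (s *store) put(key, value string, leaseID int64) error {
	l, ok := s.leases[leaseID]
	if !ok {
		return fmt.Errorf("lease %d not found", leaseID)
	}
	s.data[key] = value
	l.keys = append(l.keys, key)
	return nil
}

// keepAlive pushes the deadline out by one TTL; it fails for unknown or
// already expired leases.
func (s *store) keepAlive(now, leaseID int64) bool {
	l, ok := s.leases[leaseID]
	if !ok || now >= l.deadline {
		return false
	}
	l.deadline = now + l.ttl
	return true
}

// expire drops every lease whose deadline has passed, along with its keys.
func (s *store) expire(now int64) {
	for id, l := range s.leases {
		if now < l.deadline {
			continue
		}
		for _, k := range l.keys {
			delete(s.data, k)
		}
		delete(s.leases, id)
	}
}

func main() {
	s := newStore()
	id := s.grant(0, 30)
	regs := [][2]string{
		{"/services/api/server-1", "10.0.1.5:8080"},
		{"/services/metrics/server-1", "10.0.1.5:9090"}, // hypothetical second registration
	}
	for _, r := range regs {
		if err := s.put(r[0], r[1], id); err != nil {
			panic(err)
		}
	}
	s.keepAlive(20, id) // one heartbeat at t=20s moves the deadline to t=50s
	s.expire(40)
	fmt.Println(len(s.data)) // both registrations still present
	s.expire(50)
	fmt.Println(len(s.data)) // no heartbeat since t=20s, so both are gone
}
```

A single keepAlive call extends every key on the lease, and a single missed deadline removes them all together.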

Spicy opinion: etcd's explicit lease model is better than ZooKeeper's session-tied ephemeral nodes. With ZooKeeper, you can't have different TTLs for different nodes created by the same client. With etcd, one client can hold multiple leases with different TTLs. More flexible, more predictable.

Watch Mechanism: Revision-Based

etcd's watches are fundamentally different from ZooKeeper's. They're revision-based and streaming, not one-time triggers.

# Watch a key — streams all changes
etcdctl watch /config/database/url

# In another terminal:
etcdctl put /config/database/url "postgres://new-db:5432"

# Watch output:
# PUT
# /config/database/url
# postgres://new-db:5432

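// Watch in Go: returns a channel of responses, each carrying a batch of events
watchChan := client.Watch(ctx, "/config/", clientv3.WithPrefix())
for resp := range watchChan {
    if err := resp.Err(); err != nil {
        log.Fatal(err) // e.g. the requested revision was compacted
    }
    for _, event := range resp.Events {
        fmt.Printf("%s %s = %s\n",
            event.Type, event.Kv.Key, event.Kv.Value)
    }
}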

Key differences from ZooKeeper watches:

Feature	ZooKeeper	etcd
Trigger	One-time, must re-register	Persistent stream
Granularity	Single znode or children	Key, prefix, or range
History	No replay	Can watch from a specific revision
Protocol	Custom	gRPC streaming

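The revision-based design is powerful. If your client disconnects and reconnects, it can say "give me everything since revision 15823" and catch up on missed events, as long as that revision has not been compacted away. ZooKeeper's one-time watches can miss changes that land between a trigger firing and the client re-registering the watch.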

// Resume watching from a specific revision
watchChan := client.Watch(ctx, "/config/",
    clientv3.WithPrefix(),
    clientv3.WithRev(15823))  // start at this revision (inclusive)
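
Why replay works, and when it stops working, is easier to see with a toy model. The Go sketch below keeps an in-memory event log with store-wide revisions. It is an illustration under assumptions, not etcd's implementation: the type and method names, the value "600", and the exact compaction boundary are choices made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

type event struct {
	rev   int64
	key   string
	value string
}

// history is a toy MVCC log: every write gets the next store-wide revision.
type history struct {
	events    []event
	rev       int64 // latest revision handed out
	compacted int64 // events below this revision have been discarded
}

var errCompacted = errors.New("requested revision has been compacted")

// put appends a write and returns the revision assigned to it.
func (h *history) put(key, value string) int64 {
	h.rev++
	h.events = append(h.events, event{rev: h.rev, key: key, value: value})
	return h.rev
}

// since returns every event with revision >= from, like a watch that
// starts at a given revision.
func (h *history) since(from int64) ([]event, error) {
	if from < h.compacted {
		return nil, errCompacted
	}
	var out []event
	for _, e := range h.events {
		if e.rev >= from {
			out = append(out, e)
		}
	}
	return out, nil
}

// compact discards all events below rev.
func (h *history) compact(rev int64) {
	kept := h.events[:0]
	for _, e := range h.events {
		if e.rev >= rev {
			kept = append(kept, e)
		}
	}
	h.events = kept
	h.compacted = rev
}

func main() {
	h := &history{}
	h.put("/config/database/url", "postgres://prod:5432")
	lastSeen := h.put("/config/cache/ttl", "300") // last event the client handled
	// The next two writes happen while the client is disconnected.
	h.put("/config/database/url", "postgres://new-db:5432")
	h.put("/config/cache/ttl", "600")

	// Resume from lastSeen+1 so the already handled event is not replayed.
	missed, err := h.since(lastSeen + 1)
	if err != nil {
		panic(err)
	}
	for _, e := range missed {
		fmt.Printf("rev=%d %s = %s\n", e.rev, e.key, e.value)
	}

	h.compact(4)
	_, err = h.since(lastSeen + 1)
	fmt.Println(err) // replay is no longer possible from revision 3
}
```

When a client hits the compacted case, the usual recovery is to re-read the current state with a prefix get and start a fresh watch from the revision that read returned.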

etcd vs ZooKeeper: The Full Comparison

Dimension	etcd	ZooKeeper
Consensus	Raft	ZAB
Language	Go	Java
API	gRPC + REST gateway	Custom binary protocol
Data model	Flat KV with prefix queries	Hierarchical tree (znodes)
Watches	Streaming, revision-based	One-time triggers
Ephemeral	Lease-based TTL	Session-tied
Max value size	1.5 MiB request limit (configurable)	1 MB by default (jute.maxbuffer)
Auth	RBAC + TLS client certs	ACLs per znode
Ecosystem	Kubernetes, CoreDNS, Vitess	Kafka (pre-KRaft), Hadoop, HBase
Operations	Single binary, simpler GC tuning	JVM tuning, more knobs

When etcd wins: new infrastructure, Go ecosystem, Kubernetes-native, teams that value operational simplicity.

When ZooKeeper wins: existing Hadoop/Kafka ecosystem, teams with deep Java expertise, need for hierarchical data model.

Consul: The Swiss Army Knife

HashiCorp Consul is a different beast. It bundles service discovery, health checking, KV store, and service mesh into one tool. Where etcd and ZooKeeper are coordination primitives, Consul is an opinionated platform.

# Register a service with health check
curl -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "Name": "payment-api",
  "Port": 8080,
  "Check": {
    "HTTP": "http://localhost:8080/health",
    "Interval": "10s"
  }
}'

# Discover healthy instances
curl "http://localhost:8500/v1/health/service/payment-api?passing"

# KV store
curl -X PUT http://localhost:8500/v1/kv/config/db_url \
  -d "postgres://prod:5432"

What Consul adds over etcd/ZooKeeper:

Built-in health checking: HTTP, TCP, gRPC, and script-based checks. No external monitoring needed for service liveness.
DNS interface: services are queryable via DNS (payment-api.service.consul). Clients don't need a Consul SDK.
Service mesh (Connect): automatic mTLS between services, intentions-based authorization.
Multi-datacenter: first-class WAN federation. etcd and ZooKeeper are single-datacenter by design.
Prepared queries: load balancing logic on the server side, including geo-failover.

When Consul: you need service discovery with health checking and don't want to build it yourself. The KV store is a bonus, not the main attraction.
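
On the client side, discovery mostly comes down to turning the health query's JSON into a list of addresses. The Go sketch below decodes a heavily trimmed, hypothetical response body and keeps only the fields it needs. The struct, the function name endpoints, and the sample IPs are assumptions for illustration. A real response carries many more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
)

// serviceEntry mirrors only the fields this sketch reads from each element
// of a /v1/health/service/<name> response.
type serviceEntry struct {
	Node struct {
		Address string
	}
	Service struct {
		Address string
		Port    int
	}
}

// endpoints turns a health response body into host:port strings.
func endpoints(body []byte) ([]string, error) {
	var entries []serviceEntry
	if err := json.Unmarshal(body, &entries); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(entries))
	for _, e := range entries {
		host := e.Service.Address
		if host == "" {
			// The service was registered without its own address,
			// so fall back to the address of the node it runs on.
			host = e.Node.Address
		}
		out = append(out, net.JoinHostPort(host, strconv.Itoa(e.Service.Port)))
	}
	return out, nil
}

func main() {
	// Hypothetical, trimmed response body for illustration only.
	body := []byte(`[
	  {"Node": {"Address": "10.0.1.5"}, "Service": {"Address": "", "Port": 8080}},
	  {"Node": {"Address": "10.0.1.6"}, "Service": {"Address": "10.0.2.6", "Port": 8080}}
	]`)
	addrs, err := endpoints(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```

Because the query was made with ?passing, every entry returned is already healthy and the client does no health filtering of its own.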

Spicy opinion: Consul tries to do too many things. Teams that adopt it for KV-based coordination often end up fighting its service mesh features, and teams that want a service mesh often find Istio or Linkerd more capable. The sweet spot is when you genuinely need service discovery + health checks + KV in one binary with multi-DC support.

KRaft: Kafka Ditching ZooKeeper

Kafka's biggest operational headache has always been ZooKeeper. Running Kafka means running two distributed systems — Kafka brokers and a ZooKeeper ensemble. Double the monitoring. Double the failure modes. Double the on-call pages.

KRaft (KIP-500, declared production-ready in Kafka 3.3) eliminates ZooKeeper entirely. Kafka brokers manage their own metadata using an internal Raft-based consensus protocol.

Before KRaft:
┌──────────────┐     ┌────────────────┐
│ Kafka Broker │────▶│   ZooKeeper    │
│ Kafka Broker │────▶│   Ensemble     │
│ Kafka Broker │────▶│   (3-5 nodes)  │
└──────────────┘     └────────────────┘

After KRaft:
┌────────────────────────────┐
│ Kafka Broker (Controller)  │  ← elected via Raft
│ Kafka Broker (Follower)    │
│ Kafka Broker (Follower)    │
└────────────────────────────┘
No external dependency.

What KRaft changes:

Fewer moving parts: one system instead of two.
Faster metadata operations: standby controllers already replicate the metadata log, so a controller failover no longer has to reload cluster state from ZooKeeper.
Higher partition limits: ZooKeeper was the bottleneck for clusters with millions of partitions. KRaft removes that ceiling.
Simpler operations: no more ZooKeeper session timeouts causing phantom broker de-registrations.

Confluent migrated their entire cloud platform to KRaft. LinkedIn has been running KRaft in production since 2023. The migration path is well-tested at this point.

Choosing Between Them

Here's the decision tree:

Need coordination?
├── Using Kubernetes? → etcd (you already have it)
├── Using Kafka? → KRaft (if Kafka 3.3+) or ZooKeeper (legacy)
├── Existing Hadoop/HBase? → ZooKeeper
├── Need service discovery + health checks? → Consul
├── Building new infrastructure?
│   ├── Go ecosystem → etcd
│   ├── Java ecosystem → ZooKeeper
│   └── Multi-DC → Consul
└── For system design interviews → just say "coordination service
    like ZooKeeper or etcd" unless specifically asked which one

Patterns for System Design Interviews

Pattern 1: Leader election. "Use etcd/ZooKeeper for leader election." Mention the mechanism briefly (ephemeral sequential nodes for ZK, lease-based campaigning for etcd). Don't spend 5 minutes explaining the protocol.

Pattern 2: Service discovery. "Services register with Consul/etcd on startup, deregister on shutdown. Clients query the registry to find instances." This replaces hardcoded IP addresses and static config files.

Pattern 3: Distributed configuration. "Store feature flags / rate limits / database URLs in etcd. Services watch for changes and apply them without restarts."

Pattern 4: Cluster membership. "Each node holds a lease in etcd. The orchestrator watches the key prefix to know which nodes are alive."
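
For Pattern 1, the core idea behind both the ZooKeeper and etcd recipes is "oldest surviving registration wins". The Go sketch below models that with create revisions. It is a toy under assumptions, not either system's library code: the type and method names and the candidate names are invented for illustration, and lease expiry is reduced to an explicit resign call.

```go
package main

import "fmt"

// election is a toy model: each candidate's key remembers the store
// revision at which it was created, and the oldest surviving key leads.
type election struct {
	rev        int64
	candidates map[string]int64 // candidate name -> create revision
}

func newElection() *election {
	return &election{candidates: map[string]int64{}}
}

// campaign registers a candidate. In a real system the key would be
// attached to a lease so that a crashed candidate disappears on its own.
func (e *election) campaign(name string) {
	if _, ok := e.candidates[name]; ok {
		return // already campaigning; keep the original revision
	}
	e.rev++
	e.candidates[name] = e.rev
}

// resign removes a candidate, as a clean shutdown or lease expiry would.
func (e *election) resign(name string) {
	delete(e.candidates, name)
}

// leader returns the candidate with the lowest create revision, if any.
func (e *election) leader() (string, bool) {
	var best string
	var bestRev int64
	found := false
	for name, rev := range e.candidates {
		if !found || rev < bestRev {
			best, bestRev, found = name, rev, true
		}
	}
	return best, found
}

func main() {
	e := newElection()
	e.campaign("server-1")
	e.campaign("server-2")
	e.campaign("server-3")
	fmt.Println(e.leader())

	e.resign("server-1") // leader's lease expires
	fmt.Println(e.leader())

	e.campaign("server-1") // rejoins at the back of the queue
	fmt.Println(e.leader())
}
```

A candidate that rejoins gets a newer revision, so leadership does not flap back to it while an older registration is still alive.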

Trade-offs Table

Dimension	etcd	ZooKeeper	Consul
Simplicity	Single binary, easy ops	JVM tuning needed	Most complex (many features)
Protocol	Raft (well understood)	ZAB (less common)	Raft (Consul's own impl)
Watch model	Streaming + replay	One-time triggers	Blocking queries (long poll)
Service discovery	Manual (build it yourself)	Manual (build it yourself)	Built-in with health checks
Multi-DC	Not built-in	Not built-in	First-class support
Ecosystem	Kubernetes-native	Hadoop/Kafka legacy	HashiCorp stack
Max data size	~2-8 GB total (recommended)	~hundreds of MB	~hundreds of MB

Interview Gotchas

Gotcha 1: "Can I use etcd as a general-purpose database?" No. etcd is designed for small metadata — cluster state, configuration, service registry. The recommended max data size is a few GB. For actual data storage, use a real database.

Gotcha 2: "How does etcd handle network partitions?" etcd is a CP system. During a partition, the minority side loses quorum and stops accepting writes. The majority side continues normally. When the partition heals, the minority catches up via Raft log replication.
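
The partition answer reduces to one line of arithmetic. Here is a small Go sketch of the majority check that decides which side keeps accepting writes. The function name hasQuorum and the example splits are illustrative assumptions.

```go
package main

import "fmt"

// hasQuorum reports whether a group of mutually reachable voting members
// forms a strict majority of the cluster and can therefore commit writes.
func hasQuorum(clusterSize, reachable int) bool {
	return reachable >= clusterSize/2+1
}

func main() {
	// 5-node cluster split 3/2: only the 3-node side keeps accepting writes.
	fmt.Println(hasQuorum(5, 3), hasQuorum(5, 2))
	// 4-node cluster split 2/2: neither side has a majority.
	fmt.Println(hasQuorum(4, 2))
}
```

The even split is why odd cluster sizes are preferred: a fourth node adds replication cost without letting the cluster survive any more failures than three nodes do.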

Gotcha 3: "Why not just use Redis for coordination?" Redis isn't designed for consensus. Even with Redis Sentinel or Cluster, there's no guarantee that a key you wrote is replicated before a failover. etcd and ZooKeeper provide linearizable writes — if a write is acknowledged, it's durable on a majority of nodes.

Gotcha 4: "What's the difference between etcd watches and Kafka consumer groups?" Different tools for different problems. etcd watches notify you when metadata changes (config updates, service registrations). Kafka consumer groups process a stream of events. You wouldn't use etcd to process millions of events per second, and you wouldn't use Kafka to store your cluster's configuration.

Gotcha 5: "Should I use ZooKeeper or etcd for a new project?" etcd, almost always. It's simpler to operate, uses Raft (which more engineers understand), has a cleaner API, and the Go ecosystem is more active. The only exception is if you're deeply embedded in the Java/Hadoop ecosystem, where ZooKeeper integration is already built.