ZooKeeper — Ephemeral Nodes, Watches, and Leader Election

TL;DR

ZooKeeper is a distributed filesystem of tiny nodes that appear and disappear based on client connections — and that single trick (ephemeral nodes) is the foundation for leader election, distributed locks, and cluster membership.

What ZooKeeper Actually Is

Leader Election

ZooKeeper is not a database. It's not a message queue. It's a coordination service — a small, strongly consistent data store designed to hold metadata that distributed systems need to agree on.

Think of it as a shared filesystem in memory. Directories and files, but tiny. Every node in your cluster can read the same tree, get notifications when things change, and create entries that vanish when the creator dies. That's it. Three primitives. But those three primitives solve most distributed coordination problems.

Yahoo built ZooKeeper internally around 2007. Apache Kafka depended on it for years. Hadoop relies on it. HBase, Solr, and dozens of other systems use it as their coordination backbone. When you need multiple machines to agree on who's in charge, ZooKeeper is the classic answer.

The Data Model: A Tree of Znodes

ZooKeeper's data model looks like a Unix filesystem. Every entry is called a znode. Each znode has a path, holds data, and can have children.

/
├── /config
│   ├── /config/database_url       → "postgres://prod-db:5432"
│   └── /config/feature_flags      → '{"dark_mode": true}'
├── /election
│   ├── /election/node_0000000001  → "server-a"
│   └── /election/node_0000000002  → "server-b"
└── /locks
    └── /locks/order-processing    → "worker-3"

Key properties of znodes:

Data limit: 1 MB per znode. This is intentional. ZooKeeper stores everything in memory and replicates to every node in the ensemble. If you're storing megabytes, you're using the wrong tool.
Version numbers: Every znode has a version that increments on update. This enables compare-and-swap operations.
ACLs: Each znode has access control. Production ensembles should never run with open ACLs.
Children are ordered: You can list children in creation order, which matters for leader election.

// Create a znode
zk.create("/config/db_url", "postgres://db:5432".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

// Read a znode
byte[] data = zk.getData("/config/db_url", false, null);
String url = new String(data);  // "postgres://db:5432"

// Update with version check (optimistic concurrency)
Stat stat = zk.exists("/config/db_url", false);
zk.setData("/config/db_url", "postgres://new-db:5432".getBytes(),
           stat.getVersion());  // fails if version changed

Ephemeral Nodes: The Primitive That Matters

Regular znodes persist until explicitly deleted. Ephemeral znodes are different. They vanish when the client session ends.

This is the single most important concept in ZooKeeper. Everything else builds on it.

When a server connects to ZooKeeper, it establishes a session. As long as the session is alive (heartbeats flowing), any ephemeral znodes created by that client exist. If the server crashes, loses network, or simply disconnects — the session expires and all its ephemeral nodes disappear automatically.

// Server-A connects and creates an ephemeral node
zk.create("/services/payment/server-a", "192.168.1.10:8080".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

// If Server-A crashes, after session timeout (typically 6-30s),
// the node /services/payment/server-a is automatically deleted.
// Other servers watching this path get notified.

Why this matters: without ephemeral nodes, you'd need a separate health-check mechanism to detect dead servers and clean up their registrations. Ephemeral nodes make the heartbeat and the registration the same thing.

Ephemeral sequential nodes combine both concepts. ZooKeeper appends a monotonically increasing sequence number to the path. This is the building block for leader election and fair distributed locks.

// Creates /election/node_0000000003 (ZK adds the sequence)
zk.create("/election/node_", "server-c".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL_SEQUENTIAL);

Watches: Get Notified When Things Change

Watches let a client say: "Tell me when this znode changes." The client registers a watch on a znode, and ZooKeeper sends a one-time notification when that znode is created, deleted, or modified.

One-time. That's a critical detail. After the watch fires, you must re-register it. This prevents slow consumers from accumulating a backlog of unprocessed notifications.

// Register a watch on /config/db_url
byte[] data = zk.getData("/config/db_url", new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            // Config changed! Re-read and re-register watch
            fetchConfigAndWatch();
        }
    }
}, null);

There are three types of watches:

Watch Type	Triggers On	Registered Via
Data watch	Znode data changes or deletion	`getData()`, `exists()`
Child watch	Children added or removed	`getChildren()`
Persistent watch (3.6+)	All changes, no re-registration needed	`addWatch()`

Persistent watches were added in ZooKeeper 3.6 to solve the re-registration problem. They stay active until explicitly removed. But most production systems still run on the one-time model, so know both.

Spicy opinion: ZooKeeper's one-time watch model is actually a feature, not a bug. It forces clients to re-read the data when they re-register, which prevents stale state. Persistent watches sound convenient but make it easier to miss updates during reconnection windows.

Leader Election: The Classic Recipe

This is the pattern that shows up in every system design interview involving distributed systems. Here's how it actually works.

Step 1: Each candidate creates an ephemeral sequential node under /election/.

Server-A creates /election/node_0000000001
Server-B creates /election/node_0000000002
Server-C creates /election/node_0000000003

Step 2: Each candidate calls getChildren("/election") to see all candidates.

Step 3: The candidate with the lowest sequence number is the leader.

Step 4: Non-leaders don't watch the leader. They watch the node with the sequence number just before theirs.

Server-C watches /election/node_0000000002 (not node_0000000001)
Server-B watches /election/node_0000000001 (the current leader)

Why not watch the leader directly? Because if the leader dies and all 200 candidates watch it, you get a herd effect — 200 simultaneous getChildren calls and watch re-registrations. By watching only the predecessor, at most one server reacts to any single failure.

Leader dies:
  /election/node_0000000001 (Server-A) → DELETED

  Server-B's watch fires. It checks children:
    /election/node_0000000002 ← lowest → Server-B is new leader
    /election/node_0000000003

  Server-C's watch didn't fire. It's still watching node_0000000002.
  No herd effect.

Step 5: If the predecessor dies, the watcher checks if it's now the lowest. If yes, it becomes leader. If not, it watches its new predecessor.

LinkedIn's Apache Helix uses exactly this recipe for partition leadership in their distributed database infrastructure.

Distributed Lock Recipe

Almost identical to leader election, but with lock semantics.

// 1. Create ephemeral sequential node under /locks/my-resource/
String lockPath = zk.create("/locks/my-resource/lock_",
    myId.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
    CreateMode.EPHEMERAL_SEQUENTIAL);

// 2. Get all children of /locks/my-resource/
List<String> children = zk.getChildren("/locks/my-resource", false);
Collections.sort(children);

// 3. If I'm the lowest, I hold the lock
if (lockPath.endsWith(children.get(0))) {
    // Lock acquired — do work
    doProtectedWork();
    // Release lock by deleting the node
    zk.delete(lockPath, -1);
} else {
    // 4. Watch the node just before mine
    String predecessor = findPredecessor(children, lockPath);
    zk.exists("/locks/my-resource/" + predecessor, watchEvent -> {
        // Predecessor released lock, retry
        tryAcquireLock();
    });
}

The ephemeral aspect is critical. If a lock holder crashes, its ephemeral node disappears, and the next waiter gets the lock. No deadlocks from crashed processes.

Read-write locks extend this pattern. Writers create nodes with a write_ prefix and wait for all earlier nodes. Readers create nodes with a read_ prefix and only wait for earlier write nodes. Multiple readers can proceed concurrently.

Configuration Management

Store configuration in a znode. All services watch it. When you update the config, every watcher gets notified.

/config/rate_limit → "1000"   (all API servers watch this)
/config/db_url    → "postgres://primary:5432"
/config/features  → '{"new_checkout": true}'

This is simpler than it sounds. And it solves a real problem: how do you update configuration across 500 servers without restarting them? Consul and etcd solve the same problem, but ZooKeeper was doing it a decade earlier.

The update flow:

Admin writes new value to the znode (via CLI or custom tool)
ZooKeeper notifies all watchers
Each service re-reads the config and applies the change
Each service re-registers its watch

Netflix used ZooKeeper for dynamic configuration in its early migration to AWS. The pattern works because config data is small (easily under the 1 MB limit) and changes rarely.

ZAB Protocol: How ZooKeeper Stays Consistent

ZooKeeper Atomic Broadcast (ZAB) is the consensus protocol under the hood. It predates Raft by several years and solves the same fundamental problem: getting multiple servers to agree on a sequence of operations.

The ensemble works like this:

Client Write Request
        │
        ▼
    ┌────────┐
    │ Leader │  ← All writes go through the leader
    └────┬───┘
         │ PROPOSE (broadcast to followers)
    ┌────┼─────────────┐
    ▼    ▼             ▼
┌──────┐ ┌──────┐ ┌──────┐
│ F-1  │ │ F-2  │ │ F-3  │
└──┬───┘ └──┬───┘ └──┬───┘
   │ ACK    │ ACK    │ ACK
   └────────┼────────┘
            ▼
     Leader gets majority ACKs → COMMIT

Key points:

Writes go through the leader only. Followers forward write requests to the leader.
Reads can go to any server. This means reads can be stale. For linearizable reads, use sync() before getData().
Majority quorum: a write is committed once a majority acknowledges it.
Total ordering: every write gets a unique, increasing transaction ID (zxid). All servers apply writes in the same order.

ZAB is not Raft, but the ideas overlap. Both use leader-based replication, majority quorum, and log ordering. Raft was designed to be easier to understand. ZAB was designed to be correct. The practical difference for interviews? Nearly zero. Know one, you basically know the other.

The Ensemble: Sizing and Failure Tolerance

A ZooKeeper deployment is called an ensemble. Always an odd number of nodes.

Ensemble Size	Tolerates Failures	Notes
1	0	Dev only. Never production.
3	1	Minimum for production
5	2	Standard for large deployments
7	3	Rarely needed. More nodes = slower writes

Why odd? A 4-node ensemble tolerates 1 failure (needs 3 for quorum). A 3-node ensemble also tolerates 1 failure (needs 2 for quorum). The 4th node adds cost without adding fault tolerance.

Spicy opinion: most teams should run 3-node ensembles. Five is defensible if you need to survive rolling upgrades with a simultaneous failure. Seven is almost always overkill and slows down every write because more followers need to acknowledge.

Session Semantics and Partition Behavior

When a client connects to ZooKeeper, it creates a session with a configured timeout (typically 6-30 seconds).

The session lifecycle:

CONNECTING → CONNECTED → ... → DISCONNECTED → EXPIRED
                ↑                    │
                └── reconnect ───────┘ (if within timeout)

During a network partition:

Client loses connection to its ZooKeeper server
Client enters DISCONNECTED state and tries to reconnect to another server in the ensemble
If it reconnects before the session timeout, all ephemeral nodes survive
If the session timeout expires, ZooKeeper deletes all ephemeral nodes — the client is effectively dead

This means: a temporary network glitch doesn't cause leader re-election or lock release. Only a sustained failure beyond the session timeout triggers cleanup. Set the timeout too short and you get false positives (flapping). Set it too long and failure detection is slow.

Kafka's original ZooKeeper dependency used a 6-second session timeout by default. In practice, teams often bumped this to 18-30 seconds to avoid false leader elections during GC pauses or network hiccups.

Patterns for System Design Interviews

Pattern 1: Leader election for singleton services. "Only one instance should process payments at a time" → ZooKeeper leader election. Mention ephemeral sequential nodes and the herd-effect avoidance.

Pattern 2: Service discovery. Each service registers an ephemeral node under /services/{name}/. Consumers list children to find available instances. Dead services auto-deregister.

Pattern 3: Distributed configuration. "How do we update feature flags across 500 servers?" → Store in a znode, services watch for changes. Instant propagation, no polling.

Pattern 4: Cluster membership. "How does the system know which shards are alive?" → Each shard creates an ephemeral node. The coordinator watches the children list.

Trade-offs Table

Dimension	Advantage	Disadvantage
Consistency	Strong consistency (linearizable writes)	Reads can be stale without `sync()`
Data size	Fast for small metadata	1 MB limit per znode, not a database
Failure detection	Ephemeral nodes auto-cleanup	Session timeout delay (seconds, not ms)
Watches	Real-time notifications	One-time triggers, must re-register
Operations	Battle-tested, decade-old code	JVM-based, needs GC tuning
Protocol	ZAB is proven correct	Less understood than Raft
Scaling	Reads scale with followers	Writes bottleneck on single leader

Znode Tree

Interview Gotchas

Gotcha 1: "Can ZooKeeper handle millions of keys?" No. ZooKeeper stores everything in memory on every node. It's designed for thousands to tens of thousands of znodes, not millions. If you need millions of keys, use Redis or etcd (which is slightly more tolerant of larger datasets).

Gotcha 2: "Are reads strongly consistent?" By default, no. A read can hit a follower that hasn't replicated the latest write. For strong consistency, call sync() before getData(). In practice, most use cases tolerate slightly stale reads.

Gotcha 3: "What happens if the leader dies?" Followers run a new leader election (via ZAB). Typical election takes 200ms to a few seconds. During this window, writes are blocked. Reads can still be served by followers.

Gotcha 4: "Why not just use a database for coordination?" A database doesn't give you ephemeral nodes, watches, or sequential ordering. You'd have to build all of that yourself — plus a reliable failure detector. ZooKeeper packages these primitives so you don't reinvent them.

Gotcha 5: "Why is Kafka removing ZooKeeper?" Operational complexity. Running ZooKeeper alongside Kafka doubles the infrastructure to manage. KRaft (KIP-500) embeds Raft-based metadata management directly into Kafka brokers. One fewer system to deploy, monitor, and debug.