When You Need Coordination and When You Don't

TL;DR

Most teams reach for ZooKeeper or etcd too early — coordination services add a new failure mode, operational burden, and latency to every operation, so you should prove you actually need one before deploying it.

The Coordination Tax

Coordination Decision

Adding a coordination service to your architecture isn't free. It costs you in three ways:

Operational overhead. You now run another distributed system. It needs monitoring, alerting, backup, upgrades, capacity planning, and an on-call rotation that understands it. A 3-node etcd cluster is one more thing that can page you at 3 AM.

Latency. Every coordination operation requires a round trip to the coordination cluster — and often a majority quorum write. A ZooKeeper write takes 2-10ms in a well-tuned cluster. That's fine for leader election (happens rarely) but expensive if you're doing it on every request.

New failure mode. If your coordination service goes down, what happens? If the answer is "the entire system stops," you've created a single point of failure that you didn't have before.

Uber learned this the hard way. They ran a massive ZooKeeper deployment for service discovery and configuration. The operational cost was so high that they built their own system (Ringpop) to avoid ZooKeeper for some use cases, and eventually moved much of their service discovery to DNS-based approaches.

Before adding ZooKeeper or etcd, ask: "Can I solve this without coordination?"

When You Definitely Need Coordination

Some problems genuinely require distributed agreement. Don't try to hack around these.

Leader Election

When exactly one process must perform a task — processing payments, running a scheduler, owning a shard — you need leader election. There's no shortcut.

Without coordination:
  Server-A: "I'll process payments!"
  Server-B: "I'll process payments!"
  → Double charges. Lawsuits. Bad day.

With coordination:
  ZooKeeper/etcd elects one leader.
  Only the leader processes payments.
  If leader dies, a new one takes over.

You can try timestamp-based "leader" selection (lowest uptime wins), but network partitions will give you two leaders. You can try database-based locks, but databases aren't designed for fast failure detection. For real leader election, use a real coordination service.

Where this shows up: Kafka controller election, Elasticsearch master node, any singleton service.

Distributed Locks with Fencing

When multiple processes compete for a shared resource and you need mutual exclusion with correctness guarantees, you need a coordination service with fencing tokens.

A fencing token is a monotonically increasing number issued with the lock. The resource (database, file system, external API) checks the token and rejects operations from stale lock holders.

Timeline:
  T=0   Client-A acquires lock, gets fencing token 33
  T=1   Client-A starts writing to storage
  T=2   Client-A hits a long GC pause...
  T=3   Lock expires (client missed heartbeat)
  T=4   Client-B acquires lock, gets fencing token 34
  T=5   Client-B writes to storage with token 34
  T=6   Client-A wakes up, tries to write with token 33
  T=7   Storage rejects token 33 (< current token 34)
  → Correctness preserved.

Without fencing tokens, Client-A's write at T=6 would corrupt data. This is Martin Kleppmann's core argument in his famous "How to do distributed locking" blog post. He demonstrated that Redlock (Redis distributed lock) doesn't provide fencing tokens, making it unsafe for correctness-critical operations.

Spicy opinion: if you don't need fencing tokens, you probably don't need a distributed lock at all. An approximate lock (Redis SET NX EX) is fine for performance optimization. A proper lock (ZooKeeper/etcd with fencing) is for correctness. Know which one you need.

Configuration Management

When you need to update configuration across hundreds of servers without restarts, and you need the update to propagate in seconds, a coordination service is the right tool.

Admin updates /config/rate_limit from 1000 to 500
    → ZooKeeper/etcd notifies all watchers
    → 500 servers pick up the change within seconds
    → No restart, no deployment, no rolling update

Alternatives exist (environment variables, config files, feature flag services like LaunchDarkly), but they either require restarts or add their own complexity. If you're already running a coordination service, config management is a natural fit.

Service Discovery

Services need to find each other. Hardcoded IPs break when instances scale up, scale down, or move. A coordination service can be the registry.

Payment-API starts:
  → Registers /services/payment/instance-3 = "10.0.1.15:8080"

Order-Service needs Payment-API:
  → Lists /services/payment/* → gets all healthy instances
  → Picks one (round-robin, random, least connections)

But there are lighter-weight alternatives. Kubernetes has built-in service discovery via DNS and kube-proxy. AWS has ELB and Cloud Map. You don't always need a dedicated coordination service for this.

Cluster Membership

"Which nodes are alive in my cluster right now?" Ephemeral nodes or lease-based keys solve this cleanly. When a node dies, its registration disappears.

This matters for: shard assignment, replica placement, task distribution, and consensus group membership.

When You Don't Need Coordination

Here's where teams over-engineer. Not every distributed problem requires consensus.

Caching

Redis handles caching. You don't need etcd for cache invalidation. A cache miss is not a correctness problem — it's a performance problem. Eventual consistency is fine.

BAD:  "Let's use ZooKeeper to coordinate cache invalidation
       across all servers so no one ever reads stale data."

GOOD: "Set a TTL on cache entries. Accept that some reads
       might be slightly stale. This is fine."

If you need stronger cache consistency, use a write-through cache with database triggers or CDC (change data capture). Still no coordination service needed.

Rate Limiting

Approximate rate limiting is almost always good enough. A sliding window counter in Redis (INCR + EXPIRE) gets you within 1-2% of the exact rate. Nobody cares if you allow 1,002 requests instead of 1,000.

BAD:  "We need a distributed lock to ensure exactly 1,000
       requests per minute."

GOOD: "Each server tracks its own counter. With N servers,
       set each limit to 1000/N. Close enough."

Stripe uses a per-server token bucket for most rate limiting. It's not perfectly accurate. It doesn't need to be.

Session Storage

User sessions belong in Redis, Memcached, or a database. Not in a coordination service. Sessions are per-user data, not cluster metadata. The read/write patterns are completely different.

Job Queues

Use SQS, RabbitMQ, or Kafka for job queues. Coordination services are designed for metadata that changes rarely — not for high-throughput message passing. If you're creating and deleting thousands of znodes per second, you're fighting the tool.

Metrics and Monitoring

Prometheus scrapes targets. Datadog agents push metrics. Neither needs a coordination service. Service discovery for monitoring targets can use DNS, Kubernetes labels, or cloud provider APIs.

Anti-Patterns

Anti-Pattern: ZooKeeper as a Database

ZooKeeper holds everything in memory on every node. The 1 MB per-znode limit exists for a reason. Storing user profiles, product catalogs, or any growing dataset in ZooKeeper will eventually crash your ensemble.

BAD: Storing 50,000 user records in ZooKeeper
     → 50,000 znodes × N replicas × memory
     → Watch storm on any update
     → Slow snapshot/restore

GOOD: Store user records in PostgreSQL.
      Use ZooKeeper only for "which PostgreSQL shard
      owns user range 1-10000?"

Rule of thumb: if your data grows with users or traffic, it doesn't belong in a coordination service.

Anti-Pattern: Distributed Lock Without Fencing

This is the most dangerous pattern. You acquire a lock, do work, release the lock. But what if the lock expires while you're still working?

WITHOUT FENCING:
  Client-A: acquire lock ✓
  Client-A: start writing to S3
  Client-A: GC pause for 30 seconds...
  Lock expires.
  Client-B: acquire lock ✓
  Client-B: start writing to S3
  Client-A: wakes up, finishes writing to S3
  → Both wrote. Data corrupted. Lock was useless.

WITH FENCING:
  Client-A: acquire lock, token=33 ✓
  Client-A: write to S3 with token=33
  Lock expires.
  Client-B: acquire lock, token=34 ✓
  Client-B: write to S3 with token=34
  Client-A: wakes up, tries token=33
  S3: rejects (33 < 34)
  → Safe.

The problem: most storage systems (S3, databases, file systems) don't natively support fencing tokens. You'd need to add a version column to your database table or use conditional writes (S3 conditional requests, DynamoDB conditions). If you can't add fencing to the downstream system, the distributed lock provides only a "best effort" guarantee.

Anti-Pattern: Coordination for Idempotent Operations

If the operation is idempotent (safe to repeat), you might not need a lock at all.

IDEMPOTENT: "Set user email to alice@example.com"
  → If two servers do this simultaneously, the result
    is the same. No lock needed.

NOT IDEMPOTENT: "Increment user balance by $50"
  → If two servers do this, you get $100. Need a lock
    or use atomic operations (database UPDATE ... SET
    balance = balance + 50).

For non-idempotent operations, consider atomic database operations (CAS, conditional updates) before reaching for a distributed lock. The database is already a coordination point — leverage it.

Alternative: Optimistic Concurrency Control

Instead of a lock: read the current version, do your work, then write with a version check. If the version changed, retry.

-- Read current version
SELECT version, balance FROM accounts WHERE id = 42;
-- version=7, balance=100

-- Update with version check
UPDATE accounts
SET balance = 150, version = 8
WHERE id = 42 AND version = 7;

-- If 0 rows affected: someone else updated. Retry.

This is optimistic: you assume no contention and only check at write time. For low-contention scenarios (most updates touch different rows), this is faster than acquiring a lock because there's no lock acquisition round trip.

DynamoDB uses this natively with conditional expressions. Firestore uses it with transactions and document versions. It's the default concurrency strategy in many modern databases.

When optimistic concurrency fails: high contention (many writers to the same key). Retries pile up. In that case, a pessimistic lock (or a queue) is better.

Decision Flowchart

Do you need multiple servers to agree on something?
├── No → Skip coordination. You're done.
├── Yes
│   ├── Is it leader election or cluster membership?
│   │   └── Yes → Use ZooKeeper/etcd. No shortcut.
│   ├── Is it a shared resource needing mutual exclusion?
│   │   ├── Is the operation idempotent?
│   │   │   └── Yes → No lock needed. Just do it.
│   │   ├── Can you use atomic DB operations (CAS)?
│   │   │   └── Yes → Use optimistic concurrency. Simpler.
│   │   ├── Do you need correctness (not just performance)?
│   │   │   ├── Yes → ZooKeeper/etcd lock WITH fencing token
│   │   │   └── No → Redis SET NX EX is fine
│   ├── Is it configuration or feature flags?
│   │   ├── Already have ZK/etcd? → Use it.
│   │   └── Don't have ZK/etcd? → Consider LaunchDarkly,
│   │       env vars, or config files + restart.
│   └── Is it service discovery?
│       ├── On Kubernetes? → Built-in. Skip.
│       ├── On AWS? → ELB + Cloud Map. Skip.
│       └── Bare metal / custom? → Consul or etcd.

The CAP Trade-off

ZooKeeper and etcd are both CP systems. They prioritize consistency over availability.

During a network partition:

The majority partition keeps working (can read and write).
The minority partition stops accepting writes. Reads may also fail depending on configuration.
When the partition heals, the minority catches up from the leader's log.

This means: if your coordination service is unavailable, any system depending on it for locks, leader election, or configuration is also effectively unavailable. You've coupled your system's availability to the coordination service's availability.

Spicy opinion: this is usually the right trade-off. If your leader election gives you two leaders during a partition, you have a correctness bug that's worse than downtime. CP is the correct choice for coordination. But understand that you're making it — and design your system so that coordination failures degrade gracefully rather than causing total outage.

Graceful degradation examples:

Lock acquisition fails: queue the operation and retry, or return an error to the user.
Config read fails: use the last known config (cached locally). Stale config is better than no config.
Service discovery fails: use the last known healthy instances. Stale list is better than no list.

Patterns for System Design Interviews

Pattern 1: "How do you prevent two servers from doing the same work?" First ask: is the work idempotent? If yes, let both do it. If no, consider optimistic concurrency (database CAS) before reaching for a distributed lock.

Pattern 2: "What if the coordination service goes down?" Show that you've thought about degradation. "The system continues with cached config / last known leader / stale service list. It doesn't crash — it just can't re-elect or re-configure until coordination is back."

Pattern 3: "Can you use Redis instead of ZooKeeper?" For performance-oriented locking (rate limiting, deduplication), yes. For correctness-critical locking (financial transactions, exactly-once processing), no — Redis doesn't guarantee that a write is replicated before acknowledging it.

Trade-offs Table

Approach	Consistency	Complexity	Latency	Best For
No coordination	Eventual / none	Lowest	Lowest	Idempotent operations
Redis lock (SET NX)	Best effort	Low	~1ms	Rate limiting, dedup
Database CAS	Strong (per-row)	Medium	~5-20ms	Low-contention updates
ZooKeeper/etcd lock	Linearizable	High	~5-15ms	Leader election, fencing
Consul	Strong	High	~5-15ms	Service discovery + locks

Optimistic Concurrency

Interview Gotchas

Gotcha 1: "Why not just use a database for leader election?" You can, but databases aren't optimized for fast failure detection. A database lock with a 30-second timeout means 30 seconds of downtime on leader failure. ZooKeeper/etcd detect failures in 6-10 seconds via session/lease expiry.

Gotcha 2: "Is a distributed lock always necessary for exactly-once processing?" No. Idempotency keys (store the operation ID, reject duplicates) often work better. Kafka's exactly-once semantics use transaction IDs, not distributed locks.

Gotcha 3: "Can two nodes both think they're the leader?" Yes, briefly. This is called a split-brain. It happens during network partitions. That's why fencing tokens matter — the downstream system (database, storage) is the final arbiter, not the lock itself.

Gotcha 4: "What's the difference between a mutex and a distributed lock?" A mutex protects shared memory in one process. A distributed lock protects a shared resource across processes on different machines. The failure modes are completely different — a process crash releases a mutex immediately, but a distributed lock requires a timeout or session expiry.

Gotcha 5: "Should I mention ZooKeeper or etcd in my design?" Only if you have a genuine coordination problem (leader election, distributed lock, config management). Don't add it "just in case." If the interviewer asks "how do you elect a leader?", say "ZooKeeper or etcd for leader election" and move on. Don't explain ZAB unless asked.