Vector Clocks, Lamport Timestamps, and Causal Ordering

TL;DR

Physical clocks lie. NTP drifts. Leap seconds happen. You cannot use wall-clock time to order events across machines. Lamport timestamps give you a total order that is consistent with causality but cannot detect concurrency. Vector clocks can detect causality and conflicts but grow linearly with the number of nodes. Version vectors (used by DynamoDB) are the practical evolution. Google TrueTime sidesteps the entire problem with GPS and atomic clocks, bounding uncertainty to ~7ms. Every distributed system makes a choice about how to order events. Understanding that choice tells you what the system can and cannot guarantee.

The Problem

Machine A writes x = 1 at 10:00:00.000. Machine B writes x = 2 at 10:00:00.001. Which write happened first?

You want to say "A happened first" because its timestamp is earlier. But Machine A's clock might be 50ms ahead of Machine B's clock. In reality, B's write might have happened first. NTP synchronization keeps clocks within a few milliseconds of each other, but "a few milliseconds" is an eternity when your database processes thousands of writes per second.

This is not a theoretical concern. In 2012, a leap second caused a Linux kernel bug that took down Reddit, Mozilla, Yelp, and large chunks of the internet. Clocks are not to be trusted.

The fundamental question: how do you order events in a distributed system when you cannot trust physical time?

The Algorithm

Happened-Before Relation

Lamport defined the happened-before relation (denoted ->) in his 1978 paper:

If a and b are events in the same process, and a comes before b, then a -> b.
If a is the sending of a message and b is the receipt of that message, then a -> b.
If a -> b and b -> c, then a -> c (transitivity).

Two events are concurrent (denoted a || b) if neither a -> b nor b -> a. Concurrent events have no causal relationship -- they cannot have influenced each other.

Lamport Timestamps

Lamport Clock Progression Across Processes

Each process maintains a counter C. Rules:

1. Before each event, increment C:
   C = C + 1

2. When sending a message, attach timestamp C.

3. When receiving a message with timestamp T:
   C = max(C, T) + 1

Property: If a -> b, then L(a) < L(b). The converse is NOT true: L(a) < L(b) does NOT imply a -> b.

Worked Example: Lamport Timestamps

Process P1:          Process P2:          Process P3:
C=0                  C=0                  C=0

a (C=1)
send to P2 (C=1) -->
                     recv (C=max(0,1)+1=2)
                     b (C=3)
                     send to P3 (C=3) -->
                                          recv (C=max(0,3)+1=4)
                                          c (C=5)
d (C=2)

Event a -> b -> c and timestamps 1 < 3 < 5. Correct.

But event d has timestamp 2. L(d) < L(b) = 3. Does d -> b? No. d and b are concurrent -- they happened independently. Lamport timestamps cannot distinguish "happened before" from "happened concurrently."

Why Lamport Timestamps Fall Short

Lamport timestamps provide a total order consistent with causality. This is useful for things like mutual exclusion and totally ordered broadcast. But they lose information. Given two timestamps, you cannot tell whether the events are causally related or concurrent.

This matters for conflict detection. If two users edit the same document and their edits are concurrent, you want to detect the conflict. Lamport timestamps cannot tell you the edits are concurrent -- they will give one edit a lower timestamp, implying a false ordering.

Vector Clocks

Vector Clock Causality Detection

Vector clocks solve this by having each process maintain a vector of counters, one per process.

Process Pi maintains vector VC[0..n-1] where n is the number of processes.

1. Before each event at process Pi:
   VC[i] = VC[i] + 1

2. When Pi sends a message, attach entire vector VC.

3. When Pi receives a message with vector T:
   For each j: VC[j] = max(VC[j], T[j])
   VC[i] = VC[i] + 1

Comparison rules:

VC(a) < VC(b) iff for all j, VC(a)[j] <= VC(b)[j] and for some j, VC(a)[j] < VC(b)[j].
VC(a) == VC(b) iff for all j, VC(a)[j] == VC(b)[j].
VC(a) || VC(b) (concurrent) iff neither VC(a) <= VC(b) nor VC(b) <= VC(a).

Worked Example: Vector Clocks

Process P1:              Process P2:              Process P3:
VC=[0,0,0]               VC=[0,0,0]               VC=[0,0,0]

a: VC=[1,0,0]
send to P2: VC=[1,0,0] -->
                          recv: VC=[1,1,0]  (max + inc P2)
                          b: VC=[1,2,0]
                          send to P3: VC=[1,2,0] -->
                                                    recv: VC=[1,2,1]
                                                    c: VC=[1,2,2]
d: VC=[2,0,0]

Compare b=[1,2,0] with d=[2,0,0]: - b[0]=1 < d[0]=2 but b[1]=2 > d[1]=0. - Neither dominates. b and d are concurrent. Exactly right.

Compare a=[1,0,0] with b=[1,2,0]: - a[0]=1 <= b[0]=1, a[1]=0 <= b[1]=2, a[2]=0 <= b[2]=0. - a <= b. a happened before b. Exactly right.

Version Vectors

Version vectors are a practical adaptation of vector clocks used by distributed databases. Instead of tracking every event, they track versions of a specific data item per replica.

When DynamoDB detects a write conflict:

Replica A writes key "user:123":  VV = {A: 1}
Replica B writes key "user:123":  VV = {B: 1}

Client reads from A: gets value_A with VV {A: 1}
Client reads from B: gets value_B with VV {B: 1}

Compare: {A:1} vs {B:1}. Neither dominates. Conflict.
DynamoDB returns both versions. Application resolves.

After resolution:

Client sends resolved value to A: VV = {A: 2, B: 1}
A replicates to B. B sees {A:2, B:1} dominates {B:1}. Accepts.

The practical advantage: version vectors only grow with the number of replicas, not the number of events. For a system with 3 replicas, the vector has 3 entries regardless of how many million writes happen.

Google TrueTime

Google took a completely different approach with Spanner. Instead of logical clocks, they made physical clocks trustworthy.

TrueTime API returns an interval [earliest, latest] instead of a single timestamp. The interval represents uncertainty in the current time.

TT.now() → TTinterval {earliest, latest}
TT.after(t) → true if t is definitely in the past
TT.before(t) → true if t is definitely in the future

How they achieve this:

GPS receivers and atomic clocks in every datacenter.
Time masters synchronize servers every 30 seconds.
Uncertainty is typically 1-7ms (mostly due to communication delay to time master).

Spanner's commit protocol: When committing a transaction at timestamp s, Spanner waits until TT.after(s) is true before making the transaction visible. This guarantees that any transaction that starts after the commit will see a higher timestamp. The wait is bounded by the uncertainty interval, typically ~7ms.

The trade-off is real: you need GPS receivers and atomic clocks in every datacenter. You need a global time synchronization infrastructure. This is why only Google (and a few other hyperscalers) do this.

Proof/Correctness Intuition

Why Vector Clocks Capture Causality

Vector clocks work because they encode the full causal history of each event.

VC(a)[i] represents: "at the time of event a, process Pi had executed this many events that causally precede a (or are a itself)."

If a -> b, then every event in a's causal history is also in b's causal history (by transitivity). Therefore VC(a)[j] <= VC(b)[j] for all j. Conversely, if VC(a) < VC(b), there exists a chain of messages connecting a's history to b.

If a || b, then each process has events in its history that the other does not know about. This means there exists some j where VC(a)[j] > VC(b)[j] and some k where VC(a)[k] < VC(b)[k]. The vectors are incomparable.

Why Lamport Timestamps Cannot Detect Concurrency

Lamport timestamps compress the entire causal history into a single integer. This compression is lossy. You cannot recover whether two events are causally related or concurrent from their timestamps alone. The total order imposed by Lamport timestamps is consistent with causality but includes arbitrary ordering of concurrent events.

Real-World Usage

System	Clock Mechanism	Why
DynamoDB	Version vectors	Detect conflicts in multi-master replication
Riak	Vector clocks (originally), dotted version vectors	Conflict detection, sibling resolution
Cassandra	Lamport timestamps (LWW)	Last-writer-wins, no conflict detection
Google Spanner	TrueTime (GPS + atomic clocks)	External consistency (linearizability)
CockroachDB	Hybrid Logical Clocks (HLC)	Combines physical time + logical counter
MongoDB	Lamport-like cluster time	Causal consistency sessions

Hybrid Logical Clocks (HLC), used by CockroachDB, are worth knowing. They combine physical time and logical counters:

HLC = (physical_time, logical_counter, node_id)

On local event:
  pt = max(HLC.pt, wall_clock)
  if pt == HLC.pt: lc = HLC.lc + 1
  else: lc = 0

On receive(msg_hlc):
  pt = max(HLC.pt, msg_hlc.pt, wall_clock)
  if pt == HLC.pt == msg_hlc.pt: lc = max(HLC.lc, msg_hlc.lc) + 1
  elif pt == HLC.pt: lc = HLC.lc + 1
  elif pt == msg_hlc.pt: lc = msg_hlc.lc + 1
  else: lc = 0

HLCs give you the causal ordering of Lamport timestamps plus timestamps that are close to physical time. CockroachDB uses them for MVCC (multi-version concurrency control).

Interview Application

When to mention clocks and ordering:

"How does your multi-region database handle conflicting writes?" -- Vector clocks or version vectors for conflict detection, or LWW with Lamport timestamps if you accept data loss.
"How does Spanner achieve external consistency?" -- TrueTime with bounded uncertainty. Commit-wait protocol.
"How do you order events in a distributed system?" -- Depends on what you need. Total order: Lamport timestamps. Causality detection: vector clocks. Real time: TrueTime.

What interviewers want to hear:

You know physical clocks are unreliable in distributed systems.
You can explain the happened-before relation.
You understand the trade-off between Lamport timestamps (simple, no conflict detection) and vector clocks (conflict detection, higher overhead).
You know that Cassandra uses LWW (and can lose data), while DynamoDB uses version vectors (and can detect conflicts).

Trade-offs

Mechanism	Overhead	Detects Causality	Detects Conflicts	Real Time
Lamport Timestamps	1 integer	No	No	No
Vector Clocks	n integers	Yes	Yes	No
Version Vectors	r integers (replicas)	Per-key	Per-key	No
HLC	2 integers + ID	Partial	No	Approximate
TrueTime	1 interval	Implicit	N/A (serialized)	Yes

The Cassandra trade-off: Cassandra uses last-writer-wins (LWW) with timestamps. If two clients write to the same key concurrently, the write with the higher timestamp survives, and the other is silently lost. No conflict detection. This is a deliberate choice: simplicity and performance over correctness. It works fine for metrics and logs. It is dangerous for financial transactions.

The DynamoDB trade-off: DynamoDB detects conflicts and returns both versions to the client. The application must resolve the conflict. This pushes complexity to the application layer. Amazon's shopping cart uses this: if two concurrent writes add different items to the cart, merge them. Better to have an extra item in the cart than to lose one.

Common Mistakes

NTP makes clocks accurate enough

NTP keeps clocks within a few milliseconds of each other under normal conditions. But NTP can jump (step correction), drift (frequency error), and fail (network partition to NTP server). Any system that relies on "millisecond-accurate NTP" for correctness will eventually lose data.

Lamport timestamps tell you which event happened first

Lamport timestamps give you a total order consistent with causality. If L(a) < L(b), you know a did not happen after b. But you cannot conclude a happened before b -- they might be concurrent. The converse of a -> b implies L(a) < L(b) is not true.

Vector clocks scale to large clusters

Vector clocks have one entry per process. In a system with 1000 nodes, every event carries a vector of 1000 integers. This is why Riak moved from vector clocks to dotted version vectors, and why most systems use version vectors (one entry per replica, not per client).

TrueTime eliminates the need for consensus

TrueTime provides bounded clock uncertainty. Spanner still uses Paxos for consensus within each shard. TrueTime enables externally consistent transactions across shards by providing a global time reference. It complements consensus; it does not replace it.

Last-writer-wins is always bad

LWW is a valid strategy for data that is append-only, idempotent, or low-value. Metrics, logs, and caches are fine with LWW. The problem arises when you use LWW for data where silent loss is unacceptable: user accounts, financial records, or shopping carts.

You always need vector clocks for a distributed database

If your database is single-leader (all writes go through one node), you do not need vector clocks. The leader provides a total order. Vector clocks are necessary for multi-leader or leaderless architectures where concurrent writes to the same key are possible.