ZAB — ZooKeeper Atomic Broadcast — Consensus

Overview

What this concept solves

ZAB — ZooKeeper Atomic Broadcast — is the consensus protocol inside Apache ZooKeeper, designed by Flavio Junqueira, Benjamin Reed, and Marco Serafini in 2008. The job is slightly narrower than general consensus: ZAB delivers a total order of write transactions to a quorum-replicated service. Every server applies the same operations in the same order, which is exactly what ZooKeeper needs to run a hierarchical key-value store with strong consistency.

ZAB looks superficially like Multi-Paxos or Raft: there's a leader, a quorum, and a per-write propose/ack/commit cycle. What sets it apart is its synchronization phase: when a new leader is elected, before serving any new writes, it brings the logs of a quorum of followers into exact agreement with its own. That hard sync simplifies the steady-state protocol (by the time normal broadcast begins, every participating server is exactly caught up), and it's why ZooKeeper offers such crisp ordering guarantees.

ZAB's transactions carry a zxid: a 64-bit ID combining a 32-bit epoch (incremented on every leader change) and a 32-bit counter (incremented on every write and restarted when a new epoch begins). The election rule is dead simple: among the servers taking part, the one with the highest zxid wins, because by ZAB's invariants it must already hold every committed transaction. The sync phase then ships that history to anyone behind. After sync, the leader just streams Propose → ACK → Commit in strict FIFO order; that is the "atomic broadcast" in the name.
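
A minimal sketch of the zxid layout described above, assuming the epoch sits in the high 32 bits and the counter in the low 32 bits. The function names packZxid and unpackZxid are illustrative, not ZooKeeper's API. With this layout a plain integer comparison already gives "epoch first, then counter" ordering.

```go
package main

import "fmt"

// packZxid combines a 32-bit epoch and a 32-bit counter into one 64-bit
// zxid: epoch in the high half, counter in the low half.
func packZxid(epoch, counter uint32) uint64 {
  return uint64(epoch)<<32 | uint64(counter)
}

// unpackZxid splits a zxid back into its epoch and counter.
func unpackZxid(zxid uint64) (epoch, counter uint32) {
  return uint32(zxid >> 32), uint32(zxid)
}

func main() {
  lastOfEpoch1 := packZxid(1, 500)
  firstOfEpoch2 := packZxid(2, 1)

  // Integer comparison orders by epoch first, then by counter, so any
  // write from a newer epoch sorts after every write from an older one.
  fmt.Println(firstOfEpoch2 > lastOfEpoch1) // true

  e, c := unpackZxid(firstOfEpoch2)
  fmt.Println(e, c) // 2 1
}
```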

Mechanics

How it works

Phase 1 — Leader election (by highest zxid)

Each server advertises its (epoch, zxid), i.e. the last transaction in its log and so how much history it holds.
Among the servers taking part in the vote, the one with the highest zxid wins. Every committed transaction was logged by a quorum, and any two quorums overlap, so by ZAB's invariants the winner must hold all transactions committed in earlier epochs and nothing can be lost.
Ties are broken deterministically (typically by server ID), so the result is unique.
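
The election rule above can be sketched as a comparison function plus a scan over the votes. This is an illustration under stated assumptions, not ZooKeeper's election code: the Vote type and the function names are hypothetical, and the tie-break toward the lower server ID follows the simulation notes on this page rather than any particular implementation.

```go
package main

import "fmt"

// Vote is what a server advertises during election (illustrative type).
type Vote struct {
  ServerID int
  Epoch    uint32
  Counter  uint32
}

// better reports whether a should win over b: higher epoch, then higher
// counter, then (as an assumed tie-break) the lower server ID.
func better(a, b Vote) bool {
  if a.Epoch != b.Epoch {
    return a.Epoch > b.Epoch
  }
  if a.Counter != b.Counter {
    return a.Counter > b.Counter
  }
  return a.ServerID < b.ServerID
}

// elect returns the winning vote. It assumes votes is non-empty and came
// from a quorum of servers.
func elect(votes []Vote) Vote {
  winner := votes[0]
  for _, v := range votes[1:] {
    if better(v, winner) {
      winner = v
    }
  }
  return winner
}

func main() {
  votes := []Vote{
    {ServerID: 1, Epoch: 3, Counter: 7},
    {ServerID: 2, Epoch: 3, Counter: 9},
    {ServerID: 3, Epoch: 3, Counter: 9},
    {ServerID: 4, Epoch: 2, Counter: 40},
  }
  // Servers 2 and 3 tie on (3, 9); the lower ID wins in this sketch.
  fmt.Println(elect(votes).ServerID) // 2
}
```

Note that server 4 loses despite its large counter: the epoch is compared first, so a long history in an old epoch never outranks a newer epoch.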

Phase 2 — Discovery and Synchronization

New leader picks an epoch number higher than any seen and broadcasts NEW-EPOCH(e) to followers.
Followers acknowledge with their last seen zxid. The leader collects from a quorum, then ships any missing transactions to each follower (SYNC) and drops any stray uncommitted transactions on followers' logs.
Once a quorum is fully synced, meaning each follower in it has a log that exactly matches the leader's, broadcast is allowed to begin.
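
One way to picture the synchronization step is as a plan computed per follower: how many trailing entries to drop, and which leader entries to ship. The sketch below is an illustration, not ZooKeeper's sync code. It assumes a zxid identifies a transaction uniquely, so two logs agree wherever their zxids match; the names syncPlan and apply are hypothetical.

```go
package main

import "fmt"

type Zxid struct{ Epoch, Counter uint32 }

type Entry struct {
  Zxid Zxid
  Txn  string
}

// syncPlan compares a follower's log with the leader's and returns how many
// trailing follower entries to drop and which leader entries to ship.
func syncPlan(leader, follower []Entry) (drop int, ship []Entry) {
  // Length of the shared prefix: entries agree while their zxids match.
  common := 0
  for common < len(leader) && common < len(follower) &&
    leader[common].Zxid == follower[common].Zxid {
    common++
  }
  return len(follower) - common, leader[common:]
}

// apply executes a plan on a copy of the follower's log.
func apply(follower []Entry, drop int, ship []Entry) []Entry {
  kept := follower[:len(follower)-drop]
  return append(append([]Entry{}, kept...), ship...)
}

func main() {
  leader := []Entry{{Zxid{1, 1}, "a"}, {Zxid{1, 2}, "b"}, {Zxid{2, 1}, "c"}}
  // The follower logged (1,3) under the old leader, but it never committed.
  follower := []Entry{{Zxid{1, 1}, "a"}, {Zxid{1, 2}, "b"}, {Zxid{1, 3}, "x"}}

  drop, ship := syncPlan(leader, follower)
  fmt.Println(drop, len(ship)) // 1 1
  fmt.Println(apply(follower, drop, ship))
}
```

After the plan is applied the follower's log matches the leader's entry for entry, which is the precondition for broadcast.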

Phase 3 — Broadcast (steady state)

Client write hits the leader. Leader assigns the next zxid (epoch, counter+1) and sends PROPOSE(zxid, txn) to every follower.
Each follower writes the txn to its log (tentative) and replies ACK.
When a quorum of ACKs arrives, the leader sends COMMIT(zxid); all servers apply the txn to their state machine in zxid order.
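
The follower's side of the three steps above can be sketched as a small queue: proposals are logged in arrival order and applied only when their COMMIT arrives. This is a simplified model that assumes the leader's channel is FIFO, so a COMMIT must always refer to the oldest pending proposal. The Follower type and its method names are hypothetical.

```go
package main

import (
  "errors"
  "fmt"
)

type Zxid struct{ Epoch, Counter uint32 }

type proposal struct {
  zxid Zxid
  txn  string
}

// Follower keeps proposals in arrival order and applies them on COMMIT.
type Follower struct {
  pending []proposal // logged but not yet committed, oldest first
  applied []string   // stand-in for the state machine
}

// OnPropose logs the transaction tentatively; returning nil is the ACK.
func (f *Follower) OnPropose(z Zxid, txn string) error {
  f.pending = append(f.pending, proposal{z, txn})
  return nil
}

// OnCommit applies the oldest pending proposal, which must be the one being
// committed because the channel from the leader is assumed to be FIFO.
func (f *Follower) OnCommit(z Zxid) error {
  if len(f.pending) == 0 || f.pending[0].zxid != z {
    return errors.New("commit out of order")
  }
  f.applied = append(f.applied, f.pending[0].txn)
  f.pending = f.pending[1:]
  return nil
}

func main() {
  f := &Follower{}
  f.OnPropose(Zxid{1, 1}, "create /a")
  f.OnPropose(Zxid{1, 2}, "create /b")
  fmt.Println(f.OnCommit(Zxid{1, 2})) // commit out of order
  fmt.Println(f.OnCommit(Zxid{1, 1})) // <nil>
  fmt.Println(f.applied)              // [create /a]
}
```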

Why ZAB chose synchronisation upfront

Raft and Multi-Paxos tolerate per-follower log gaps and patch them lazily (in Raft, through AppendEntries). ZAB instead front-loads the cost: bring a quorum up to date before broadcast begins. That makes the per-write hot path trivial (no per-follower bookkeeping) and the ordering guarantee straightforward to reason about, at the cost of a more expensive leader change.

ZAB only delivers, in zxid order, what passed through the leader

ZAB provides FIFO total order — transactions from the leader appear in zxid order on every server. It does NOT make ZooKeeper a general-purpose state machine library: clients see operations strictly through ZooKeeper's API. The atomic-broadcast guarantee is what the service is built on, not what it exports directly.

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

About this simulation

Five ZooKeeper servers running ZAB. Pick a scenario — Elect by highest zxid, Sync followers, Broadcast a write (propose → ack → commit), or Crash the leader to trigger a new epoch. Free play queues writes and crashes; the log card shows only the last two messages.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Watch the election pick the highest-zxid server

Open Elect by highest zxid. Notice the small zxid epoch:count under each server — that's how much committed history each holds. The protocol picks the one with the highest, because by ZAB's invariant it must already have every committed transaction. Ties go to lowest server ID.

try 02

Watch synchronization fix every follower

Step through Sync followers. The new leader opens a higher epoch and ships any missing transactions to followers that are behind. By the end of sync, every live server's log is identical to the leader's — and only then is broadcast allowed to begin. Compare to Raft, which tolerates gaps and patches lazily.

try 03

Broadcast a write in one RTT

Switch to Broadcast (or queue a command after sync). The leader sends PROPOSE → quorum ACKs → COMMIT. Every server applies the txn in zxid order. Throughput-wise this is the same hot path as Raft and Multi-Paxos.

try 04

Free play — crash the leader

Open Free play. Queue a few writes, then crash the leader. Watch a new epoch open with the next-highest-zxid survivor, sync run, and broadcast resume. Try crashing the leader mid-broadcast (after Propose but before Commit). If a quorum had already ACKed the pending txn, the new leader is guaranteed to hold it and sync ships it forward. If not, its fate depends on the election: it is kept when the new leader happens to have logged it and dropped otherwise.
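
The mid-broadcast crash can be reasoned about with a small model. The sketch below assumes a 5-server ensemble whose leader (server 1) has crashed, that the pending transaction is the newest entry anywhere, and that the election picks the highest zxid among the voters. The function names are hypothetical, and the quorum enumeration is hard-wired to "three of the four survivors".

```go
package main

import "fmt"

// keptBy reports whether the pending transaction survives an election held
// among voters. The winner is the voter with the highest zxid, and the
// pending transaction is the newest one, so it survives exactly when at
// least one voter has logged it.
func keptBy(voters []int, logged map[int]bool) bool {
  for _, id := range voters {
    if logged[id] {
      return true
    }
  }
  return false
}

// keptByEveryQuorum checks all 3-voter quorums that can form from the four
// survivors of a 5-server ensemble (each quorum leaves one survivor out).
func keptByEveryQuorum(survivors []int, logged map[int]bool) bool {
  for skip := range survivors {
    var voters []int
    for i, id := range survivors {
      if i != skip {
        voters = append(voters, id)
      }
    }
    if !keptBy(voters, logged) {
      return false
    }
  }
  return true
}

func main() {
  survivors := []int{2, 3, 4, 5} // server 1 was the leader and has crashed

  // Quorum reached: the leader plus servers 2 and 3 had logged the txn.
  fmt.Println(keptByEveryQuorum(survivors, map[int]bool{2: true, 3: true})) // true

  // No quorum: only server 2 had logged it; a vote among 3, 4, 5 drops it.
  fmt.Println(keptByEveryQuorum(survivors, map[int]bool{2: true})) // false
}
```

The first case is the safety guarantee: once a quorum has logged a transaction, every possible voting quorum contains at least one server that holds it.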

In practice

When to use it — and what you give up

When it's the right tool

Coordination services — leader election, distributed locks, configuration metadata, service discovery. This is exactly what ZooKeeper is for.
Strong FIFO ordering of writes is a hard requirement — every replica must see the same operations in the same order.
You're operating an existing ZooKeeper-based stack — Hadoop, HBase, Solr, Kafka (pre-KRaft), Druid. ZAB is what you already have running.

When to reach for something else

Building a new replicated state machine from scratch in 2026 — pick Raft. Same shape, broader ecosystem, less ZooKeeper-specific machinery.
Atomic commit across services — that's 2PC, not ZAB.
Byzantine fault tolerance — ZAB assumes crash failures only. Use PBFT or descendants.
Multi-leader / leaderless writes — ZAB is strictly single-leader. Look at EPaxos, Cassandra-style LWT, or CRDT-based approaches.

Pros

Strict FIFO total order of writes: every server applies the same writes in the same zxid order, which is easy to reason about.
Steady-state broadcast is minimal — Propose → ACK → Commit, one RTT per write.
Battle-tested for nearly two decades in ZooKeeper deployments at hyperscale (Yahoo, Twitter, LinkedIn).
Crisp election rule — highest zxid wins, no ambiguity, no log-completeness check across multiple criteria.
Synchronization upfront simplifies steady-state protocol — followers' logs are identical at broadcast time.

Cons

Slower leader recovery than Raft — the sync phase pays upfront for what Raft amortises across heartbeats.
Tightly coupled to ZooKeeper's needs — it isn't a general consensus library you can drop into your own service.
Single-leader bottleneck — same as Multi-Paxos and Raft.
Less library tooling than Raft outside the ZooKeeper codebase — re-implementing ZAB cleanly is rare.
Documentation gap — academic papers describe ZAB precisely, but the production ZooKeeper code has accumulated divergences worth knowing.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

zab.go

// Skeleton of ZAB's three modes: discovery, synchronization, broadcast.
package zab

import (
  "errors"
  "sync"
)

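// Zxid orders transactions by epoch first, then counter. ZooKeeper packs two
// 32-bit halves into one 64-bit value; plain ints keep this sketch readable.
type Zxid struct {
  Epoch   int
  Counter int
}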

type LogEntry struct {
  Zxid      Zxid
  Txn       string
  Committed bool
}

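// ZabServer is the peer interface a leader talks to.
type ZabServer interface {
  NewEpoch(epoch int) error
  LastZxid() (Zxid, error)
  ApplySync(missing []LogEntry) error
  Propose(zxid Zxid, txn string) error
  Commit(zxid Zxid) error
}

// The helpers below are minimal stand-ins so the skeleton compiles. They
// assume peers excludes the leader itself, which counts as one implicit vote.

// majority returns how many of nPeers followers must agree so that, together
// with the leader, they form a quorum of the nPeers+1 ensemble.
func majority(nPeers int) int {
  return (nPeers + 1) / 2
}

// maxAdvertisedEpoch returns the highest epoch among the peers that answer,
// and fails if too few answer to form a quorum.
func maxAdvertisedEpoch(peers []ZabServer) (int, error) {
  highest, answered := 0, 0
  for _, p := range peers {
    z, err := p.LastZxid()
    if err != nil {
      continue // an unreachable peer simply does not vote
    }
    answered++
    if z.Epoch > highest {
      highest = z.Epoch
    }
  }
  if answered < majority(len(peers)) {
    return 0, errors.New("no quorum")
  }
  return highest, nil
}

// collectAcks calls send on every peer in parallel and returns the peers
// for which it succeeded.
func collectAcks(peers []ZabServer, send func(ZabServer) error) []ZabServer {
  var (
    mu   sync.Mutex
    wg   sync.WaitGroup
    acks []ZabServer
  )
  for _, p := range peers {
    wg.Add(1)
    go func(p ZabServer) {
      defer wg.Done()
      if send(p) == nil {
        mu.Lock()
        acks = append(acks, p)
        mu.Unlock()
      }
    }(p)
  }
  wg.Wait()
  return acks
}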

type ZabLeader struct {
  epoch   int
  counter int
  log     []LogEntry
  peers   []ZabServer
  myID    int
}

func NewZabLeader(peers []ZabServer, myID int) *ZabLeader {
  return &ZabLeader{epoch: 0, peers: peers, myID: myID}
}

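// --- Discovery: claim a fresh epoch ---
// Succeeds once a quorum of peers has accepted the new epoch; a minority of
// unreachable peers must not block leadership.
func (l *ZabLeader) Discovery() error {
  highest, err := maxAdvertisedEpoch(l.peers)
  if err != nil {
    return err
  }
  if l.epoch > highest {
    highest = l.epoch // never reuse an epoch this server has already seen
  }
  l.epoch = highest + 1
  l.counter = 0 // the counter restarts in every epoch
  epoch := l.epoch

  var wg sync.WaitGroup
  errs := make([]error, len(l.peers))
  for i, p := range l.peers {
    wg.Add(1)
    go func(i int, p ZabServer) {
      defer wg.Done()
      errs[i] = p.NewEpoch(epoch)
    }(i, p)
  }
  wg.Wait()

  accepted := 0
  for _, e := range errs {
    if e == nil {
      accepted++
    }
  }
  if accepted < majority(len(l.peers)) {
    return errors.New("no quorum for new epoch")
  }
  return nil
}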

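// --- Synchronization: ship missing txns, drop uncommitted strays ---
// Broadcast may begin once a quorum of peers has been synced.
func (l *ZabLeader) Synchronize() error {
  synced := 0
  for _, p := range l.peers {
    theirLast, err := p.LastZxid()
    if err != nil {
      continue // unreachable peer: it syncs when it rejoins
    }
    var missing []LogEntry
    for _, e := range l.log {
      if zxidGt(e.Zxid, theirLast) {
        missing = append(missing, e)
      }
    }
    // ApplySync is also where p is expected to drop any uncommitted
    // entries of its own that the leader's log does not contain.
    if err := p.ApplySync(missing); err != nil {
      continue
    }
    synced++
  }
  if synced < majority(len(l.peers)) {
    return errors.New("no quorum synced")
  }
  return nil
}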

// --- Broadcast: Propose -> Ack -> Commit ---
// Not safe for concurrent callers; the leader is assumed to serialize writes.
func (l *ZabLeader) Broadcast(txn string) error {
  l.counter++
  zxid := Zxid{Epoch: l.epoch, Counter: l.counter}
  l.log = append(l.log, LogEntry{Zxid: zxid, Txn: txn})
  idx := len(l.log) - 1

  acks := collectAcks(l.peers, func(p ZabServer) error {
    return p.Propose(zxid, txn)
  })
  if len(acks) < majority(len(l.peers)) {
    // The entry stays in the log uncommitted; a later sync decides its fate.
    return errors.New("no quorum")
  }

  // A quorum has logged the txn, so it is committed whether or not every
  // COMMIT message below is delivered.
  l.log[idx].Committed = true

  var wg sync.WaitGroup
  for _, p := range l.peers {
    wg.Add(1)
    go func(p ZabServer) {
      defer wg.Done()
      _ = p.Commit(zxid) // best effort; a peer that misses it catches up at the next sync
    }(p)
  }
  wg.Wait()
  return nil
}

func zxidGt(a, b Zxid) bool {
  return a.Epoch > b.Epoch || (a.Epoch == b.Epoch && a.Counter > b.Counter)
}

func zxidEq(a, b Zxid) bool {
  return a.Epoch == b.Epoch && a.Counter == b.Counter
}

Knowledge check

Did the prototype land?

question 01 / 03

What's the safety reason ZAB elects the server with the highest zxid?

question 02 / 03

Why does ZAB synchronize all followers up to the leader before broadcast begins, instead of patching lazily on each write like Raft does?

question 03 / 03

A ZAB transaction's zxid is built from two parts. What are they, and why?
