Advanced · 12 min read · live prototype

ZAB — ZooKeeper Atomic Broadcast

ZooKeeper's take on leader-based consensus. Elect by highest zxid, open a new epoch, sync followers up to the leader, then propose/ACK/commit in strict FIFO order.

Overview

What this concept solves

ZAB (ZooKeeper Atomic Broadcast) is the consensus protocol inside Apache ZooKeeper. Benjamin Reed and Flavio Junqueira first described it in 2008, and a 2011 paper with Marco Serafini gave the full specification. The job is slightly narrower than general consensus: ZAB delivers a total order of write transactions to a quorum-replicated service. Every server applies the same operations in the same order, which is exactly what ZooKeeper needs to run a hierarchical key-value store with strong consistency.

ZAB looks superficially like Multi-Paxos or Raft: there's a leader, a quorum, and a per-write propose/ack/commit cycle. What sets it apart is its synchronization phase: when a new leader is elected, before serving any new writes, it brings the logs of a quorum of followers into exact agreement with its own. That hard sync simplifies the steady-state protocol, because by the time normal broadcast begins every follower in that quorum is exactly caught up, and it's why ZooKeeper offers such crisp ordering guarantees.

ZAB's transactions carry a zxid: a 64-bit ID combining a 32-bit epoch in the high bits (incremented on every leader change) and a 32-bit counter in the low bits (incremented on every write, restarting in each new epoch). The election rule is dead simple: among a quorum of voters, the server with the highest zxid wins. Every committed transaction was logged by a quorum, and any two quorums share at least one server, so by ZAB's invariants the winner must already hold every committed transaction. The sync phase then ships that history to anyone behind. After sync, the leader streams Propose → ACK → Commit in strict FIFO order for the rest of its epoch; that's the "atomic broadcast" in the name.
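The zxid layout can be sketched in a few lines. This is an illustration, not ZooKeeper's code: `packZxid` and `unpackZxid` are hypothetical helper names, and BigInt is used only because JavaScript numbers lose integer precision past 2^53.

```typescript
// Hypothetical helpers: pack (epoch, counter) into one 64-bit value.
const MASK32 = BigInt(0xffffffff);
const SHIFT = BigInt(32);

function packZxid(epoch: number, counter: number): bigint {
  // Epoch goes in the high 32 bits, counter in the low 32 bits.
  return ((BigInt(epoch) & MASK32) << SHIFT) | (BigInt(counter) & MASK32);
}

function unpackZxid(zxid: bigint): { epoch: number; counter: number } {
  return { epoch: Number(zxid >> SHIFT), counter: Number(zxid & MASK32) };
}

// Because the epoch sits in the high bits, any zxid from a newer epoch
// compares greater than every zxid from an older one.
const lastOfEpoch1 = packZxid(1, 0xffffffff);
const firstOfEpoch2 = packZxid(2, 0);
console.log(firstOfEpoch2 > lastOfEpoch1); // true
```

Putting the epoch in the high bits is what lets a single integer comparison order transactions first by leader generation and then by position within that generation.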

Mechanics

How it works

Phase 1 — Leader election (by highest zxid)

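  1. Each server advertises its last logged zxid (epoch, counter), i.e., how much history it has accepted.
  2. Among a quorum of voters, the server with the highest zxid wins. Every committed transaction was ACKed by a quorum, and any two quorums overlap, so by ZAB's invariants the winner holds every transaction committed in earlier epochs and nothing committed can be lost.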
  3. Ties are broken deterministically (typically by server ID). Either way the result is unique.
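The election rule above can be sketched as a comparison over votes. This is a minimal illustration under stated assumptions: `Vote`, `compareVotes`, and `electLeader` are hypothetical names, and the tie-break direction (higher server ID wins) is an assumption; any fixed rule works as long as every server applies the same one.

```typescript
type Zxid = { epoch: number; counter: number };
type Vote = { serverId: number; lastZxid: Zxid };

// Positive when vote a should win over vote b.
function compareVotes(a: Vote, b: Vote): number {
  if (a.lastZxid.epoch !== b.lastZxid.epoch) return a.lastZxid.epoch - b.lastZxid.epoch;
  if (a.lastZxid.counter !== b.lastZxid.counter) return a.lastZxid.counter - b.lastZxid.counter;
  return a.serverId - b.serverId; // assumed tie-break: higher server ID wins
}

// Pick a leader from the votes received, requiring a quorum of the ensemble.
function electLeader(votes: Vote[], ensembleSize: number): number {
  const quorum = Math.floor(ensembleSize / 2) + 1;
  if (votes.length < quorum) throw new Error("no quorum of votes");
  return votes.reduce((best, v) => (compareVotes(v, best) > 0 ? v : best)).serverId;
}

// Three of five servers vote; server 2 has the most recent history.
const winner = electLeader(
  [
    { serverId: 1, lastZxid: { epoch: 3, counter: 7 } },
    { serverId: 2, lastZxid: { epoch: 3, counter: 9 } },
    { serverId: 3, lastZxid: { epoch: 2, counter: 40 } },
  ],
  5,
);
console.log(winner); // 2
```

Note that server 3's large counter does not help it: the epoch is compared first, so history from an older epoch always loses to history from a newer one.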

Phase 2 — Discovery and Synchronization

  1. New leader picks an epoch number higher than any seen and broadcasts NEW-EPOCH(e) to followers.
  2. Followers acknowledge with their last seen zxid. The leader collects replies from a quorum, then ships any missing transactions to each follower (SYNC) and has followers drop any stray uncommitted transactions that are absent from the leader's log.
  3. Once a quorum is fully synced, meaning the log of each follower in that quorum exactly matches the leader's, broadcast is allowed to begin.
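The per-follower work in the sync step can be sketched as a diff between two logs. This is a simplified illustration, not ZooKeeper's implementation: `planSync` is a hypothetical helper, logs are reduced to lists of zxids sorted in order, and it assumes that equal zxids always identify the same transaction (which holds when each epoch has a single leader).

```typescript
type Zxid = { epoch: number; counter: number };
type SyncPlan = { drop: Zxid[]; send: Zxid[] };

const sameZxid = (a: Zxid, b: Zxid) => a.epoch === b.epoch && a.counter === b.counter;

// Compare a follower's log with the leader's and decide what to fix.
function planSync(leaderLog: Zxid[], followerLog: Zxid[]): SyncPlan {
  // Length of the longest common prefix of the two logs.
  let common = 0;
  while (
    common < leaderLog.length &&
    common < followerLog.length &&
    sameZxid(leaderLog[common], followerLog[common])
  ) {
    common++;
  }
  return {
    drop: followerLog.slice(common), // strays the leader never logged
    send: leaderLog.slice(common),   // history the follower is missing
  };
}

const z = (epoch: number, counter: number): Zxid => ({ epoch, counter });
const leaderLog = [z(1, 1), z(1, 2), z(2, 1)];

// A follower that logged an uncommitted proposal from the old leader.
const divergent = planSync(leaderLog, [z(1, 1), z(1, 2), z(1, 3)]);
console.log(divergent.drop.length, divergent.send.length); // 1 1

// A follower that is simply behind.
const lagging = planSync(leaderLog, [z(1, 1)]);
console.log(lagging.drop.length, lagging.send.length); // 0 2
```

After a follower applies its plan, its log equals the leader's, which is the condition the quorum must reach before broadcast starts.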

Phase 3 — Broadcast (steady state)

  1. Client write hits the leader. Leader assigns the next zxid (epoch, counter+1) and sends PROPOSE(zxid, txn) to every follower.
  2. Each follower writes the txn to its log (tentative) and replies ACK.
  3. When a quorum of ACKs arrives, the leader sends COMMIT(zxid); all servers apply the txn to their state machine in zxid order.
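The follower's side of these three steps can be sketched as below. This is a hypothetical `FollowerSketch` class, not ZooKeeper's follower code, and it assumes the leader's messages arrive in the order they were sent (as they would over a single TCP connection), so a COMMIT always refers to the oldest pending proposal.

```typescript
type Zxid = { epoch: number; counter: number };

class FollowerSketch {
  // Proposals that are logged but not yet committed, in arrival order.
  private pending: { zxid: Zxid; txn: string }[] = [];
  // Stand-in for the state machine: the txns applied so far.
  applied: string[] = [];

  // PROPOSE: append to the log (tentatively) and acknowledge.
  onPropose(zxid: Zxid, txn: string): "ACK" {
    this.pending.push({ zxid, txn });
    return "ACK";
  }

  // COMMIT: must match the oldest pending proposal, then apply it.
  onCommit(zxid: Zxid): void {
    const head = this.pending[0];
    if (!head || head.zxid.epoch !== zxid.epoch || head.zxid.counter !== zxid.counter) {
      throw new Error("COMMIT out of order");
    }
    this.pending.shift();
    this.applied.push(head.txn);
  }
}

const follower = new FollowerSketch();
follower.onPropose({ epoch: 1, counter: 1 }, "create /a");
follower.onPropose({ epoch: 1, counter: 2 }, "set /a 1");
follower.onCommit({ epoch: 1, counter: 1 });
follower.onCommit({ epoch: 1, counter: 2 });
console.log(follower.applied); // "create /a" first, then "set /a 1"
```

Rejecting an out-of-order COMMIT is a design choice of this sketch: it makes a violation of the FIFO assumption fail loudly instead of silently reordering writes.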

Why ZAB chose synchronisation upfront

Raft lets a follower's log lag or diverge and repairs it lazily through the consistency check in AppendEntries; Multi-Paxos can leave gaps in individual log slots that a new leader fills later. ZAB instead front-loads the cost: bring a quorum up to date before broadcast begins. That keeps the per-write hot path simple (no log repair mixed into normal writes) and makes the ordering guarantee straightforward to reason about, at the cost of a more expensive leader change.

ZAB only delivers, in zxid order, what passed through the leader

ZAB provides FIFO total order — transactions from the leader appear in zxid order on every server. It does NOT make ZooKeeper a general-purpose state machine library: clients see operations strictly through ZooKeeper's API. The atomic-broadcast guarantee is what the service is built on, not what it exports directly.

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

About this simulation

Five ZooKeeper servers running ZAB. Pick a scenario — Elect by highest zxid, Sync followers, Broadcast a write (propose → ack → commit), or Crash the leader to trigger a new epoch. Free play queues writes and crashes; the log card shows only the last two messages.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Watch the election pick the highest-zxid server

Open Elect by highest zxid. Notice the small zxid epoch:count under each server; that's how much history each has logged. The protocol picks the one with the highest, because by ZAB's invariant it must already have every committed transaction. In this prototype, ties go to the lowest server ID (production ZooKeeper's fast leader election prefers the higher ID).

try 02

Watch synchronization fix every follower

Step through Sync followers. The new leader opens a higher epoch and ships any missing transactions to followers that are behind. By the end of sync, every live server's log is identical to the leader's — and only then is broadcast allowed to begin. Compare to Raft, which tolerates gaps and patches lazily.

try 03

Broadcast a write in one RTT

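Switch to Broadcast (or queue a command after sync). The leader sends PROPOSE, waits for a quorum of ACKs, then sends COMMIT. That is one round trip before the leader can commit, plus a one-way COMMIT message to the followers. Every server applies the txn in zxid order. The steady-state message pattern has the same shape as the hot path in Raft and Multi-Paxos.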

try 04

Free play — crash the leader

Open Free play. Queue a few writes, then crash the leader. Watch a new epoch open with the next-highest-zxid survivor, sync run, and broadcast resume. Try crashing the leader mid-broadcast (after Propose but before Commit). The new leader's sync will either ship the pending txn forward or drop it, depending on whether the new leader has it in its log: if a quorum had ACKed, the txn is guaranteed to survive; if not, it survives only when the elected server happens to have logged it.
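The mid-broadcast crash can be reasoned about with a small sketch. It is a simplification under stated assumptions: `pendingSurvives` is a hypothetical helper, every survivor is assumed to take part in the election, and ties keep the first server listed. What decides the outcome is whether the newly elected leader has the pending txn in its log; a quorum of ACKs guarantees that, while fewer ACKs leave it to which server wins.

```typescript
type Zxid = { epoch: number; counter: number };
type Server = { id: number; lastZxid: Zxid };

const newer = (a: Zxid, b: Zxid) =>
  a.epoch > b.epoch || (a.epoch === b.epoch && a.counter > b.counter);

// Does the pending txn survive when these survivors elect a new leader?
function pendingSurvives(survivors: Server[], pending: Zxid): boolean {
  // The survivor with the highest last zxid becomes the new leader.
  const leader = survivors.reduce((best, s) => (newer(s.lastZxid, best.lastZxid) ? s : best));
  // If the leader's log reaches the pending zxid, sync ships it to everyone;
  // otherwise sync drops it wherever it was logged.
  return !newer(pending, leader.lastZxid);
}

const pending: Zxid = { epoch: 1, counter: 5 };
const old: Zxid = { epoch: 1, counter: 4 };

// Five servers; leader 1 crashed. Followers 2 and 3 had ACKed (a quorum with the leader).
const quorumAcked = pendingSurvives(
  [{ id: 2, lastZxid: pending }, { id: 3, lastZxid: pending }, { id: 4, lastZxid: old }, { id: 5, lastZxid: old }],
  pending,
);

// No follower had logged the proposal before the crash.
const nobodyLogged = pendingSurvives(
  [{ id: 2, lastZxid: old }, { id: 3, lastZxid: old }, { id: 4, lastZxid: old }, { id: 5, lastZxid: old }],
  pending,
);

console.log(quorumAcked, nobodyLogged); // true false
```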

In practice

When to use it — and what you give up

When it's the right tool

  • Coordination services — leader election, distributed locks, configuration metadata, service discovery. This is exactly what ZooKeeper is for.
  • Strong FIFO ordering of writes is a hard requirement — every replica must see the same operations in the same order.
  • You're operating an existing ZooKeeper-based stack — Hadoop, HBase, Solr, Kafka (pre-KRaft), Druid. ZAB is what you already have running.

When to reach for something else

  • Building a new replicated state machine from scratch in 2026 — pick Raft. Same shape, broader ecosystem, less ZooKeeper-specific machinery.
  • Atomic commit across services — that's 2PC, not ZAB.
  • Byzantine fault tolerance — ZAB assumes crash failures only. Use PBFT or descendants.
  • Multi-leader / leaderless writes — ZAB is strictly single-leader. Look at EPaxos, Cassandra-style LWT, or CRDT-based approaches.

Pros

  • Strict FIFO total order of writes: every replica applies the same writes in the same order, which is easy to reason about.
  • Steady-state broadcast is minimal: Propose → ACK → Commit, with one round trip before the leader can commit each write.
  • Battle-tested for nearly two decades in ZooKeeper deployments at hyperscale (Yahoo, Twitter, LinkedIn).
  • Crisp election rule — highest zxid wins, no ambiguity, no log-completeness check across multiple criteria.
  • Synchronization upfront simplifies the steady-state protocol: the logs of a quorum of followers match the leader's when broadcast begins.

Cons

  • Slower leader recovery than Raft — the sync phase pays upfront for what Raft amortises across heartbeats.
  • Tightly coupled to ZooKeeper's needs — it isn't a general consensus library you can drop into your own service.
  • Single-leader bottleneck — same as Multi-Paxos and Raft.
  • Less library tooling than Raft outside the ZooKeeper codebase — re-implementing ZAB cleanly is rare.
  • Documentation gap — academic papers describe ZAB precisely, but the production ZooKeeper code has accumulated divergences worth knowing.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

zab.ts
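// Skeleton of ZAB's three modes: discovery, synchronization, broadcast.
// A simplified sketch: the ZabServer interface and helpers are illustrative,
// not ZooKeeper's real API. Calls wait for every peer to answer or fail; a
// production leader proceeds as soon as a quorum has replied.
type Zxid = { epoch: number; counter: number };
type Entry = { zxid: Zxid; txn: string; committed: boolean };

// What the leader needs from each follower (hypothetical RPC surface).
interface ZabServer {
  acceptedEpoch(): Promise<number>;
  newEpoch(epoch: number): Promise<void>;
  lastZxid(): Promise<Zxid>;
  applySync(entries: Entry[]): Promise<void>;
  propose(zxid: Zxid, txn: string): Promise<void>; // resolves when the follower ACKs
  commit(zxid: Zxid): Promise<void>;
}

function zxidGt(a: Zxid, b: Zxid) {
  return a.epoch > b.epoch || (a.epoch === b.epoch && a.counter > b.counter);
}
function zxidEq(a: Zxid, b: Zxid) { return a.epoch === b.epoch && a.counter === b.counter; }

// Smallest majority of an ensemble of n servers.
function majority(n: number) { return Math.floor(n / 2) + 1; }

// Run one call per peer; return the values from the peers that answered.
async function collect<T>(peers: ZabServer[], call: (p: ZabServer) => Promise<T>): Promise<T[]> {
  const results = await Promise.allSettled(peers.map(call));
  return results.flatMap(r => (r.status === "fulfilled" ? [r.value] : []));
}

class ZabLeader {
  epoch = 0;
  counter = 0;
  log: Entry[] = [];

  // peers excludes the leader, so the ensemble has peers.length + 1 servers.
  constructor(private peers: ZabServer[], readonly myId: number) {}

  // The leader counts itself toward the quorum.
  private requireQuorum(peerReplies: number) {
    if (peerReplies + 1 < majority(this.peers.length + 1)) throw new Error("no quorum");
  }

  // --- Discovery: claim a fresh epoch ---
  async discovery() {
    const seen = await collect(this.peers, p => p.acceptedEpoch());
    this.requireQuorum(seen.length);
    this.epoch = Math.max(this.epoch, ...seen) + 1;
    this.counter = 0; // the counter restarts in every epoch
    const acks = await collect(this.peers, p => p.newEpoch(this.epoch));
    this.requireQuorum(acks.length);
  }

  // --- Synchronization: ship missing txns, drop uncommitted strays ---
  async synchronize() {
    const synced = await collect(this.peers, async p => {
      const theirLast = await p.lastZxid();
      const missing = this.log.filter(e => zxidGt(e.zxid, theirLast));
      await p.applySync(missing);
      // any uncommitted entries on p that we don't have get dropped on p
    });
    this.requireQuorum(synced.length);
  }

  // --- Broadcast: Propose -> Ack -> Commit ---
  async broadcast(txn: string) {
    this.counter++;
    const zxid: Zxid = { epoch: this.epoch, counter: this.counter };
    const entry: Entry = { zxid, txn, committed: false };
    this.log.push(entry);

    const acks = await collect(this.peers, p => p.propose(zxid, txn));
    this.requireQuorum(acks.length);

    entry.committed = true;
    await collect(this.peers, p => p.commit(zxid)); // unreachable peers catch up at the next sync
  }
}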

Knowledge check

Did the prototype land?

Quick questions, answers revealed on submit. No scoring saved.

question 01 / 03

What's the safety reason ZAB elects the server with the highest zxid?

question 02 / 03

Why does ZAB synchronize all followers up to the leader *before* broadcast begins, instead of patching lazily on each write like Raft does?

question 03 / 03

A ZAB transaction's zxid is built from two parts. What are they, and why?
