Advanced · 12 min read · live prototype

ZooKeeper Election

Every node creates an ephemeral sequential znode; the smallest sequence is leader. Each follower watches only its predecessor — no herd effect.

Overview

What this concept solves

ZooKeeper leader election is the textbook recipe baked into the ZooKeeper data model itself. Every contender creates an ephemeral sequential znode under /election. The znode names look like node-0000000001, node-0000000002, ... — ZooKeeper assigns the sequence number atomically on create. The rule is brutally simple: the process whose znode has the smallest sequence number is the leader. Everyone else is a follower.

Two pieces of ZooKeeper machinery make this election cheap and safe. Ephemeral znodes disappear automatically when the owning client's session ends — crash, network gap, voluntary close, doesn't matter. So a dead leader's znode just vanishes, no separate liveness check needed. Watches let each follower be notified when a specific znode changes. The trick — published as a recipe by the ZooKeeper team — is that each follower watches only the znode immediately before it in the sorted list, not the leader. This single rule eliminates the herd effect: when one znode disappears, exactly one follower wakes up, not the entire cluster.

When the leader dies, only the process with the second-smallest znode is notified. It re-reads /election, sees that no smaller znode exists, and declares itself leader. When a mid-chain follower dies, only the node after it is notified. After re-reading, that node still sees a smaller znode, so it is not leader yet; it simply re-points its watch to its new predecessor. No election happens and the leader does not change. The watcher-chain recipe is a good example of how ZooKeeper's small primitive set composes into bigger guarantees.

Mechanics

How it works

The recipe — "Lock & Leader Election" from the ZooKeeper docs

  1. Each contender connects to the ZK ensemble and gets a unique session.
  2. It creates /election/node- with the EPHEMERAL_SEQUENTIAL flag. ZK returns the full name with an atomic sequence number, e.g. /election/node-0000000003.
  3. It lists /election's children and sorts by sequence number.
  4. If its own znode has the smallest sequence: it is the leader. Take leadership and start work.
  5. If not: identify the znode immediately before its own. Set a watch on exactly that one znode. Sleep.
  6. When the watched znode is deleted, the watch fires. Repeat steps 3–5.
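Steps 3 to 5 reduce to a pure decision over the child list. The sketch below is illustrative only: the names `Role`, `sequenceOf`, and `decideRole` are hypothetical and not part of any ZooKeeper client, and it assumes child names end in the 10-digit zero-padded sequence suffix shown above. It sorts explicitly because a child listing should not be assumed to arrive in sequence order.

```typescript
// Hypothetical helper names; only the znode naming scheme comes from the recipe.
type Role =
  | { kind: "leader" }
  | { kind: "follower"; watch: string };

/** Extract the 10-digit sequence suffix appended on create. */
function sequenceOf(name: string): number {
  const match = /(\d{10})$/.exec(name);
  if (match === null) throw new Error(`no sequence suffix in ${name}`);
  return Number(match[1]);
}

/** Steps 3-5: sort the children, then either lead or pick the predecessor to watch. */
function decideRole(children: string[], mine: string): Role {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = children.slice().sort((a, b) => sequenceOf(a) - sequenceOf(b));
  const idx = sorted.indexOf(mine);
  if (idx === -1) throw new Error(`${mine} is not among the children`);
  if (idx === 0) return { kind: "leader" };
  // Watch exactly one znode: the one immediately before ours.
  return { kind: "follower", watch: sorted[idx - 1] };
}
```

For example, with children node-0000000003, node-0000000001, and node-0000000002 (in that unsorted order), the owner of node-0000000003 is a follower that watches node-0000000002.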

Why "watch only your predecessor" matters

  • Herd avoidance — if everyone watched the leader instead, all n-1 followers would wake up simultaneously on a leader death and stampede ZK with re-reads. The predecessor-watch sends exactly one notification per death.
  • Locality — each follower's reaction is fast (one watch fire, one read, one compare) and bounded.
  • Healing without elections — a non-leader death triggers only its successor, which re-watches its new predecessor and goes back to sleep. The leader keeps leading, no election runs.
  • Fence token built in: the creation zxid (czxid, a ZooKeeper transaction id) of the leader's znode is a monotonically increasing token the new leader can carry on subsequent writes.
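The herd-avoidance point can be made concrete with a toy in-memory model. This is not ZooKeeper client code: members are represented only by their sequence numbers, and both function names are hypothetical. It compares who gets notified under "everyone watches the leader" versus "everyone watches their predecessor".

```typescript
// Toy model; members are identified by sequence number. Names are illustrative.

/** Members notified when `dead` disappears, if every follower watches the leader. */
function notifiedWatchingLeader(members: number[], dead: number): number[] {
  const sorted = members.slice().sort((a, b) => a - b);
  // Only the leader's znode carries watches, so only its deletion fires any.
  return dead === sorted[0] ? sorted.slice(1) : [];
}

/** Members notified when `dead` disappears, if each member watches its predecessor. */
function notifiedWatchingPredecessor(members: number[], dead: number): number[] {
  const sorted = members.slice().sort((a, b) => a - b);
  const idx = sorted.indexOf(dead);
  if (idx === -1) return [];
  // At most one watcher: the immediate successor, if there is one.
  return sorted.slice(idx + 1, idx + 2);
}
```

With members 1 to 5, the leader's death notifies four members in the first scheme and exactly one (member 2) in the second. The death of the largest member notifies no one in either scheme.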

Why ephemeral?

An ephemeral znode is tied to the creating client's session. When the session ends (crash, partition lasting past the session timeout, or voluntary close), ZooKeeper itself deletes the znode. There is no application code to write for "detect that I died." The session timeout plays the role of a lease TTL. By default the server bounds it between 2 × tickTime and 20 × tickTime, which is 4–40 s with the commonly used tickTime of 2000 ms. Choose it as you would a lease TTL: shorter means faster failover but more sensitivity to brief pauses.
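The timeout bounds are simple arithmetic. The sketch below assumes the server's default limits of 2 × tickTime and 20 × tickTime (both limits can be overridden in server configuration); the function name is hypothetical.

```typescript
/**
 * Session timeout a server would grant for a requested value, assuming the
 * default bounds of 2 * tickTime (minimum) and 20 * tickTime (maximum).
 * All values are in milliseconds.
 */
function negotiatedSessionTimeout(requestedMs: number, tickTimeMs: number): number {
  const minMs = 2 * tickTimeMs;
  const maxMs = 20 * tickTimeMs;
  // Clamp the request into [minMs, maxMs].
  return Math.min(Math.max(requestedMs, minMs), maxMs);
}
```

With a tickTime of 2000 ms, a request for 1 s is raised to 4 s and a request for 120 s is capped at 40 s, so sub-second failover is not reachable by shrinking the client's requested timeout alone.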

The "is my znode still there?" reconnect check

After a network blip, a client might reconnect to a different ZK server. The session may have been preserved (good) or expired (bad). When the client reconnects, it should verify that it still owns its znode before assuming it is still in the election, and certainly before assuming it is still leader. Higher-level client libraries such as Apache Curator (LeaderLatch, LeaderSelector) handle this for you; if you implement the recipe yourself, this is the single most common bug.
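The decision after a reconnect is small enough to write as a pure function. This is a sketch with hypothetical names (`SessionState`, `NextStep`, `afterReconnect`); it assumes the caller has already learned the session state from its client library and has checked whether its znode still exists.

```typescript
// Hypothetical types; real client libraries report session state in their own way.
type SessionState = "preserved" | "expired";
type NextStep = "recheck-position" | "rejoin-election";

/** Decide what to do once the client is connected again. */
function afterReconnect(session: SessionState, myZnodeExists: boolean): NextStep {
  // Expired session: the ephemeral znode has been deleted, so we are out of
  // the election and must create a fresh znode.
  if (session === "expired" || !myZnodeExists) return "rejoin-election";
  // Session preserved and znode present: still a contender, but leadership
  // must be re-derived from a fresh child listing, never assumed.
  return "recheck-position";
}
```

Neither outcome is "carry on as leader": even in the good case the process goes back through the list-and-compare step first.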

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

About this simulation

Five processes electing a leader through ephemeral sequential znodes under /election. Pick a scenario — Join & elect (each process creates its znode, smallest wins), Leader dies, no herd (the leader's session expires; only its single watcher is notified), or Mid-chain death (a follower dies and its watcher re-points to the new predecessor; leader is unaffected). Free play lets you kill any node and watch the watch-chain reshuffle; the log card below holds only the last two messages.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Walk Join & elect

Open Join & elect and step through. P1 connects and creates /election/node-0000000001. It has the smallest sequence, so it is leader. P2 creates node-0000000002, sees a smaller predecessor, and watches node-0000000001. P3..P5 do the same, each watching exactly the one before. End state: one leader and four watchers, each pointing at a different predecessor. Notice that joining takes a constant number of ZooKeeper calls: one create, one list, one watch.

try 02

Walk Leader dies, no herd

Run Leader dies, no herd. P1's session ends; ZK deletes node-0000000001. Only P2's watch fires (P3..P5 were watching different znodes). P2 re-lists /election, sees no smaller znode, and declares itself leader. P3..P5 are entirely unbothered — they didn't even wake up. That single-notification behaviour is the whole reason the predecessor-watch recipe exists.

try 03

Walk Mid-chain death

Run Mid-chain death. P3 dies; ZK deletes node-0000000003. Only P4's watch fires. P4 re-lists and finds that a smaller znode (node-0000000002) still exists, so it is not leader. It re-points its watch to node-0000000002, its new predecessor, and goes back to sleep. No election. No leader change. The chain heals with one re-list and one new watch, and no other process is involved.
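The re-pointing in this scenario can be checked with a toy model of the watch chain. This is a sketch, not client code: members are bare sequence numbers and `watchChain` is a hypothetical name.

```typescript
/** For each non-leader, the member whose znode it watches, as [watcher, watched] pairs. */
function watchChain(members: number[]): Array<[number, number]> {
  const sorted = members.slice().sort((a, b) => a - b);
  const pairs: Array<[number, number]> = [];
  let previous: number | null = null;
  for (const member of sorted) {
    // The smallest member has no predecessor: it is the leader and watches nothing.
    if (previous !== null) pairs.push([member, previous]);
    previous = member;
  }
  return pairs;
}
```

Comparing `watchChain([1, 2, 3, 4, 5])` with `watchChain([1, 2, 4, 5])` shows that removing member 3 changes exactly one pair: member 4 now watches 2 instead of 3. The pairs for members 2 and 5 are unchanged.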

try 04

Free play — break it yourself

Open Free play. Crash any process and watch which single follower is notified. Try crashing the second-smallest before the leader: nothing happens to the leader, and only the next process in line re-points its watch. Try crashing P5 (the largest): no one is watching it, so no watch fires at all. Try crashing two non-adjacent followers at once: two independent watch fires, two re-points, no election. The lazy locality is the appeal.

In practice

When to use it — and what you give up

When it's the right tool

  • You already run ZooKeeper: Kafka pre-KRaft, HBase, Solr, HDFS, Hive, the classic Hadoop stack. Election against the existing ensemble needs no new infrastructure.
  • You want true herd-free notification — the watch-chain recipe is the cleanest answer; a single death notifies exactly one node.
  • Mass elections across many groups — many independent /election/<group-id>/ paths can coexist; one ZK ensemble can host hundreds of leadership groups.
  • You want a fence token without thinking about it — the zxid is monotonically increasing across all operations in the ensemble.
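How a downstream resource might use such a fence token is sketched below. `FencedStore` is a hypothetical class, not part of ZooKeeper. Tokens are modeled as `number` for brevity; real zxids are 64-bit values, so production code would need a wider integer type.

```typescript
/** Toy downstream resource that rejects writes carrying a stale fencing token. */
class FencedStore {
  private highestToken = -1;
  private value: string | null = null;

  /** Returns true if the write was accepted, false if the token was stale. */
  write(token: number, value: string): boolean {
    // A token lower than one already seen must come from a deposed leader.
    if (token < this.highestToken) return false;
    this.highestToken = token;
    this.value = value;
    return true;
  }

  read(): string | null {
    return this.value;
  }
}
```

Equal tokens are accepted so the current leader can write repeatedly. Once a newer leader has written with a higher token, a paused old leader that wakes up and retries is refused.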

When to reach for something else

  • You don't run ZooKeeper — bootstrapping ZK for leader election is overkill. Use etcd lease or run Raft directly.
  • Sub-second failover — session timeout has to outlive a brief pause, so failover is bounded by the session timeout (typically several seconds). Raft heartbeats are faster.
  • Modern Kafka deployments: KRaft mode (production-ready since Kafka 3.3) replaces ZooKeeper with an internal Raft quorum. New projects rarely choose ZK in 2026.
  • Multi-region setups — ZK's quorum and session model don't love high-latency WANs. Per-region leaders or a region-aware design beats one global ensemble.

Pros

  • Herd-free by construction — only one follower wakes up on any single death.
  • Leader liveness handled by ZK — ephemeral znode + session timeout = no separate liveness protocol.
  • Built-in fence token — every znode operation has a monotonically increasing zxid.
  • Composes — the same ensemble hosts many independent election groups, distributed locks, config storage, and more.
  • Battle-tested recipe: published in the ZooKeeper docs (the "Leader Election" entry on the recipes page); used by Hadoop, Solr, HBase, classic Kafka.

Cons

  • You need ZooKeeper — running it just for election is heavy compared to a single Raft library inside your service.
  • Failover is bounded by session timeout — typically multiple seconds; not sub-second.
  • ZK clients are easy to misuse: session lifecycle handling and reconnect logic are common sources of bugs. Use Curator.
  • ZK ensemble operations are a separate skill — observers, dynamic reconfig, snapshot management, transaction-log compaction — a whole second stack to operate.
  • Newer alternatives — etcd lease election is functionally equivalent with a simpler client model and gRPC.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

zk-election.ts
// ZooKeeper leader election (predecessor-watch recipe).
// The ZooKeeper client API is abstracted; in production, use Apache Curator.
type ZkPath = string;
interface ZkClient {
  createEphemeralSequential(parent: ZkPath, data: Uint8Array): Promise<ZkPath>;
  /** Returns the full paths of every child, sorted by sequence number. */
  getChildrenSorted(parent: ZkPath): Promise<ZkPath[]>;
  /**
   * Atomically checks existence and sets a one-shot watch. Resolves when the
   * node is deleted, or immediately if it is already gone. Doing both in one
   * call avoids missing a deletion between a separate check and the watch.
   */
  watchNodeDeletion(path: ZkPath): Promise<void>;
  /** Deletes the node at the given path. */
  delete(path: ZkPath): Promise<void>;
}

class ZkLeader {
  private myZnode: ZkPath | null = null;

  constructor(private readonly zk: ZkClient, private readonly base = "/election") {}

  /** Join the election and block until we become leader. */
  async leadOrBlock(): Promise<void> {
    const mine = await this.zk.createEphemeralSequential(this.base, new Uint8Array());
    this.myZnode = mine;

    // Loop: re-check who's smallest after each predecessor death.
    while (true) {
      const children = await this.zk.getChildrenSorted(this.base);
      const myIdx = children.indexOf(mine);
      // Our znode is gone: the session expired, or resign() was called.
      if (myIdx === -1) throw new Error("my znode vanished (session expired or resigned)");
      if (myIdx === 0) return;                          // smallest → leader

      // Watch only the predecessor (herd avoidance).
      const predecessor = children[myIdx - 1];
      await this.zk.watchNodeDeletion(predecessor);     // resolves on delete
      // The predecessor is gone. Loop back, re-check, possibly re-watch.
    }
  }

  /** Voluntary step-down. */
  async resign(): Promise<void> {
    const mine = this.myZnode;
    if (mine === null) return;                          // never joined, or already resigned
    this.myZnode = null;
    // Deleting the ephemeral znode triggers the watch on our successor.
    await this.zk.delete(mine);
    // (Equivalent: close the session and ZK deletes it for us.)
  }
}

Knowledge check

Did the prototype land?

question 01 / 03

In the ZooKeeper election recipe, which process is the leader?

question 02 / 03

Why does each follower watch only its predecessor's znode instead of watching the leader's znode?

question 03 / 03

A non-leader process dies in a five-node ZK election. What happens?
