Leader Election
How a cluster picks one node to be in charge — and how it picks again when that node falls over. Five algorithms, from the textbook ID-based shouting matches to the lease-and-watcher schemes real coordination services ship.
What is Leader Election?
The 60-second primer
Leader election is the trick of getting a group of equal machines to pick one of themselves to be in charge — without a referee. Every replicated system needs it. Somebody has to own the write path, advance the schema, run the cron job, hold the lock. The algorithms here are the ways a cluster nominates that somebody, and — far more interestingly — how it re-nominates a replacement when the current leader goes silent.
All five solve the same shape of problem, but they make wildly different deals. Bully and Ring are the textbook algorithms — pure ID comparisons over fully connected or ring-shaped networks. Raft-style uses random timers and per-term majority votes; the ID does not decide the winner, timing and quorum do. Lease-based offloads the entire problem to a single mutable lock with a TTL — whoever holds it leads, whoever stops renewing loses. ZooKeeper's ephemeral-sequential scheme combines lease and watcher semantics into a herd-free, in-tree solution.
Open any production system that needs a single coordinator — etcd, Kubernetes controllers, Kafka KRaft, ZooKeeper, Patroni, Redis Sentinel, HashiCorp Consul, Spark standalone, ClickHouse Keeper — and one of these patterns is under the hood. The choice you make decides how fast failover is, how badly a network partition hurts you, and whether two nodes can ever briefly think they are both leader.
Where this shows up
- Replicated databases: Postgres-HA via Patroni elects a primary; Redis Sentinel/Cluster elects a master per shard; MySQL Group Replication elects a writer.
- Coordination services: etcd and Consul run Raft elections internally; ZooKeeper elects its own ensemble leader with ZAB's Fast Leader Election and gives clients ephemeral znodes for running elections of their own.
- Cron-style singletons: Kubernetes controllers and scheduled jobs that must run on exactly one node use lease-based election against the API server; Spark standalone masters use ZooKeeper for the same job.
- Stream processing: Kafka KRaft elects an active controller via Raft; the active controller then assigns partition leaders and records them in the metadata log.
- Service discovery and locks: "is this the canary?", "who owns this shard?", "who runs the compaction?" Every answer flows out of a leader election somewhere.
- Embedded routing protocols: OSPF designated routers, IS-IS DIS election. A tiny leader election lives inside every link-state network.
Split-brain is the failure to avoid
The whole point of leader election is at most one leader at a time. If two nodes both believe they are leader, both will accept writes and the cluster's invariants break. The production schemes here each prevent it with a different mechanism: quorum and monotonic terms (Raft), lease expiry (lease-based), ephemeral sessions (ZooKeeper). Textbook Bully and Ring have no such guard. Understanding which mechanism stops split-brain is the most important thing you can learn about leader election.
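To make the quorum mechanism concrete, here is a minimal sketch of per-term voting. The `Voter` class and the `wins` helper are hypothetical names for illustration, not any library's API, and the sketch leaves out the log up-to-date check that real Raft also applies. The point it shows: each node grants at most one vote per term, so two candidates can never both reach a majority in the same term.

```python
from dataclasses import dataclass
from typing import Optional


def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1


@dataclass
class Voter:
    """Hypothetical voter: remembers the latest term and who it voted for."""
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, term: int, candidate: str) -> bool:
        if term < self.current_term:
            return False                 # stale candidate from an old term
        if term > self.current_term:
            self.current_term = term     # new term: forget the old vote
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate   # at most one vote per term
            return True
        return False


def wins(voters: list[Voter], term: int, candidate: str) -> bool:
    """Ask every voter; the candidate leads the term only with a majority."""
    votes = sum(v.request_vote(term, candidate) for v in voters)
    return votes >= majority(len(voters))


voters = [Voter() for _ in range(5)]
first = wins(voters, term=1, candidate="a")    # "a" collects all five votes
second = wins(voters, term=1, candidate="b")   # same term: every vote is spent
third = wins(voters, term=2, candidate="b")    # a new term frees the votes
```

A rival can only win by moving to a higher term, which is exactly the monotonic number that later serves as a fence token.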
Side-by-side
How they compare
The same concepts, on the same axes. Use this as a map; the individual pages are the territory.
| Algorithm | Picks by | Network shape | Failover trigger | Best for |
|---|---|---|---|---|
| 01 Bully | Highest ID wins | Fully connected, every node knows the others | Any node notices leader silent → starts election | Small static clusters; teaching the textbook algorithm. |
| 02 Ring | Highest ID collected on a clockwise pass | Logical ring (each node knows its successor) | Neighbour notices leader silent → starts token | Topologies that are already a ring; protocols with constrained connectivity. |
| 03 Raft-style | Per-term majority vote (timing breaks ties) | Any-to-any; needs 2f+1 to tolerate f failures | Election timer (150–300 ms) fires on a follower | Replicated state machines; the default election scheme of 2026 (etcd, Consul, CockroachDB). |
| 04 Lease-based | First to grab a free lock + keeps renewing | Shared store everyone can talk to (etcd, Redis, DB) | Lease TTL drains to zero with no renewal | Kubernetes-style singleton controllers; piggybacking on an existing strongly-consistent store. |
| 05 ZooKeeper | Smallest sequence number in /election | All nodes connected to a ZK ensemble | Ephemeral znode auto-deletes on session loss | Systems already running ZooKeeper; massive clusters where the no-herd watcher chain matters. |
Decision guide
Which one should you use?
A practical tour of when each algorithm wins.
How to pick
- You already run a Raft/etcd/Consul cluster? Use lease-based election against it. You get a battle-tested store, a fence token (the lease revision), and zero new infrastructure. This is the Kubernetes pattern.
- You're building a replicated state machine from scratch? Use Raft-style election as part of the consensus protocol — it pairs naturally with the log-replication state and gets you the term number you need anyway.
- You already run ZooKeeper? Use the ephemeral-sequential znode recipe. Watching only your predecessor is the textbook way to avoid an election storm when one node dies.
- Small static cluster, simple shape, no shared store? Bully or Ring still work fine for tens of nodes and are easy to implement. Just be honest about the message cost at failover: O(n²) in the worst case for Bully, O(n) for one pass around the ring.
- You want guaranteed at-most-one leader under any network partition? Pick the algorithm whose safety rests on a quorum (Raft) or an external lock with a fence token (lease), not on timing alone — the textbook Bully and Ring algorithms can elect two leaders during certain partitions.
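The lease-based option in the list above can be sketched in a few lines. This is a single-process model under stated assumptions: `LeaseStore`, `Lease`, and `try_acquire` are hypothetical names, and a real deployment would perform the check-and-set atomically inside the shared store (a transaction or compare-and-swap) rather than in local memory. Time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float
    revision: int


class LeaseStore:
    """Hypothetical in-memory stand-in for a strongly consistent store."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.lease: Optional[Lease] = None
        self.revision = 0

    def try_acquire(self, node: str, now: float) -> Optional[int]:
        """Return the lease revision if `node` is leader at `now`, else None."""
        lease = self.lease
        if lease is not None and now < lease.expires_at:
            if lease.holder != node:
                return None                     # someone else holds a live lease
            lease.expires_at = now + self.ttl   # renewal keeps the same revision
            return lease.revision
        self.revision += 1                      # free or expired: new revision
        self.lease = Lease(node, now + self.ttl, self.revision)
        return self.revision


store = LeaseStore(ttl=10.0)
a_first = store.try_acquire("a", now=0.0)    # lease is free, "a" leads
b_early = store.try_acquire("b", now=5.0)    # still held by "a"
a_renew = store.try_acquire("a", now=8.0)    # renewal extends expiry to 18.0
b_late = store.try_acquire("b", now=20.0)    # "a" stopped renewing, "b" takes over
```

The revision only changes when leadership changes hands, which is what makes it usable as a fence token.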
Fence tokens, not heartbeats
Heartbeats tell you the old leader was probably alive recently. They don't stop it from doing damage right now if it's just been slow. The defence is a monotonically increasing fence token issued at election time — every write the leader does carries the token, and the store rejects writes with stale tokens. Raft uses the term; lease-based uses the lease revision; ZooKeeper uses the zxid. Without a fence token, a paused-then-resumed old leader can still corrupt state after the new one has been elected. Always design for that case.
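A minimal sketch of the store-side check, assuming a hypothetical `FencedStore` class (no real storage system's API is implied): the store remembers the highest token it has seen and refuses anything older.

```python
class FencedStore:
    """Hypothetical store that rejects writes carrying a stale fence token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False              # writer was superseded; refuse the write
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
old_ok = store.write(1, "config", "from old leader")      # accepted
new_ok = store.write(2, "config", "from new leader")      # accepted, raises the bar
stale_ok = store.write(1, "config", "old leader resumes")  # rejected
```

The old leader never has to learn that it was replaced. The rejection happens at the store, which is the only place that can see both writers.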
Concepts in this track
5 concepts, in order
Each links to a concept page with its own explanation, prototype, and quiz.
Bully Algorithm
Whoever has the highest ID wins. When the leader dies, anyone can call an election — and higher IDs bully lower IDs out of contention.
Ring (Chang–Roberts)
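Pass an election message clockwise around a ring. In the list-collecting variant each node appends its ID, and the highest ID on the list when it returns is the leader. Chang–Roberts proper forwards only the largest ID seen so far and elects the node whose own ID comes back to it.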
Raft-Style Election
Random election timers + per-term majority votes. The first follower to time out runs for election; whoever gets a quorum leads the term.
Lease-Based Election
A single TTL-bounded lock in a shared store. Hold it and you're leader; stop renewing and someone else takes over when it expires.
ZooKeeper Election
Every node creates an ephemeral sequential znode; the smallest sequence is leader. Each follower watches only its predecessor — no herd effect.
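The predecessor-watch rule in the ZooKeeper recipe can be modelled without a ZooKeeper client at all. The sketch below is plain Python over a dictionary of node name to sequence number; `election_view` is a hypothetical helper, and deleting a dictionary entry stands in for an ephemeral znode disappearing on session loss.

```python
def election_view(sequences: dict[str, int]) -> tuple[str, dict[str, str]]:
    """Return (leader, watches): the lowest sequence number leads, and every
    other node watches the node with the next-lower sequence number."""
    if not sequences:
        raise ValueError("no candidates registered")
    ordered = sorted(sequences, key=lambda node: sequences[node])
    watches = {node: ordered[i - 1] for i, node in enumerate(ordered) if i > 0}
    return ordered[0], watches


nodes = {"a": 3, "b": 1, "c": 2}
leader, watches = election_view(nodes)            # "b" leads; c watches b, a watches c

del nodes["b"]                                    # leader's session is lost
new_leader, new_watches = election_view(nodes)    # "c" leads; a still watches c
```

When the leader disappears, only the node that was watching it has to react. Node "a" keeps watching "c" throughout, which is why the scheme avoids a herd of simultaneous wake-ups.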
Related tracks
If this one clicks, try these next.
Consistent Hashing
Map keys to servers so that adding or removing a server moves as few keys as possible. Five methods, from the classic hash ring to the table-based hashing inside modern network load balancers.
Circuit Breaker
When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.
Consensus
How a cluster agrees on a single answer when nodes die, packets drop, and some machines may even lie. Seven algorithms, from the two-phase commit that everyone learns first to the Byzantine-fault-tolerant PBFT.