Intermediate

Leader Election

How a cluster picks one node to be in charge — and how it picks again when that node falls over. Five algorithms, from the textbook ID-based shouting matches to the lease-and-watcher schemes real coordination services ship.

distributed · coordination · patterns

What is Leader Election?

The 60-second primer

Leader election is the trick of getting a group of equal machines to pick one of themselves to be in charge — without a referee. Every replicated system needs it. Somebody has to own the write path, advance the schema, run the cron job, hold the lock. The algorithms here are the ways a cluster nominates that somebody, and — far more interestingly — how it re-nominates a replacement when the current leader goes silent.

All five solve the same shape of problem, but they make wildly different deals. Bully and Ring are the textbook algorithms — pure ID comparisons over fully connected or ring-shaped networks. Raft-style uses random timers and per-term majority votes; the ID does not decide the winner, timing and quorum do. Lease-based offloads the entire problem to a single mutable lock with a TTL — whoever holds it leads, whoever stops renewing loses. ZooKeeper's ephemeral-sequential scheme combines lease and watcher semantics into a herd-free, in-tree solution.
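To make the "pure ID comparison" idea concrete, here is a minimal single-process sketch in Python. It computes the outcome of each textbook algorithm rather than exchanging real messages, and the function names, the `alive` set, and the node IDs are all illustrative assumptions.

```python
def bully_winner(alive_ids):
    """Bully outcome: the highest ID among the reachable nodes becomes leader."""
    if not alive_ids:
        raise ValueError("no live nodes")
    return max(alive_ids)


def ring_winner(ring, alive, starter):
    """Ring outcome: pass a token clockwise from `starter`, skip dead nodes,
    collect every live ID, and pick the highest once the token is back."""
    if starter not in alive:
        raise ValueError("starter must be alive")
    start = ring.index(starter)
    collected = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if node in alive:
            collected.append(node)
    return max(collected), collected


ring = [3, 7, 1, 9, 4]   # hypothetical node IDs in clockwise order
alive = {3, 7, 1, 4}     # node 9, the old leader, has gone silent

print(bully_winner(alive))                  # 7
print(ring_winner(ring, alive, starter=1))  # (7, [1, 4, 3, 7])
```

Both functions land on the same winner because both reduce to "highest live ID"; what differs in a real deployment is who talks to whom and how many messages that takes.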

Open any production system that needs a single coordinator — etcd, Kubernetes controllers, Kafka KRaft, ZooKeeper, Patroni, Redis Sentinel, HashiCorp Consul, Spark standalone, ClickHouse Keeper — and one of these patterns is under the hood. The choice you make decides how fast failover is, how badly a network partition hurts you, and whether two nodes can ever briefly think they are both leader.

Where this shows up

  • Replicated databases — Postgres-HA via Patroni elects a primary; Redis Sentinel/Cluster elects a master per shard; MySQL Group Replication elects a writer.
  • Coordination services: etcd and Consul run Raft elections internally; ZooKeeper's ensemble elects its own leader with ZAB's Fast Leader Election, and separately offers ephemeral znodes so that client applications can run elections of their own.
  • Cron-style singletons: Kubernetes controllers and other scheduled jobs that must run on exactly one node commonly use lease-based election, for example against the Kubernetes API server.
  • Stream processing: Kafka KRaft elects a controller via Raft; the active controller then chooses partition leaders and records them in the metadata log.
  • Service-discovery & locks — "is this the canary?" "who owns this shard?" "who runs the compaction?" — every answer flows out of a leader election somewhere.
  • Embedded routing protocols — OSPF designated routers, IS-IS DIS election: a tiny leader election lives inside every link-state network.

Split-brain is the failure to avoid

The whole point of leader election is at most one leader at a time. If two nodes both believe they are leader, both will accept writes and the cluster's invariants break. Each algorithm here has a different mechanism for preventing it — quorum (Raft), monotonic terms, lease expiry, ephemeral session — and understanding which mechanism stops split-brain is the most important thing you can learn about leader election.
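The quorum mechanism can be sketched in a few lines. The Python below is a simplified model, not Raft itself: it assumes five voters that each grant at most one vote per term, it treats the candidates as external to those voters, and it omits the log up-to-date check that real Raft adds. The `Voter` class and `wins` helper are hypothetical names.

```python
class Voter:
    """Tracks the highest term seen and who received this node's vote in it."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term, candidate):
        if term < self.current_term:
            return False                  # stale candidate, reject
        if term > self.current_term:
            self.current_term = term      # a newer term resets the vote
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False                      # already voted for someone else


def wins(votes, cluster_size):
    """A candidate needs a strict majority of the whole cluster."""
    return votes > cluster_size // 2


voters = [Voter() for _ in range(5)]
# Candidates "a" and "b" both campaign in term 1; "a" reaches three voters first.
a_votes = sum(v.request_vote(1, "a") for v in voters[:3])
b_votes = sum(v.request_vote(1, "b") for v in voters)

print(wins(a_votes, 5), wins(b_votes, 5))  # True False
```

Because any two majorities of the same cluster overlap in at least one voter, and that voter votes only once per term, two candidates cannot both win the same term.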

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

01 Bully
  • Picks by: Highest ID wins
  • Network shape: Fully connected, every node knows the others
  • Failover trigger: Any node notices leader silent → starts election
  • Best for: Small static clusters; teaching the textbook algorithm.

02 Ring
  • Picks by: Highest ID collected on a clockwise pass
  • Network shape: Logical ring (each node knows its successor)
  • Failover trigger: Neighbour notices leader silent → starts token
  • Best for: Topologies that are already a ring; protocols with constrained connectivity.

03 Raft-style
  • Picks by: Per-term majority vote (timing breaks ties)
  • Network shape: Any-to-any; needs 2f+1 nodes to tolerate f failures
  • Failover trigger: Election timer (150–300 ms) fires on a follower
  • Best for: Replicated state machines; the default election scheme of 2026 (etcd, Consul, CockroachDB).

04 Lease-based
  • Picks by: First to grab a free lock + keeps renewing
  • Network shape: Shared store everyone can talk to (etcd, Redis, DB)
  • Failover trigger: Lease TTL drains to zero with no renewal
  • Best for: Kubernetes-style singleton controllers; piggybacking on an existing strongly-consistent store.

05 ZooKeeper
  • Picks by: Smallest sequence number in /election
  • Network shape: All nodes connected to a ZK ensemble
  • Failover trigger: Ephemeral znode auto-deletes on session loss
  • Best for: Systems already running ZooKeeper; massive clusters where the no-herd watcher chain matters.
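The ZooKeeper row is the least obvious of the five, so here is a small sketch of its decision rule in plain Python. It does not talk to ZooKeeper; it only shows how a node would interpret the children of /election. The `election_view` function and the `n_` name prefix are assumptions for illustration.

```python
def election_view(znodes, mine):
    """Given child names like 'n_0000000007', return (is_leader, watch_target).

    The smallest sequence number leads. Every other node watches only the
    node immediately ahead of it, which is what avoids the herd effect."""
    ordered = sorted(znodes, key=lambda name: int(name.rsplit("_", 1)[1]))
    idx = ordered.index(mine)
    if idx == 0:
        return True, None
    return False, ordered[idx - 1]


children = ["n_0000000012", "n_0000000007", "n_0000000009"]
print(election_view(children, "n_0000000007"))  # (True, None)
print(election_view(children, "n_0000000012"))  # (False, 'n_0000000009')

# The leader's session is lost, so its ephemeral znode disappears.
children.remove("n_0000000007")
print(election_view(children, "n_0000000009"))  # (True, None)
```

When the leader's znode vanishes, only the node watching it is notified and re-evaluates; the rest of the chain stays quiet.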

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.

How to pick

  • You already run a Raft/etcd/Consul cluster? Use lease-based election against it. You get a battle-tested store, a fence token (the lease revision), and zero new infrastructure. This is the Kubernetes pattern.
  • You're building a replicated state machine from scratch? Use Raft-style election as part of the consensus protocol — it pairs naturally with the log-replication state and gets you the term number you need anyway.
  • You already run ZooKeeper? Use the ephemeral-sequential znode recipe. Watching only your predecessor is the textbook way to avoid an election storm when one node dies.
  • Small static cluster, simple shape, no shared store? Bully or Ring still work fine for tens of nodes and are easy to implement. Just be honest about the message cost at failover: O(n²) in the worst case for Bully, O(n) for a single Ring pass.
  • You want guaranteed at-most-one leader under any network partition? Pick the algorithm whose safety rests on a quorum (Raft) or an external lock with a fence token (lease), not on timing alone — the textbook Bully and Ring algorithms can elect two leaders during certain partitions.
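The lease-based option in the list above can be sketched as follows. This is an in-memory stand-in, assuming a store that applies each call atomically (a real deployment would rely on the store's compare-and-swap or transaction support). `LeaseStore` and `acquire_or_renew` are hypothetical names, and time is passed in explicitly so the example is deterministic.

```python
class LeaseStore:
    """In-memory stand-in for a strongly consistent store (hypothetical)."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0
        self.revision = 0   # bumps on every takeover; usable as a fence token

    def acquire_or_renew(self, node, now, ttl):
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + ttl      # current leader renews in time
            return True
        if self.holder is None or now >= self.expires_at:
            self.holder = node               # lease is free or expired: take it
            self.expires_at = now + ttl
            self.revision += 1
            return True
        return False                         # someone else holds a live lease


store = LeaseStore()
print(store.acquire_or_renew("a", now=0, ttl=10))   # True  (a becomes leader)
print(store.acquire_or_renew("b", now=5, ttl=10))   # False (a's lease is live)
print(store.acquire_or_renew("a", now=8, ttl=10))   # True  (renewed to t=18)
print(store.acquire_or_renew("b", now=20, ttl=10))  # True  (a stopped renewing)
print(store.holder, store.revision)                 # b 2
```

Failover time here is bounded by the TTL: a shorter TTL means faster takeover but more renewal traffic and less tolerance for a slow leader.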

Fence tokens, not heartbeats

Heartbeats tell you the old leader was probably alive recently. They don't stop it from doing damage right now if it's just been slow. The defence is a monotonically increasing fence token issued at election time — every write the leader does carries the token, and the store rejects writes with stale tokens. Raft uses the term; lease-based uses the lease revision; ZooKeeper uses the zxid. Without a fence token, a paused-then-resumed old leader can still corrupt state after the new one has been elected. Always design for that case.
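A minimal sketch of the store-side check, assuming the token is a plain integer that grows with each election. `FencedStore` is a hypothetical class for illustration, not the API of any particular system.

```python
class FencedStore:
    """Accepts a write only if its token is at least the highest seen so far."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False                 # stale leader: reject the write
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
print(store.write(1, "x", "old-leader"))  # True
print(store.write(2, "x", "new-leader"))  # True  (new leader, higher token)
print(store.write(1, "x", "late write"))  # False (old leader woke up late)
print(store.data["x"])                    # new-leader
```

The old leader never has to learn that it was deposed; the store enforces it on every write.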
