Intermediate

Leader Election

How a cluster picks one node to be in charge — and how it picks again when that node falls over. Five algorithms, from the textbook ID-based shouting matches to the lease-and-watcher schemes real coordination services ship.

distributed · coordination · patterns

What is Leader Election?

The 60-second primer

Leader election is the trick of getting a group of equal machines to pick one of themselves to be in charge — without a referee. Every replicated system needs it. Somebody has to own the write path, advance the schema, run the cron job, hold the lock. The algorithms here are the ways a cluster nominates that somebody, and — far more interestingly — how it re-nominates a replacement when the current leader goes silent.

All five solve the same shape of problem, but they make wildly different deals. Bully and Ring are the textbook algorithms — pure ID comparisons over fully connected or ring-shaped networks. Raft-style uses random timers and per-term majority votes; the ID does not decide the winner, timing and quorum do. Lease-based offloads the entire problem to a single mutable lock with a TTL — whoever holds it leads, whoever stops renewing loses. ZooKeeper's ephemeral-sequential scheme combines lease and watcher semantics into a herd-free, in-tree solution.

Open any production system that needs a single coordinator — etcd, Kubernetes controllers, Kafka KRaft, ZooKeeper, Patroni, Redis Sentinel, HashiCorp Consul, Spark standalone, ClickHouse Keeper — and one of these patterns is under the hood. The choice you make decides how fast failover is, how badly a network partition hurts you, and whether two nodes can ever briefly think they are both leader.

Where this shows up

Replicated databases — Postgres-HA via Patroni elects a primary; Redis Sentinel/Cluster elects a master per shard; MySQL Group Replication elects a writer.
Coordination services — etcd and Consul run Raft elections internally; ZooKeeper's servers elect their own leader with Fast Leader Election (part of the ZAB protocol), while clients build their own elections on ephemeral znodes.
Cron-style singletons — Kubernetes controllers, Apache Spark drivers, and scheduled jobs that must run on exactly one node; Kubernetes controllers, for example, use lease-based election against the API server.
Stream processing — Kafka KRaft elects its active controller via Raft; the controller then chooses partition leaders and records them in the metadata log.
Service-discovery & locks — "is this the canary?" "who owns this shard?" "who runs the compaction?" — every answer flows out of a leader election somewhere.
Embedded routing protocols — OSPF designated routers, IS-IS DIS election: a tiny leader election lives inside every link-state network.

Split-brain is the failure to avoid

The whole point of leader election is at most one leader at a time. If two nodes both believe they are leader, both will accept writes and the cluster's invariants break. Each algorithm here has a different mechanism for preventing it — quorum (Raft), monotonic terms, lease expiry, ephemeral session — and understanding which mechanism stops split-brain is the most important thing you can learn about leader election.
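The quorum mechanism comes down to a small piece of arithmetic: any two majorities of the same cluster share at least one node, and if every node votes at most once per term, two candidates cannot both win that term. A minimal sketch of that arithmetic follows; the function names are illustrative and not taken from any library.

```python
def majority(n: int) -> int:
    """Smallest vote count that is more than half of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return (n - 1) // 2


def min_overlap(n: int) -> int:
    """Fewest nodes that two majorities of the same cluster must share."""
    return 2 * majority(n) - n


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n), min_overlap(n))
```

Note that going from 3 to 4 nodes raises the majority size without tolerating any extra failure, which is why odd cluster sizes (2f+1) are the usual choice.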

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

Algorithm	Picks by	Network shape	Failover trigger	Best for
01 Bully	Highest ID wins	Fully connected, every node knows the others	Any node notices leader silent → starts election	Small static clusters; teaching the textbook algorithm.
02 Ring	Highest ID collected on a clockwise pass	Logical ring (each node knows its successor)	Neighbour notices leader silent → starts token	Topologies that are already a ring; protocols with constrained connectivity.
03 Raft-style	Per-term majority vote (timing breaks ties)	Any-to-any; needs 2f+1 to tolerate f failures	Election timer (150–300 ms) fires on a follower	Replicated state machines; the default election scheme of 2026 (etcd, Consul, CockroachDB).
04 Lease-based	First to grab a free lock + keeps renewing	Shared store everyone can talk to (etcd, Redis, DB)	Lease TTL drains to zero with no renewal	Kubernetes-style singleton controllers; piggybacking on an existing strongly-consistent store.
05 ZooKeeper	Smallest sequence number in /election	All nodes connected to a ZK ensemble	Ephemeral znode auto-deletes on session loss	Systems already running ZooKeeper; massive clusters where the no-herd watcher chain matters.

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.

How to pick

You already run a Raft/etcd/Consul cluster? Use lease-based election against it. You get a battle-tested store, a fence token (the lease revision), and zero new infrastructure. This is the Kubernetes pattern.
You're building a replicated state machine from scratch? Use Raft-style election as part of the consensus protocol — it pairs naturally with the log-replication state and gets you the term number you need anyway.
You already run ZooKeeper? Use the ephemeral-sequential znode recipe. Watching only your predecessor is the textbook way to avoid an election storm when one node dies.
Small static cluster, simple shape, no shared store? Bully or Ring still work fine for tens of nodes and are easy to implement. Just be honest about the message cost at failover: O(n²) in the worst case for Bully, O(n) per token pass for Ring.
You want guaranteed at-most-one leader under any network partition? Pick the algorithm whose safety rests on a quorum (Raft) or an external lock with a fence token (lease), not on timing alone — the textbook Bully and Ring algorithms can elect two leaders during certain partitions.

Fence tokens, not heartbeats

Heartbeats tell you the old leader was probably alive recently. They don't stop it from doing damage right now if it's just been slow. The defence is a monotonically increasing fence token issued at election time — every write the leader does carries the token, and the store rejects writes with stale tokens. Raft uses the term; lease-based uses the lease revision; ZooKeeper uses the zxid. Without a fence token, a paused-then-resumed old leader can still corrupt state after the new one has been elected. Always design for that case.
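The fence-token check can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the API of any real store: `FencedStore` and its `write` method are hypothetical names, and the token stands in for a Raft term, a lease revision, or a zxid.

```python
class FencedStore:
    """Toy store that rejects writes carrying a stale fence token (hypothetical)."""

    def __init__(self) -> None:
        self.highest_token = 0   # highest token seen on any accepted write
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False         # writer was superseded by a newer leader
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
store.write(1, "owner", "node-a")          # old leader, elected with token 1
store.write(2, "owner", "node-b")          # new leader, elected with token 2
stale = store.write(1, "owner", "node-a")  # old leader resumes after a pause
print(stale, store.data["owner"])
```

The important design choice is that the check lives in the store, not in the leader: a paused leader cannot be trusted to notice that it has been replaced.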

Concepts in this track

5 concepts, in order

Each links to a concept page with its own explanation, prototype, and quiz.

Bully Algorithm

Whoever has the highest ID wins. When the leader dies, anyone can call an election — and higher IDs bully lower IDs out of contention.

Beginner · 10 min
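As a rough illustration of the Bully idea, here is a synchronous simulation in which no node fails during the election itself. The function name, the message accounting, and the `alive` set are assumptions made for the sketch; a real implementation exchanges these messages over the network with timeouts.

```python
def bully_election(all_ids: list[int], alive: set[int], initiator: int) -> tuple[int, int]:
    """Return (leader, messages_sent) for one simulated election."""
    messages = 0
    pending = [initiator]      # nodes that still have to run their election
    started: set[int] = set()
    leader = initiator
    while pending:
        node = pending.pop()
        if node in started:
            continue
        started.add(node)
        higher = [i for i in all_ids if i > node]
        messages += len(higher)              # ELECTION sent to every higher ID
        responders = [i for i in higher if i in alive]
        messages += len(responders)          # each live higher node answers OK
        if responders:
            pending.extend(responders)       # higher nodes take over
        else:
            leader = node                    # nobody higher answered: node wins
    messages += len(alive) - 1               # COORDINATOR sent to the other live nodes
    return leader, messages


# Node 5 was leader and has died; node 2 notices and starts the election.
leader, messages = bully_election([1, 2, 3, 4, 5], alive={1, 2, 3, 4}, initiator=2)
print(leader, messages)
```

Only the highest live ID ever finds that nobody above it answers, so it is the only node that can declare itself leader.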

Ring (Chang–Roberts)

Pass a token clockwise around a ring, each node appending its ID. Whoever's ID is highest when the token returns is the leader.

Beginner · 10 min
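The token-collecting pass described above can be simulated as follows. This is a sketch under simplifying assumptions: a single initiator, dead nodes are simply skipped, and the follow-up pass that announces the winner is not counted. The function name and parameters are illustrative.

```python
def ring_election(ring: list[int], alive: set[int], initiator: int) -> tuple[int, int]:
    """Circulate a token clockwise from `initiator`; return (leader, hops)."""
    token = [initiator]                       # IDs collected so far
    hops = 0
    pos = (ring.index(initiator) + 1) % len(ring)
    while ring[pos] != initiator:
        if ring[pos] in alive:                # dead nodes are skipped
            token.append(ring[pos])
            hops += 1                         # one message delivered to this node
        pos = (pos + 1) % len(ring)
    hops += 1                                 # final hop back to the initiator
    return max(token), hops


# Node 9 was leader and has died; node 1 starts the token.
leader, hops = ring_election([3, 7, 1, 9, 4], alive={3, 7, 1, 4}, initiator=1)
print(leader, hops)
```

With one initiator the collecting pass costs one message per live node; the cost grows if several nodes notice the failure and start tokens at the same time.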

Raft-Style Election

Random election timers + per-term majority votes. The first follower to time out runs for election; whoever gets a quorum leads the term.

Intermediate · 12 min
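The two ingredients, a randomized timer and one vote per node per term, can be sketched like this. It is a deliberately reduced model: the class and function names are hypothetical, calls are synchronous, and the log up-to-date check that real Raft applies before granting a vote is left out.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_vote_request(self, term: int, candidate: str) -> bool:
        if term < self.current_term:
            return False                  # request from a stale term
        if term > self.current_term:
            self.current_term = term      # newer term: the old vote no longer counts
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate    # at most one vote per term
            return True
        return False


def election_timeout_ms(rng: random.Random) -> float:
    """Random timeout in the 150-300 ms range, so followers rarely fire together."""
    return rng.uniform(150, 300)


def run_for_election(candidate: Node, peers: list[Node]) -> bool:
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1                             # the candidate votes for itself
    for peer in peers:
        if peer.handle_vote_request(candidate.current_term, candidate.node_id):
            votes += 1
    return votes >= (len(peers) + 1) // 2 + 1


a, b, c = Node("a"), Node("b"), Node("c")
won = run_for_election(a, [b, c])
print(won, a.current_term)
```

After this run, a second candidate asking for votes in term 1 is refused by `b` and `c`, because each has already voted in that term; it would have to start a higher term to win.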

Lease-Based Election

A single TTL-bounded lock in a shared store. Hold it and you're leader; stop renewing and someone else takes over when it expires.

Intermediate · 11 min
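A lease can be modelled as a record with a holder, an expiry time, and a revision. The sketch below is an in-memory stand-in for the shared store, with hypothetical names, and time is passed in explicitly so the behaviour is deterministic. A real deployment would rely on the store's own clock and compare-and-swap operations.

```python
from typing import Optional


class LeaseStore:
    """In-memory stand-in for a strongly consistent store (hypothetical)."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.expires_at = 0.0
        self.revision = 0

    def try_acquire(self, node: str, ttl: float, now: float) -> Optional[int]:
        """Acquire or renew the lease; return its revision, or None if refused."""
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.revision += 1        # new holder epoch, usable as a fence token
        elif self.holder != node:
            return None               # someone else holds an unexpired lease
        self.expires_at = now + ttl   # acquire and renew both extend the TTL
        return self.revision


store = LeaseStore()
print(store.try_acquire("a", ttl=10, now=0))    # a becomes leader
print(store.try_acquire("b", ttl=10, now=5))    # refused, lease still live
print(store.try_acquire("a", ttl=10, now=8))    # renewal, same revision
print(store.try_acquire("b", ttl=10, now=18))   # a stopped renewing, b takes over
```

The revision only changes when leadership changes hands or lapses, which is what makes it suitable as a fence token for downstream writes.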

ZooKeeper Election

Every node creates an ephemeral sequential znode; the smallest sequence is leader. Each follower watches only its predecessor — no herd effect.

Advanced · 12 min
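The decision each node makes after listing the election znodes can be sketched without a ZooKeeper client. The helper names and the `n_` prefix are assumptions for illustration; the only ZooKeeper behaviour relied on is that sequential znodes get a 10-digit, zero-padded counter appended to their name.

```python
from typing import Optional


def seq(znode: str) -> int:
    """Sequence number taken from the 10-digit suffix of a sequential znode name."""
    return int(znode[-10:])


def election_view(children: list[str], me: str) -> tuple[bool, Optional[str]]:
    """Return (am_i_leader, znode_to_watch) for the node that owns `me`."""
    ordered = sorted(children, key=seq)
    idx = ordered.index(me)
    if idx == 0:
        return True, None             # smallest sequence number: leader
    return False, ordered[idx - 1]    # watch only the immediate predecessor


children = ["n_0000000007", "n_0000000003", "n_0000000005"]
print(election_view(children, "n_0000000005"))

# The leader's session dies and its ephemeral znode disappears.
children.remove("n_0000000003")
print(election_view(children, "n_0000000005"))
print(election_view(children, "n_0000000007"))
```

When the leader's znode vanishes, only the node watching it is notified and re-evaluates; the node behind it keeps watching its own predecessor, which is what avoids the herd effect.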

Related tracks

If this one clicks, try these next.

Consistent Hashing

Map keys to servers so that adding or removing a server moves as few keys as possible. Five methods, from the classic hash ring to the table-based hashing inside modern network load balancers.

Intermediate · 5 concepts · 55 min

distributed · scaling · patterns

Circuit Breaker

When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.

Intermediate · 6 concepts · 70 min

distributed · resilience · patterns

Consensus

How a cluster agrees on a single answer when nodes die, packets drop, and some machines may even lie. Seven algorithms, from the two-phase commit that everyone learns first to the Byzantine-fault-tolerant PBFT.

Advanced · 7 concepts · 90 min

distributed · consistency · patterns