Consensus
How a cluster agrees on a single answer when nodes die, packets drop, and some machines may even lie. Seven algorithms, from the two-phase commit that everyone learns first to the Byzantine-fault-tolerant PBFT.
What is Consensus?
The 60-second primer
Consensus is the problem of getting a bunch of machines to agree on a single value, in the same order, even when some of them die or lie. It sounds modest until you try to do it. Networks lose packets, clocks drift, processes crash mid-decision, and the same byte can arrive twice or never. A consensus algorithm is a recipe for surviving every one of those failure modes while still producing one answer that everybody — both survivors and recovered nodes — agrees on.
All the algorithms here solve the same core problem, but they make different deals. 2PC is the textbook atomic-commit protocol — simple and blocking. 3PC patches the worst blocking case by adding a phase. Paxos is the foundational provably-correct algorithm, and Lamport's notes still inform everything that came after. Multi-Paxos, Raft, and ZAB are the production variants — they elect a stable leader and stream commands cheaply. And PBFT raises the difficulty: instead of crashes, what if some nodes actively lie?
If a database survives partitions, a key-value store stays strongly consistent across replicas, or a blockchain orders transactions deterministically — under the hood it is running one of these. Etcd, Consul, CockroachDB, Spanner, TiKV, FoundationDB, MongoDB's replica sets, Kafka's controller quorum, ZooKeeper, Hyperledger, every modern PoS chain — pick one and you can name the algorithm.
Where this shows up
- Replicated key-value stores — etcd, Consul, ZooKeeper all run a leader-based consensus (Raft, Raft, ZAB respectively) to keep N copies of the same map in the same order.
- Distributed databases — Spanner, CockroachDB, YugabyteDB, TiDB shard data into Raft groups so each shard's replicas always agree on the write order.
- Coordination services — leader election for cron jobs, distributed locks, service discovery — all need one answer the rest of the cluster can trust.
- Atomic commit across services — 2PC still ships in XA transactions, distributed SQL, microservice sagas (as a building block), and any "all participants commit or all abort" workflow.
- Kafka & message brokers — KRaft replaces ZooKeeper with an internal Raft for the metadata log; the controller quorum decides partition leadership.
- Blockchains — PBFT and its descendants (Tendermint, HotStuff, IBFT, Hyperledger Fabric's BFT order service) power permissioned and PoS chains where some validators might be hostile, not just crashed.
FLP — and why we still solve it anyway
The Fischer–Lynch–Paterson impossibility (1985) proves no deterministic asynchronous algorithm can guarantee consensus with even one faulty process. Real protocols dodge it with partial synchrony (assume timeouts eventually work) and randomized timers — that's why Raft randomizes election timeouts and Paxos retries with higher ballot numbers. The impossibility is real; we work around it by relaxing the model, not by ignoring it.
Side-by-side
How they compare
The same concepts, on the same axes. Use this as a map; the individual pages are the territory.
| Algorithm | Fault model | Quorum | Steady-state RTTs | Best for |
|---|---|---|---|---|
012PC | Coordinator + participants | All-of-N | 2 RTTs (vote + decide) | Atomic commit across known participants, brief, low-stakes blocking acceptable. |
023PC | Crash failures | All-of-N | 3 RTTs | Atomic commit that must not block on coordinator failure — synchronous network only. |
03Paxos | Crash failures | Majority (f+1 of 2f+1) | 2 RTTs per value | Single-decree consensus, theoretical foundation, the algorithm every other modern one descends from. |
04Multi-Paxos | Crash failures | Majority | 1 RTT (after election) | High-throughput replicated logs — Spanner, Chubby, Megastore. |
05Raft | Crash failures | Majority | 1 RTT (after election) | Same job as Multi-Paxos, easier to implement and reason about — etcd, Consul, CockroachDB. |
06ZAB | Crash failures | Majority | 1 RTT (broadcast) | Total-order atomic broadcast for a coordination service — ZooKeeper. |
07PBFT | Byzantine (lying) failures | 2f+1 of 3f+1 | 3 all-to-all rounds | Permissioned blockchains, BFT replication where nodes may actively misbehave. |
- Fault model
- Coordinator + participants
- Quorum
- All-of-N
- Steady-state RTTs
2 RTTs (vote + decide)- Best for
- Atomic commit across known participants, brief, low-stakes blocking acceptable.
- Fault model
- Crash failures
- Quorum
- All-of-N
- Steady-state RTTs
3 RTTs- Best for
- Atomic commit that must not block on coordinator failure — synchronous network only.
- Fault model
- Crash failures
- Quorum
- Majority (f+1 of 2f+1)
- Steady-state RTTs
2 RTTs per value- Best for
- Single-decree consensus, theoretical foundation, the algorithm every other modern one descends from.
- Fault model
- Crash failures
- Quorum
- Majority
- Steady-state RTTs
1 RTT (after election)- Best for
- High-throughput replicated logs — Spanner, Chubby, Megastore.
- Fault model
- Crash failures
- Quorum
- Majority
- Steady-state RTTs
1 RTT (after election)- Best for
- Same job as Multi-Paxos, easier to implement and reason about — etcd, Consul, CockroachDB.
- Fault model
- Crash failures
- Quorum
- Majority
- Steady-state RTTs
1 RTT (broadcast)- Best for
- Total-order atomic broadcast for a coordination service — ZooKeeper.
- Fault model
- Byzantine (lying) failures
- Quorum
- 2f+1 of 3f+1
- Steady-state RTTs
3 all-to-all rounds- Best for
- Permissioned blockchains, BFT replication where nodes may actively misbehave.
Decision guide
Which one should you use?
A practical tour of when each algorithm wins.
How to pick
- Single transaction across services? Reach for 2PC if you can tolerate a brief block on coordinator failure, or pair it with a saga for compensating actions. Don't reach for Paxos here — it solves a different shape of problem.
- You need atomic commit but cannot tolerate the 2PC blocking case? 3PC in a synchronous network, or — more often in 2026 — replicate the coordinator itself with Raft so its decision survives a crash.
- Replicated log / state machine, crash failures only? Raft. It's the default in 2026: well-understood, libraries everywhere, clear leader semantics. Reach for Multi-Paxos only if you already speak it fluently or interoperate with a Paxos system.
- Building a coordination service or ordered broadcast? ZAB is the in-tree choice for ZooKeeper; for anything new, you'd still pick Raft.
- Some nodes might be malicious, not just slow? PBFT or one of its descendants (Tendermint, HotStuff, IBFT). The cost: O(n²) messages and 3f+1 nodes to tolerate f liars.
Crash-fault is almost always enough
Inside one trust boundary — your own data centre, your own replicas — assume crash failures and pick Raft. Byzantine tolerance is for crossing trust boundaries (open consortia, permissioned blockchains) where you literally don't trust the other nodes. Don't pay the BFT message-complexity tax to defend against your own servers.
Concepts in this track
7 concepts, in order
Each links to a concept page with its own explanation, prototype, and quiz.
Two-Phase Commit (2PC)
A coordinator polls everyone, then tells them all the same answer. Atomic — but if the coordinator dies mid-decision, the cluster hangs.
Three-Phase Commit (3PC)
Slip a pre-commit phase between vote and commit so survivors can recover the decision without the coordinator. Non-blocking — at the cost of an extra round.
Paxos
Two rounds — Prepare and Accept — driven by ever-increasing proposal numbers, and the impossible-to-contradict invariant they enforce on a quorum.
Multi-Paxos
Pay for Prepare once, elect a stable leader, then stream commands with Accept alone. The shape every modern replication protocol ends up copying.
Raft
Paxos rewritten so humans can implement it: terms, an elected leader, a strictly append-only log, and AppendEntries as the only replication RPC.
ZAB — ZooKeeper Atomic Broadcast
ZooKeeper's variant. Elect by highest zxid, open a new epoch, sync followers up to the leader, then propose/ACK/commit in strict FIFO order.
PBFT — Practical Byzantine Fault Tolerance
Pre-Prepare, Prepare, Commit — three rounds of cross-checked broadcasts so 3f+1 nodes can agree even when f of them are actively lying.
Related tracks
If this one clicks, try these next.
Consistent Hashing
Map keys to servers so that adding or removing a server moves as few keys as possible. Five methods, from the classic hash ring to the table-based hashing inside modern network load balancers.
Circuit Breaker
When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.
Rate Limiting
Control request throughput so a noisy client cannot starve everyone else. Compare the five canonical algorithms side-by-side.