Advanced

Consensus

How a cluster agrees on a single answer when nodes die, packets drop, and some machines may even lie. Seven algorithms, from the two-phase commit that everyone learns first to the Byzantine-fault-tolerant PBFT.

distributedconsistencypatterns

Start with Two-Phase Commit (2PC)

What is Consensus?

The 60-second primer

Consensus is the problem of getting a bunch of machines to agree on a single value, in the same order, even when some of them die or lie. It sounds modest until you try to do it. Networks lose packets, clocks drift, processes crash mid-decision, and the same byte can arrive twice or never. A consensus algorithm is a recipe for surviving every one of those failure modes while still producing one answer that everybody — both survivors and recovered nodes — agrees on.

All the algorithms here solve the same core problem, but they make different deals. 2PC is the textbook atomic-commit protocol — simple and blocking. 3PC patches the worst blocking case by adding a phase. Paxos is the foundational provably-correct algorithm, and Lamport's notes still inform everything that came after. Multi-Paxos, Raft, and ZAB are the production variants — they elect a stable leader and stream commands cheaply. And PBFT raises the difficulty: instead of crashes, what if some nodes actively lie?

If a database survives partitions, a key-value store stays strongly consistent across replicas, or a blockchain orders transactions deterministically — under the hood it is running one of these. Etcd, Consul, CockroachDB, Spanner, TiKV, FoundationDB, MongoDB's replica sets, Kafka's controller quorum, ZooKeeper, Hyperledger, every modern PoS chain — pick one and you can name the algorithm.

Where this shows up

Replicated key-value stores — etcd, Consul, ZooKeeper all run a leader-based consensus (Raft, Raft, ZAB respectively) to keep N copies of the same map in the same order.
Distributed databases — Spanner, CockroachDB, YugabyteDB, TiDB shard data into Raft groups so each shard's replicas always agree on the write order.
Coordination services — leader election for cron jobs, distributed locks, service discovery — all need one answer the rest of the cluster can trust.
Atomic commit across services — 2PC still ships in XA transactions, distributed SQL, microservice sagas (as a building block), and any "all participants commit or all abort" workflow.
Kafka & message brokers — KRaft replaces ZooKeeper with an internal Raft for the metadata log; the controller quorum decides partition leadership.
Blockchains — PBFT and its descendants (Tendermint, HotStuff, IBFT, Hyperledger Fabric's BFT order service) power permissioned and PoS chains where some validators might be hostile, not just crashed.

FLP — and why we still solve it anyway

The Fischer–Lynch–Paterson impossibility (1985) proves no deterministic asynchronous algorithm can guarantee consensus with even one faulty process. Real protocols dodge it with partial synchrony (assume timeouts eventually work) and randomized timers — that's why Raft randomizes election timeouts and Paxos retries with higher ballot numbers. The impossibility is real; we work around it by relaxing the model, not by ignoring it.

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

Algorithm	Fault model	Quorum	Steady-state RTTs	Best for
012PC	Coordinator + participants	All-of-N	`2 RTTs (vote + decide)`	Atomic commit across known participants, brief, low-stakes blocking acceptable.
023PC	Crash failures	All-of-N	`3 RTTs`	Atomic commit that must not block on coordinator failure — synchronous network only.
03Paxos	Crash failures	Majority (f+1 of 2f+1)	`2 RTTs per value`	Single-decree consensus, theoretical foundation, the algorithm every other modern one descends from.
04Multi-Paxos	Crash failures	Majority	`1 RTT (after election)`	High-throughput replicated logs — Spanner, Chubby, Megastore.
05Raft	Crash failures	Majority	`1 RTT (after election)`	Same job as Multi-Paxos, easier to implement and reason about — etcd, Consul, CockroachDB.
06ZAB	Crash failures	Majority	`1 RTT (broadcast)`	Total-order atomic broadcast for a coordination service — ZooKeeper.
07PBFT	Byzantine (lying) failures	2f+1 of 3f+1	`3 all-to-all rounds`	Permissioned blockchains, BFT replication where nodes may actively misbehave.

012PC

Fault model: Coordinator + participants
Quorum: All-of-N
Steady-state RTTs: 2 RTTs (vote + decide)
Best for: Atomic commit across known participants, brief, low-stakes blocking acceptable.

023PC

Fault model: Crash failures
Quorum: All-of-N
Steady-state RTTs: 3 RTTs
Best for: Atomic commit that must not block on coordinator failure — synchronous network only.

03Paxos

Fault model: Crash failures
Quorum: Majority (f+1 of 2f+1)
Steady-state RTTs: 2 RTTs per value
Best for: Single-decree consensus, theoretical foundation, the algorithm every other modern one descends from.

04Multi-Paxos

Fault model: Crash failures
Quorum: Majority
Steady-state RTTs: 1 RTT (after election)
Best for: High-throughput replicated logs — Spanner, Chubby, Megastore.

05Raft

Fault model: Crash failures
Quorum: Majority
Steady-state RTTs: 1 RTT (after election)
Best for: Same job as Multi-Paxos, easier to implement and reason about — etcd, Consul, CockroachDB.

06ZAB

Fault model: Crash failures
Quorum: Majority
Steady-state RTTs: 1 RTT (broadcast)
Best for: Total-order atomic broadcast for a coordination service — ZooKeeper.

07PBFT

Fault model: Byzantine (lying) failures
Quorum: 2f+1 of 3f+1
Steady-state RTTs: 3 all-to-all rounds
Best for: Permissioned blockchains, BFT replication where nodes may actively misbehave.

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.

How to pick

Single transaction across services? Reach for 2PC if you can tolerate a brief block on coordinator failure, or pair it with a saga for compensating actions. Don't reach for Paxos here — it solves a different shape of problem.
You need atomic commit but cannot tolerate the 2PC blocking case? 3PC in a synchronous network, or — more often in 2026 — replicate the coordinator itself with Raft so its decision survives a crash.
Replicated log / state machine, crash failures only? Raft. It's the default in 2026: well-understood, libraries everywhere, clear leader semantics. Reach for Multi-Paxos only if you already speak it fluently or interoperate with a Paxos system.
Building a coordination service or ordered broadcast? ZAB is the in-tree choice for ZooKeeper; for anything new, you'd still pick Raft.
Some nodes might be malicious, not just slow? PBFT or one of its descendants (Tendermint, HotStuff, IBFT). The cost: O(n²) messages and 3f+1 nodes to tolerate f liars.

Crash-fault is almost always enough

Inside one trust boundary — your own data centre, your own replicas — assume crash failures and pick Raft. Byzantine tolerance is for crossing trust boundaries (open consortia, permissioned blockchains) where you literally don't trust the other nodes. Don't pay the BFT message-complexity tax to defend against your own servers.

Concepts in this track

7 concepts, in order

Each links to a concept page with its own explanation, prototype, and quiz.

Two-Phase Commit (2PC)

A coordinator polls everyone, then tells them all the same answer. Atomic — but if the coordinator dies mid-decision, the cluster hangs.

Beginner11 mintry it

Three-Phase Commit (3PC)

Slip a pre-commit phase between vote and commit so survivors can recover the decision without the coordinator. Non-blocking — at the cost of an extra round.

Intermediate12 mintry it

Paxos

Two rounds — Prepare and Accept — driven by ever-increasing proposal numbers, and the impossible-to-contradict invariant they enforce on a quorum.

Advanced14 mintry it

Multi-Paxos

Pay for Prepare once, elect a stable leader, then stream commands with Accept alone. The shape every modern replication protocol ends up copying.

Advanced13 mintry it

Raft

Paxos rewritten so humans can implement it: terms, an elected leader, a strictly append-only log, and AppendEntries as the only replication RPC.

Advanced14 mintry it

ZAB — ZooKeeper Atomic Broadcast

ZooKeeper's variant. Elect by highest zxid, open a new epoch, sync followers up to the leader, then propose/ACK/commit in strict FIFO order.

Advanced12 mintry it

PBFT — Practical Byzantine Fault Tolerance

Pre-Prepare, Prepare, Commit — three rounds of cross-checked broadcasts so 3f+1 nodes can agree even when f of them are actively lying.

Advanced14 mintry it

Related tracks

If this one clicks, try these next.

Consistent Hashing

Map keys to servers so that adding or removing a server moves as few keys as possible. Five methods, from the classic hash ring to the table-based hashing inside modern network load balancers.

Intermediate5 concepts · 55 min

distributedscalingpatterns

Circuit Breaker

When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.

Intermediate6 concepts · 70 min

distributedresiliencepatterns

Leader Election

How a cluster picks one node to be in charge — and how it picks again when that node falls over. Five algorithms, from the textbook ID-based shouting matches to the lease-and-watcher schemes real coordination services ship.

Intermediate5 concepts · 60 min

distributedcoordinationpatterns