Overview
What this concept solves
Raft is Multi-Paxos rewritten so humans can understand it. Diego Ongaro and John Ousterhout published it in 2014 ("In Search of an Understandable Consensus Algorithm") with one explicit design goal: make the protocol teachable. They split consensus into three independent sub-problems — leader election, log replication, and safety — and built each one so its rules can be stated in a couple of sentences. The result became the default consensus algorithm of the 2020s.
Every server is in one of three states: Follower (passive, takes orders), Candidate (running for leader), or Leader (the one issuing commands). Time is divided into terms — a monotonically increasing integer. Each term has at most one leader. Followers know about the leader by receiving heartbeats; if heartbeats stop arriving, a follower's election timer fires and it becomes a candidate, increments the term, and asks for votes.
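The role and term rules above can be sketched as two pure transition functions. This is only an illustration: the names NodeState, onElectionTimeout, and observeTerm are hypothetical, and a real server would also reset timers and persist state at these points.

```typescript
type Role = "follower" | "candidate" | "leader";

// Hypothetical minimal per-server state, for illustration only.
interface NodeState {
  role: Role;
  currentTerm: number;
  votedFor: number | null;
}

// Election timer fired: a follower or candidate starts a new election
// in a fresh term and votes for itself. Leaders run no election timer.
function onElectionTimeout(s: NodeState, selfId: number): NodeState {
  if (s.role === "leader") return s;
  return { role: "candidate", currentTerm: s.currentTerm + 1, votedFor: selfId };
}

// Any message carrying a higher term demotes the receiver to follower
// and clears its vote, because votes are counted per term.
function observeTerm(s: NodeState, seenTerm: number): NodeState {
  if (seenTerm > s.currentTerm) {
    return { role: "follower", currentTerm: seenTerm, votedFor: null };
  }
  return s;
}
```

Because terms only ever increase, a leader that was cut off and later sees any message from a newer term steps down immediately.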
Once elected, the leader replicates client commands by appending them to its log and sending AppendEntries to every follower. When a majority has stored an entry, the leader marks it committed and applies it to the state machine. Followers learn about the commit via the next AppendEntries. That's it. The whole protocol — election + replication + a small safety rule on which logs are allowed to win elections — fits on a single diagram. That's why etcd, Consul, CockroachDB, TiDB, Kafka's KRaft, and dozens of other systems all ship it.
Mechanics
How it works
Leader election
- Followers expect a heartbeat (an empty AppendEntries from the leader) every ~50ms. If their election timer (randomized between 150–300ms) fires first, they become a Candidate.
- Candidate: increment term, vote for self, send RequestVote(term, lastLogIndex, lastLogTerm) to every other server.
- Each server votes YES at most once per term, and only for a candidate whose log is at least as up-to-date as its own. The up-to-date test compares the last log entries: later term wins; same term, longer log wins.
- If the candidate gets votes from a majority, it becomes leader and starts sending heartbeats. If it sees a higher term or another leader, it reverts to follower. If nobody wins (split vote), the election times out and a new term begins after a freshly randomized timeout.
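A candidate's side of the election steps above might look like the following sketch. The function names and the VoteReply shape are assumptions for illustration; the 150–300ms defaults come from the text, and the random source is injectable only so the behaviour can be checked deterministically.

```typescript
// Pick a timeout in [minMs, maxMs). Randomizing it per server makes
// repeated split votes unlikely.
function electionTimeoutMs(
  minMs = 150,
  maxMs = 300,
  rng: () => number = Math.random,
): number {
  return minMs + Math.floor(rng() * (maxMs - minMs));
}

// Hypothetical reply shape for RequestVote.
interface VoteReply {
  term: number;
  voteGranted: boolean;
}

type ElectionOutcome = "won" | "stepped-down" | "pending";

// Count the replies received so far for an election in candidateTerm.
function tallyVotes(
  candidateTerm: number,
  clusterSize: number,
  replies: VoteReply[],
): ElectionOutcome {
  let votes = 1; // the candidate always votes for itself
  for (const r of replies) {
    if (r.term > candidateTerm) return "stepped-down"; // someone is ahead of us
    if (r.voteGranted && r.term === candidateTerm) votes++;
  }
  const needed = Math.floor(clusterSize / 2) + 1;
  return votes >= needed ? "won" : "pending";
}
```

In a five-server cluster the candidate needs two granted replies on top of its own vote. A "pending" result that lasts until the timer fires again is the split-vote case.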
Log replication
- Client sends a command to the leader. Leader appends (term, index, command) to its log.
- Leader sends AppendEntries to every follower, carrying the new entry plus prevLogIndex/prevLogTerm for the consistency check.
- Follower accepts only if its log matches up to prevLogIndex/prevLogTerm. On mismatch, the leader decrements nextIndex for that follower and retries; that's how out-of-date followers catch up.
- When the leader sees the entry stored on a majority, it advances commitIndex and applies the entry to its state machine. Followers learn the new commitIndex from the next AppendEntries and apply too.
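The leader's per-follower bookkeeping from the steps above can be sketched as a single reply handler. FollowerProgress and onAppendEntriesReply are hypothetical names, and indices are 0-based with -1 meaning "nothing replicated yet". This is an assumption-laden sketch, not how any particular library structures it.

```typescript
// What the leader tracks for each follower.
interface FollowerProgress {
  nextIndex: number;  // index of the next entry to send
  matchIndex: number; // highest index known to be stored on the follower
}

// Update progress after a reply to an AppendEntries request that carried
// sentCount entries following prevLogIndex.
function onAppendEntriesReply(
  p: FollowerProgress,
  success: boolean,
  prevLogIndex: number,
  sentCount: number,
): FollowerProgress {
  if (success) {
    // Everything up to the last entry we sent is now on the follower.
    const match = Math.max(p.matchIndex, prevLogIndex + sentCount);
    return { nextIndex: match + 1, matchIndex: match };
  }
  // Consistency check failed: back up one entry and retry, but never
  // below what the follower has already acknowledged.
  return {
    nextIndex: Math.max(p.matchIndex + 1, p.nextIndex - 1),
    matchIndex: p.matchIndex,
  };
}
```

Backing up one entry per rejection is the simplest strategy and the one the text describes. Faster catch-up schemes exist but are left out here.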
The two safety rules in one paragraph
Election restriction: a server only grants its vote to a candidate whose log is at least as up-to-date as its own. Leader completeness: if an entry is committed in term T, every leader of any term ≥ T has that entry in its log. Together these guarantee that committed entries are never lost, even across arbitrary leader changes.
The one tricky case — committing entries from prior terms
A new leader cannot treat an entry from a previous term as committed just because it is replicated on a majority: such an entry can still be overwritten by a later leader in a corner case (Figure 8 in the paper). It must wait until an entry from its own current term is committed; that implicitly commits all earlier entries too. This subtlety is what makes Raft non-trivial, and it is the rule new implementations most often get wrong.
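One way to encode this rule on the leader is to advance commitIndex only to an index whose entry carries the current term. The sketch below assumes 0-based indices and a matchIndex array with one slot per server, including the leader's own last index; advanceCommitIndex is a hypothetical helper name.

```typescript
interface Entry {
  term: number;
  cmd: string;
}

// Return the new commitIndex for a leader in currentTerm.
function advanceCommitIndex(
  log: Entry[],
  matchIndex: number[], // one slot per server, leader included
  commitIndex: number,
  currentTerm: number,
): number {
  // Sort descending: the value at position floor(n / 2) is the highest
  // index that a majority of servers have stored.
  const sorted = [...matchIndex].sort((a, b) => b - a);
  const majorityMatch = sorted[Math.floor(matchIndex.length / 2)];
  // Only an entry from the leader's own term may be committed by counting
  // replicas. Earlier entries become committed implicitly along with it.
  if (majorityMatch > commitIndex && log[majorityMatch]?.term === currentTerm) {
    return majorityMatch;
  }
  return commitIndex;
}
```

Checking only the entry at the majority index is enough because terms never decrease along a log: if that entry is from an older term, so is every entry before it.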
Interactive prototype
Run it. Break it. Tune it.
Sandboxed simulation embedded right in the page. No setup, no install.
About this simulation
Five servers running Raft. Pick a scenario — Leader election (a follower times out, requests votes, becomes leader), Replicate a command (one RTT of AppendEntries), or Crash the leader to see a new term begin. Free play queues commands and lets you crash whenever; the log card below holds only the last two messages.
Hands-on
Try these on your own
Open the prototype above, run each experiment, predict the answer, then verify.
Watch a leader get elected
Open Leader election and step through. A follower's election timer fires, it becomes a candidate, sends RequestVote, collects a majority (3 of 5), and becomes leader. Notice that the term advances by one — every election uses a fresh term so old leaders can't accidentally rejoin and confuse the cluster.
Replicate a command in one RTT
Switch to Replicate a command (or continue after election). Click a command button. The leader appends it (amber = tentative), broadcasts AppendEntries, collects acks from a majority, and commits (green = permanent). Followers commit on the next heartbeat. One round-trip per command — same hot path as Multi-Paxos.
Crash the leader
Run Crash the leader. Heartbeats stop, a follower's timer fires, a new election begins with the next term. Committed entries survive (a majority still holds them) and the new leader picks them up via the up-to-date-log rule. Notice that the cluster is briefly unavailable for writes — bounded but real.
Free play — break it yourself
Open Free play. Try crashing the leader mid-replication (between Append and Committed). Try queueing four commands fast. Notice that the election restriction means a candidate whose log is missing committed entries will never win an election: only candidates whose log is at least as up-to-date as the voter's get the vote. That restriction, together with the current-term commit rule, is the core of the safety story.
In practice
When to use it — and what you give up
When it's the right tool
- Building a replicated state machine — key-value store, configuration registry, distributed lock service. Raft is the default in 2026.
- Replicated logs of any kind — event sourcing, write-ahead logs across replicas, leader-based event streaming.
- Teams that need to onboard new engineers quickly — Raft's three-part decomposition is what makes it the friendly choice over Paxos.
- Heterogeneous environments: etcd's Raft library and HashiCorp's Raft (Go), TiKV's raft-rs (Rust), Apache Ratis (Java). Pick a language and you'll find a maintained Raft.
When to reach for something else
- Byzantine fault model — Raft assumes crash failures only. Use PBFT or HotStuff.
- Leaderless / multi-master writes — Raft funnels everything through the leader. EPaxos, Mencius, Generalized Paxos remove that bottleneck.
- Atomic commit across services — that's 2PC (or a Raft-replicated coordinator), not pure Raft.
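- Extreme wide-area latency: every write costs a round trip from the leader to a majority and consumes the leader's outbound bandwidth, so cross-region clusters pay WAN latency on each commit; geo-replication with witnesses or quorum tuning may suit better.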
Pros
- Designed for understandability — three sub-problems, two safety rules. The 2014 paper is one of the most readable algorithms papers of its era.
- Strong-leader simplicity — only the leader appends to the log, only the leader talks to clients. Easy to reason about.
- Same fault tolerance as Paxos — f failures of 2f+1 nodes.
- Mature library ecosystem — production-grade implementations in Go, Rust, Java, C++, Python, Erlang.
- One round-trip per command in steady state — same hot path as Multi-Paxos.
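The "f failures of 2f+1 nodes" arithmetic from the list above is easy to check directly. The helper names below are hypothetical.

```typescript
// Smallest number of servers that forms a majority of n.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// How many servers can crash while a majority remains reachable.
function faultsTolerated(n: number): number {
  return n - quorumSize(n);
}

// Cluster sizes 3, 4, 5 tolerate 1, 1, 2 crashes respectively.
const tolerance = [3, 4, 5].map(faultsTolerated);
```

An even-sized cluster tolerates no more failures than the odd size below it, which is why Raft deployments usually run 3 or 5 servers.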
Cons
- Leader is a write bottleneck — every write funnels through it; throughput caps at one node's resources.
- Brief unavailability on leader failure — bounded by the election timeout (typically 100–500ms), but real clients see timeouts.
- Strong-leader assumption doesn't fit everything — multi-master databases need different protocols.
- Subtle commit-across-terms rule is the most common source of correctness bugs in new implementations.
- Snapshotting and membership changes are bolted on, not baked into the core algorithm — additional code paths that need careful testing.
Reference
Code & further reading
A minimal reference implementation and pointers worth bookmarking.
// Simplified Raft RPC handlers. The protocol fits in two messages:
// RequestVote (elections) and AppendEntries (heartbeats + replication).
// Log indices are 0-based here; -1 means "no entry".
// A real server must persist currentTerm, votedFor, and log before replying.
type LogEntry = { term: number; cmd: string };
type Role = "follower" | "candidate" | "leader";

class RaftServer {
  currentTerm = 0;
  votedFor: number | null = null;
  log: LogEntry[] = [];
  commitIndex = -1;
  role: Role = "follower";

  // --- election ---
  onRequestVote(req: { term: number; candidateId: number; lastLogIndex: number; lastLogTerm: number }) {
    // A higher term always demotes us and clears our vote.
    if (req.term > this.currentTerm) { this.currentTerm = req.term; this.votedFor = null; this.role = "follower"; }
    const myLastTerm = this.log.at(-1)?.term ?? 0;
    const myLastIdx = this.log.length - 1;
    // Candidate's log must be at least as up-to-date as ours.
    const upToDate = req.lastLogTerm > myLastTerm
      || (req.lastLogTerm === myLastTerm && req.lastLogIndex >= myLastIdx);
    // At most one vote per term (re-granting to the same candidate is fine).
    const free = this.votedFor === null || this.votedFor === req.candidateId;
    const grant = req.term === this.currentTerm && free && upToDate;
    if (grant) this.votedFor = req.candidateId;
    return { term: this.currentTerm, voteGranted: grant };
  }

  // --- replication ---
  onAppendEntries(req: {
    term: number; leaderId: number; prevLogIndex: number; prevLogTerm: number;
    entries: LogEntry[]; leaderCommit: number;
  }) {
    if (req.term < this.currentTerm) return { term: this.currentTerm, success: false };
    if (req.term > this.currentTerm) { this.currentTerm = req.term; this.votedFor = null; }
    this.role = "follower";

    // log consistency check
    const prev = this.log[req.prevLogIndex];
    if (req.prevLogIndex >= 0 && (!prev || prev.term !== req.prevLogTerm)) {
      return { term: this.currentTerm, success: false };
    }

    // Truncate only from the first conflicting entry, then append what is missing.
    // Truncating unconditionally would let a stale or reordered request
    // (or a heartbeat) drop entries the follower already holds correctly.
    req.entries.forEach((entry, i) => {
      const idx = req.prevLogIndex + 1 + i;
      const existing = this.log[idx];
      if (existing && existing.term !== entry.term) this.log = this.log.slice(0, idx);
      if (idx >= this.log.length) this.log.push(entry);
    });

    // Advance commitIndex, but never past the last entry this request verified.
    const lastNewIdx = req.prevLogIndex + req.entries.length;
    if (req.leaderCommit > this.commitIndex) {
      this.commitIndex = Math.max(this.commitIndex, Math.min(req.leaderCommit, lastNewIdx));
      this.applyToStateMachine();
    }
    return { term: this.currentTerm, success: true };
  }

  private applyToStateMachine() { /* apply log[0..commitIndex] in order */ }
}
References & further reading
- Paper (raft.github.io)
Diego Ongaro & John Ousterhout, *In Search of an Understandable Consensus Algorithm (Extended Version)* (2014)
The Raft paper. 18 pages, plain prose, three diagrams. The single best introduction; aim to read it cover-to-cover.
- Docs (raft.github.io)
The Raft website: animated visualisation
Interactive visualisation of leader election, log replication, and partitions. Pair it with the paper for a deep grasp of the corner cases.
- Paper (github.com)
Diego Ongaro, *Consensus: Bridging Theory and Practice* (PhD, 2014)
Ongaro's thesis. Goes far beyond the paper: membership changes, log compaction, snapshotting, client interactions, performance tuning.
- Docs (github.com)
etcd Raft library (Go)
The production reference implementation that powers etcd, CockroachDB, TiKV (originally), and many others. Read README and docs for the practical surface area.
- Article (thesecretlivesofdata.com)
The Secret Lives of Data: Raft visualisation
Scroll-driven storybook walkthrough; great for the very first introduction before tackling the paper itself.
- Paper (arxiv.org)
Heidi Howard & Richard Mortier, *Paxos vs Raft: have we reached consensus?* (PaPoC 2020)
Side-by-side comparison showing Raft and Multi-Paxos are structurally equivalent. Useful when picking between them on a team.
- Article (jepsen.io)
Aphyr / Jepsen, Raft analyses
Real-world Raft-based systems (etcd, Consul, CockroachDB) tested under fault injection. Pages of bug discoveries that teach you what production Raft has to defend against.
Knowledge check
Did the prototype land?
Quick questions, answers revealed on submit. No scoring saved.
question 01 / 03
What is a Raft "term" and why does it matter?
question 02 / 03
A follower votes for a candidate only if the candidate's log is "at least as up-to-date." What does up-to-date mean in Raft?
question 03 / 03
Which production system is built directly on a Raft implementation?