Raft — Consensus

Overview

What this concept solves

Raft is Multi-Paxos rewritten so humans can understand it. Diego Ongaro and John Ousterhout published it in 2014 ("In Search of an Understandable Consensus Algorithm") with one explicit design goal: make the protocol teachable. They split consensus into three relatively independent sub-problems (leader election, log replication, and safety) and built each one so its rules can be stated in a couple of sentences. The result has become the consensus algorithm most new systems reach for first.

Every server is in one of three states: Follower (passive, takes orders), Candidate (running for leader), or Leader (the one issuing commands). Time is divided into terms — a monotonically increasing integer. Each term has at most one leader. Followers know about the leader by receiving heartbeats; if heartbeats stop arriving, a follower's election timer fires and it becomes a candidate, increments the term, and asks for votes.
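The role and term rules above can be sketched in a few lines of Go. This is a minimal illustration, not a full server: the `Server` type and the `ObserveTerm` and `StartElection` names are hypothetical and chosen only for this example.

```go
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

func (r Role) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[r]
}

// Server holds only the fields needed to illustrate term handling.
type Server struct {
	ID          int
	CurrentTerm int
	VotedFor    int // -1 means no vote cast in CurrentTerm
	Role        Role
}

// ObserveTerm applies the rule that a message carrying a higher term
// forces the receiver back to Follower in that term.
// It reports whether the server moved to a new term.
func (s *Server) ObserveTerm(term int) bool {
	if term <= s.CurrentTerm {
		return false
	}
	s.CurrentTerm = term
	s.VotedFor = -1
	s.Role = Follower
	return true
}

// StartElection is what a follower does when its election timer fires:
// bump the term, become a candidate, and vote for itself.
func (s *Server) StartElection() {
	s.CurrentTerm++
	s.Role = Candidate
	s.VotedFor = s.ID
}

func main() {
	s := &Server{ID: 1, VotedFor: -1}
	s.StartElection()
	fmt.Println(s.Role, s.CurrentTerm) // Candidate 1
	s.ObserveTerm(3)
	fmt.Println(s.Role, s.CurrentTerm) // Follower 3
}
```

Because terms only ever increase, a server that has been partitioned away finds out it is stale from the first message it receives after reconnecting.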

Once elected, the leader replicates client commands by appending them to its log and sending AppendEntries to every follower. When a majority has stored an entry, the leader marks it committed and applies it to the state machine. Followers learn about the commit via the next AppendEntries. That's it. The whole protocol — election + replication + a small safety rule on which logs are allowed to win elections — fits on a single diagram. That's why etcd, Consul, CockroachDB, TiDB, Kafka's KRaft, and dozens of other systems all ship it.

Mechanics

How it works

Leader election

Followers expect a heartbeat (an empty AppendEntries from the leader) every ~50ms. If their election timer (randomized between 150–300ms) fires first, they become a Candidate.
Candidate: increment term, vote for self, send RequestVote(term, lastLogIndex, lastLogTerm) to every other server.
Each server votes YES at most once per term, and only for a candidate whose log is at least as up-to-date as its own. The up-to-date test is: later term wins; same term, longer log wins.
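If the candidate gets votes from a majority, it becomes leader and starts sending heartbeats. If it sees a higher term, or an AppendEntries from a leader whose term is at least its own, it reverts to follower. If nobody wins (split vote), the election timer expires again and the candidate starts a new election in the next term; because each server draws a fresh random timeout, repeated splits are unlikely.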

Log replication

Client sends a command to the leader. Leader appends (term, index, command) to its log.
Leader sends AppendEntries to every follower, carrying the new entry plus prevLogIndex/prevLogTerm for consistency check.
Follower accepts only if its log matches up to prevLogIndex/prevLogTerm. On mismatch, the leader decrements nextIndex for that follower and retries — that's how out-of-date followers catch up.
When the leader sees the entry stored on a majority, it advances commitIndex and applies the entry to its state machine. Followers learn the new commitIndex from the next AppendEntries and apply too.
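The catch-up loop described above can be simulated without any networking. The sketch below is an assumption-laden illustration: `accepts` stands in for the follower's consistency check, `findNextIndex` stands in for the leader's retry loop, and both names are hypothetical. Index -1 is used to mean "before the first entry", matching the 0-based convention of the reference code.

```go
package main

import "fmt"

type Entry struct {
	Term int
}

// accepts mimics the follower-side consistency check: the follower must
// hold an entry at prevIndex whose term equals prevTerm.
// prevIndex == -1 means "start of log" and always matches.
func accepts(follower []Entry, prevIndex, prevTerm int) bool {
	if prevIndex < 0 {
		return true
	}
	return prevIndex < len(follower) && follower[prevIndex].Term == prevTerm
}

// findNextIndex walks nextIndex backwards, one rejected AppendEntries at
// a time, until the follower accepts. It returns the index from which the
// leader must resend entries and the number of rejected attempts.
func findNextIndex(leader, follower []Entry) (nextIndex, rejected int) {
	nextIndex = len(leader) // optimistic start: just past the leader's last entry
	for {
		prevIndex := nextIndex - 1
		prevTerm := 0
		if prevIndex >= 0 {
			prevTerm = leader[prevIndex].Term
		}
		if accepts(follower, prevIndex, prevTerm) {
			return nextIndex, rejected
		}
		rejected++
		nextIndex--
	}
}

// terms builds a log from a list of entry terms.
func terms(ts ...int) []Entry {
	out := make([]Entry, len(ts))
	for i, t := range ts {
		out[i] = Entry{Term: t}
	}
	return out
}

func main() {
	leader := terms(1, 1, 2, 3, 3)
	follower := terms(1, 1, 2, 2) // diverges at index 3
	next, rejected := findNextIndex(leader, follower)
	fmt.Println(next, rejected) // 3 2
}
```

Here the follower rejects twice, then accepts at `prevIndex` 2, so the leader resends from index 3 and the follower's conflicting entry at that index is replaced. Stepping back one index per rejection is the simplest strategy; it costs one round trip per divergent entry.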

The two safety rules in one paragraph

Election restriction: a server only grants its vote to a candidate whose log is at least as up-to-date as its own. Leader completeness: if an entry is committed in term T, every leader of any term ≥ T has that entry in its log. Together these guarantee that committed entries are never lost, even across arbitrary leader changes.

The one tricky case — committing entries from prior terms

A new leader cannot treat an entry from a previous term as committed just because it is stored on a majority: such an entry can still be overwritten by a later leader in a corner case (Figure 8 in the paper). The leader must instead wait until an entry from its own current term is committed, which implicitly commits every earlier entry too. This subtlety is what makes Raft non-trivial, and it is the rule new implementations most often get wrong.
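One way to encode this rule on the leader is to gate every commitIndex advance on the entry's term. The sketch below assumes a `matchIndex` slice that includes the leader itself and uses 0-based indices with -1 meaning "nothing stored"; `advanceCommit` is a hypothetical helper name, not part of the reference code.

```go
package main

import (
	"fmt"
	"sort"
)

// advanceCommit returns the leader's new commitIndex.
// logTerms[i] is the term of the leader's entry at index i.
// matchIndex holds, for every server including the leader, the highest
// index known to be stored there (-1 for none).
func advanceCommit(logTerms, matchIndex []int, currentTerm, commitIndex int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// After a descending sort, position len/2 is the highest index that a
	// strict majority of servers has stored.
	majorityIdx := sorted[len(sorted)/2]
	// Only an entry from the leader's own term may be committed by counting
	// replicas. Log terms never decrease, so checking this one index is enough.
	if majorityIdx > commitIndex && logTerms[majorityIdx] == currentTerm {
		return majorityIdx
	}
	return commitIndex
}

func main() {
	// Leader of term 4 holds a term-2 entry at index 1 on 3 of 5 servers.
	fmt.Println(advanceCommit([]int{1, 2}, []int{1, 1, 1, 0, 0}, 4, 0)) // 0
	// Once a term-4 entry at index 2 reaches a majority, index 1 commits with it.
	fmt.Println(advanceCommit([]int{1, 2, 4}, []int{2, 2, 2, 0, 0}, 4, 0)) // 2
}
```

In the first call the term-2 entry sits on a majority, yet commitIndex stays put. In the second, committing the term-4 entry at index 2 commits everything before it as well.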

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.


About this simulation

Five servers running Raft. Pick a scenario — Leader election (a follower times out, requests votes, becomes leader), Replicate a command (one RTT of AppendEntries), or Crash the leader to see a new term begin. Free play queues commands and lets you crash whenever; the log card below holds only the last two messages.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Watch a leader get elected

Open Leader election and step through. A follower's election timer fires, it becomes a candidate, sends RequestVote, collects a majority (3 of 5), and becomes leader. Notice that the term advances by one — every election uses a fresh term so old leaders can't accidentally rejoin and confuse the cluster.

try 02

Replicate a command in one RTT

Switch to Replicate a command (or continue after election). Click a command button. The leader appends it (amber = tentative), broadcasts AppendEntries, collects acks from a majority, and commits (green = permanent). Followers commit on the next heartbeat. One round-trip per command — same hot path as Multi-Paxos.

try 03

Crash the leader

Run Crash the leader. Heartbeats stop, a follower's timer fires, a new election begins with the next term. Committed entries survive (a majority still holds them) and the new leader picks them up via the up-to-date-log rule. Notice that the cluster is briefly unavailable for writes — bounded but real.

try 04

Free play — break it yourself

Open Free play. Try crashing the leader mid-replication (between Append and Committed). Try queueing four commands fast. Notice that the election restriction means a server whose log is missing committed entries can never win an election: only candidates whose log is at least as up-to-date as the voter's get the vote. Combined with the rule against directly committing entries from earlier terms, that restriction is what gives Raft leader completeness.

In practice

When to use it — and what you give up

When it's the right tool

Building a replicated state machine — key-value store, configuration registry, distributed lock service. Raft is the default in 2026.
Replicated logs of any kind — event sourcing, write-ahead logs across replicas, leader-based event streaming.
Teams that need to onboard new engineers quickly — Raft's three-part decomposition is what makes it the friendly choice over Paxos.
Heterogeneous environments — etcd's raft library and HashiCorp's raft (both Go), TiKV's raft-rs (Rust), Apache Ratis (Java). Pick a language and you'll likely find a maintained Raft.

When to reach for something else

Byzantine fault model — Raft assumes crash failures only. Use PBFT or HotStuff.
Leaderless / multi-master writes — Raft funnels everything through the leader. EPaxos, Mencius, Generalized Paxos remove that bottleneck.
Atomic commit across services — that's 2PC (or a Raft-replicated coordinator), not pure Raft.
Extreme wide-area latency — every commit costs a round trip from the leader to a majority, so cross-region RTT lands directly on write latency; geo-replication with witnesses or quorum tuning may suit better.

Pros

Designed for understandability — three sub-problems, two safety rules. The 2014 paper is one of the most readable algorithms papers of its era.
Strong-leader simplicity — only the leader appends to the log, only the leader talks to clients. Easy to reason about.
Same fault tolerance as Paxos — a cluster of 2f+1 nodes tolerates f crash failures.
Mature library ecosystem — production-grade implementations in Go, Rust, Java, C++, Python, Erlang.
One round-trip per command in steady state — same hot path as Multi-Paxos.
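The 2f+1 arithmetic in the list above is easy to check by hand, or with a throwaway sketch like this one (the helper names `quorum` and `tolerated` are made up for the example):

```go
package main

import "fmt"

// quorum returns the smallest number of servers that forms a strict
// majority of a cluster of n.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many crashed servers a cluster of n can lose
// while the survivors still form a quorum.
func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerated=%d\n", n, quorum(n), tolerated(n))
	}
}
```

Running it shows that a fourth node raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why clusters are usually sized at 3, 5, or 7.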

Cons

Leader is a write bottleneck — every write funnels through it; throughput caps at one node's resources.
Brief unavailability on leader failure — roughly one election timeout (the 150–300ms range above is typical) plus the vote round trip, longer if votes split. Clients see timeouts or retries in that window.
Strong-leader assumption doesn't fit everything — multi-master databases need different protocols.
Subtle commit-across-terms rule is the most common source of correctness bugs in new implementations.
Snapshotting and membership changes are bolted on, not baked into the core algorithm — additional code paths that need careful testing.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

raft.go

// Simplified Raft RPC handlers. The protocol fits in two messages:
// RequestVote (elections) and AppendEntries (heartbeats + replication).
package raft

type LogEntry struct {
	Term int
	Cmd  string
}

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type RaftServer struct {
	currentTerm int
	votedFor    int // -1 means none
	log         []LogEntry
	commitIndex int // -1 means empty
	role        Role
}

func NewRaftServer() *RaftServer {
	return &RaftServer{votedFor: -1, commitIndex: -1, role: Follower}
}

type RequestVoteReq struct {
	Term, CandidateID, LastLogIndex, LastLogTerm int
}
type RequestVoteResp struct {
	Term        int
	VoteGranted bool
}

// --- election ---
func (s *RaftServer) OnRequestVote(req RequestVoteReq) RequestVoteResp {
	if req.Term > s.currentTerm {
		s.currentTerm = req.Term
		s.votedFor = -1
		s.role = Follower
	}
	myLastTerm := 0
	if len(s.log) > 0 {
		myLastTerm = s.log[len(s.log)-1].Term
	}
	myLastIdx := len(s.log) - 1
	upToDate := req.LastLogTerm > myLastTerm ||
		(req.LastLogTerm == myLastTerm && req.LastLogIndex >= myLastIdx)
	free := s.votedFor == -1 || s.votedFor == req.CandidateID
	grant := req.Term == s.currentTerm && free && upToDate
	if grant {
		s.votedFor = req.CandidateID
	}
	return RequestVoteResp{Term: s.currentTerm, VoteGranted: grant}
}

type AppendEntriesReq struct {
	Term, LeaderID, PrevLogIndex, PrevLogTerm int
	Entries                                   []LogEntry
	LeaderCommit                              int
}
type AppendEntriesResp struct {
	Term    int
	Success bool
}

// --- replication ---
func (s *RaftServer) OnAppendEntries(req AppendEntriesReq) AppendEntriesResp {
	if req.Term < s.currentTerm {
		return AppendEntriesResp{Term: s.currentTerm, Success: false}
	}
	if req.Term > s.currentTerm {
		s.currentTerm = req.Term
		s.votedFor = -1
	}
	s.role = Follower

	// log consistency check
	if req.PrevLogIndex >= 0 {
		if req.PrevLogIndex >= len(s.log) || s.log[req.PrevLogIndex].Term != req.PrevLogTerm {
			return AppendEntriesResp{Term: s.currentTerm, Success: false}
		}
	}
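	// Append new entries. Truncate only where an existing entry conflicts
	// (same index, different term), so a delayed or duplicate RPC cannot
	// erase entries the follower already holds.
	for i, e := range req.Entries {
		idx := req.PrevLogIndex + 1 + i
		if idx < len(s.log) && s.log[idx].Term != e.Term {
			s.log = s.log[:idx]
		}
		if idx >= len(s.log) {
			s.log = append(s.log, e)
		}
	}

	// Advance commitIndex to min(LeaderCommit, index of last new entry).
	// Entries past the last new one have not been checked against the leader.
	lastNew := req.PrevLogIndex + len(req.Entries)
	if req.LeaderCommit > s.commitIndex {
		newCommit := req.LeaderCommit
		if lastNew < newCommit {
			newCommit = lastNew
		}
		if newCommit > s.commitIndex {
			s.commitIndex = newCommit
			s.applyToStateMachine()
		}
	}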
	return AppendEntriesResp{Term: s.currentTerm, Success: true}
}

func (s *RaftServer) applyToStateMachine() { /* apply log[0..commitIndex] in order */ }


Knowledge check

Did the prototype land?

Quick questions to check your understanding.

question 01 / 03

What is a Raft "term" and why does it matter?

question 02 / 03

A follower votes for a candidate only if the candidate's log is "at least as up-to-date." What does up-to-date mean in Raft?

question 03 / 03

Which production system is built directly on a Raft implementation?
