Multi-Paxos — Consensus

Overview

What this concept solves

Multi-Paxos is what you actually use when you want a replicated log built on Paxos. Basic Paxos agrees on one value; a real system needs a sequence of values (a log of commands). The naive answer — run Basic Paxos once per log slot — burns two round-trips per command, plus risks live-lock from duelling proposers. Multi-Paxos fixes both with a single move: elect a stable leader once, then skip Phase 1 forever after.

The structure: a designated proposer (the leader) runs Paxos Phase 1 — Prepare/Promise — across a quorum, but with an extended ballot number that covers every future slot. Once a majority has promised, the leader has "won the right to propose" for as many slots as it wants. Each subsequent command goes through Phase 2 only — one Accept round-trip and the value is chosen. That's a 2× latency win in steady state, and the dueling-proposer livelock simply can't happen because there's only one proposer.

When the leader crashes, a survivor starts a new election with a higher ballot number, runs Phase 1 again, and takes over. Committed log entries are safe because they're already on a majority — the new leader, by the Paxos safety rule, will always learn about them during Promise. This same shape — leader + Phase-2-only steady state + leader change — is exactly what Raft and ZAB inherited. Multi-Paxos is the algorithm the modern world copies from.

Mechanics

How it works

Phase 1 (rare) — Elect a leader

Candidate picks a ballot number b higher than any seen before and sends PREPARE(b) once, to every acceptor — covering all future slots, not a specific one.
Each acceptor: promises not to accept anything below b and replies with the highest-numbered value already accepted in every slot it knows about.
When a majority promises, the candidate is the leader. It must reconcile any partially-accepted values from those promises (re-running Accept for any pending slot) before accepting new commands.

Phase 2 (common) — Stream commands

Client sends a command to the leader.
Leader assigns the next free slot, picks its own value, and sends ACCEPT(b, slot, value) to every acceptor.
Each acceptor stores the entry durably (if b matches its promise) and replies ACCEPTED.
When a majority has stored it, the leader marks the slot committed and applies it to the local state machine. It piggybacks the commit on the next ACCEPT so followers apply it too. That's one RTT per command.

Why Phase 1 is amortised

One Prepare covers all future slots. As long as the leader stays alive, you never run Phase 1 again — Prepare's cost is paid once per leader, not once per command. With a leader lasting hours, the per-command cost is essentially one Accept round-trip.

Leader-change recovery is the hard part

A new leader's Promise messages might report different values for different slots (some had ACCEPTED on one acceptor but not a majority). The leader must fill those slots before serving new traffic — re-running Accept for each. Real implementations also handle log compaction, snapshots, and membership changes here, which is why "Paxos Made Live" is twice as long as "Paxos Made Simple."

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

simulation › Multi-Paxos

About this simulation

Five nodes running Multi-Paxos. Pick a scenario — Elect a leader (the one-time Phase 1), Stream commands (Phase 2 only, one RTT each), or Crash the leader to watch re-election kick in. Free play lets you queue arbitrary commands and crash whenever; the message log keeps only the last two lines.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Watch the one-time leader election

Open the Elect a leader scenario. Step through Prepare → Promise → "Leader set." That's the two-RTT Phase 1, paid exactly once. From here on, every command will be cheap.

try 02

Stream commands at one RTT each

Switch to Stream commands (or stay on the same scenario after election). Hit the command buttons. Each command goes Leader → ACCEPT → majority ACCEPTED → Committed. One round-trip, no Prepare. Compare against Basic Paxos's two-RTT-per-value cost.

try 03

Crash the leader and watch failover

Run Crash the leader at any point in steady state. The committed entries are safe (a majority still holds them). A survivor runs Phase 1 again with a higher ballot, becomes the new leader, and resumes streaming. This is the recovery story Multi-Paxos sells.

try 04

Free play — break it yourself

Open Free play. Try crashing the leader mid-command (after Accept but before Committed): the new leader must reconcile the pending slot. Try crashing again right after the new election. Try queueing four commands in a row to see throughput. The structural pattern — leader, slots, log, recovery — is identical to what you'll see next in Raft.

In practice

When to use it — and what you give up

When it's the right tool

Replicated state machines / logs — exactly the job description. Google Chubby, Megastore, parts of Spanner, and many F1 components run Multi-Paxos.
Existing Paxos investment — if your team already speaks Paxos, you keep the wisdom; the structure is the same, just amortised.
You want the smallest-possible message envelope — a Phase-2-only Accept is roughly the minimum protocol for committed-by-majority replication.
You need the proven foundation — Multi-Paxos has been formally verified, model-checked, and run at hyperscale for two decades.

When to reach for something else

You're picking a new replication library in 2026 — pick Raft. Same shape, friendlier to humans, more tooling.
Byzantine fault model — Multi-Paxos assumes crash failures only. Use PBFT.
Single-decree only — if you really do agree on exactly one value, Basic Paxos is enough.
You're new to the protocol and on a deadline — the recovery edge cases (slot reconciliation, log gaps, membership changes) are where bugs live. Raft was designed exactly to make these easier.

Pros

Amortises the cost of Phase 1 — one Prepare per leader lifetime, then one Accept per command. Roughly half the latency of Basic Paxos.
No dueling-proposer livelock in steady state — only the leader proposes; followers can't compete.
Same fault tolerance as Basic Paxos — f failures of 2f+1 nodes.
Battle-tested at scale — Google has run this in production for over twenty years.
Generalises cleanly — Multi-Paxos is the structural template that Raft, ZAB, EPaxos, and Compartmentalised Paxos all build on.

Cons

Leader change is intricate — slot reconciliation across the promise replies has fiddly edge cases that take care to implement.
Specification has more dark corners than Basic Paxos — log compaction, snapshots, membership changes, leader leases — none of these are in the original paper.
Leader is a throughput bottleneck — every write goes through it, so one node's CPU/network caps the cluster.
Brief unavailability during election — bounded but real; clients see a few hundred ms of timeouts.
Harder to reason about than Raft — same algorithm in essentials, but Raft's split into election + log replication is friendlier.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

multi_paxos.go

// Multi-Paxos leader. Election is rare; the steady-state hot path is
// just Propose() which sends a single ACCEPT round-trip per command.

// External helpers (defined elsewhere):
//   func nextHigherBallot(ballot, myID int) int
//   func majority(n int) int
//   func collectPendingSlots(promises []*Promise) []PendingSlot
//   type Acceptor interface {
//       Prepare(ballot int) *Promise            // nil promise means NACK
//       Accept(ballot, slot int, value string) string // "ACCEPTED" or other
//   }
//   type PendingSlot struct { Slot int; Value string }

type LogEntry struct {
    Value     string
    Committed bool
}

type MultiPaxosLeader struct {
    acceptors []Acceptor
    myID      int
    ballot    int
    nextSlot  int
    log       []*LogEntry
}

func NewMultiPaxosLeader(acceptors []Acceptor, myID int) *MultiPaxosLeader {
    return &MultiPaxosLeader{
        acceptors: acceptors,
        myID:      myID,
        ballot:    0,
        nextSlot:  0,
    }
}

func (l *MultiPaxosLeader) ElectSelf() bool {
    l.ballot = nextHigherBallot(l.ballot, l.myID)

    // ONE prepare covers all future slots. Fan out concurrently.
    promises := make([]*Promise, len(l.acceptors))
    var wg sync.WaitGroup
    for i, a := range l.acceptors {
        wg.Add(1)
        go func(i int, a Acceptor) {
            defer wg.Done()
            promises[i] = a.Prepare(l.ballot)
        }(i, a)
    }
    wg.Wait()

    goodPromises := make([]*Promise, 0, len(promises))
    for _, p := range promises {
        if p != nil { // nil promise == NACK
            goodPromises = append(goodPromises, p)
        }
    }
    if len(goodPromises) < majority(len(l.acceptors)) {
        return false
    }

    // Reconcile partially-accepted slots: for each slot, if any promise
    // reported an acceptedV at any ballot, the leader MUST re-propose it.
    pendingSlots := collectPendingSlots(goodPromises)
    for _, ps := range pendingSlots {
        l.runAccept(ps.Slot, ps.Value)
        for len(l.log) <= ps.Slot {
            l.log = append(l.log, nil)
        }
        l.log[ps.Slot] = &LogEntry{Value: ps.Value, Committed: true}
    }
    l.nextSlot = len(l.log)
    return true // leader established; from here on, only Phase 2.
}

func (l *MultiPaxosLeader) Propose(value string) int {
    slot := l.nextSlot
    l.nextSlot++
    for len(l.log) <= slot {
        l.log = append(l.log, nil)
    }
    l.log[slot] = &LogEntry{Value: value, Committed: false}
    ok := l.runAccept(slot, value)
    if ok {
        l.log[slot].Committed = true
    }
    return slot
}

func (l *MultiPaxosLeader) runAccept(slot int, value string) bool {
    replies := make([]string, len(l.acceptors))
    var wg sync.WaitGroup
    for i, a := range l.acceptors {
        wg.Add(1)
        go func(i int, a Acceptor) {
            defer wg.Done()
            replies[i] = a.Accept(l.ballot, slot, value)
        }(i, a)
    }
    wg.Wait()

    accepted := 0
    for _, r := range replies {
        if r == "ACCEPTED" {
            accepted++
        }
    }
    return accepted >= majority(len(l.acceptors))
}

// Notice: hot path is one all-to-all ACCEPT and a quorum count. No
// Prepare. That's the whole point of Multi-Paxos.

References & further reading

6 sources

Knowledge check

Did the prototype land?

Quick questions, answers revealed on submit. Sign in to save your best score.

question 01 / 03

What's the central optimisation Multi-Paxos makes over running Basic Paxos repeatedly?

question 02 / 03

A new Multi-Paxos leader has just won election. Its promises report that one acceptor has an ACCEPTED entry for slot 7 but no other acceptor knows about it. What must the leader do?

question 03 / 03

Multi-Paxos and Raft share the same essential structure. Which of the following is the cleanest summary of the difference?

0/3 answered