Two-Phase Commit (2PC)

Overview

What this concept solves

Two-Phase Commit (2PC) is the textbook atomic-commit protocol — the one every distributed-systems course opens with. One process plays the role of coordinator; the rest are participants that each hold a piece of the transaction. The coordinator's job is to get every participant to commit, or get every participant to abort. Half-committed is forbidden.

The shape is exactly what the name says: two phases. In Phase 1 — Prepare, the coordinator asks every participant "can you commit?" Each participant prepares — writes the transaction to a durable log, holds the locks — then replies YES or NO. In Phase 2 — Decide, the coordinator looks at the votes. If every reply was YES, it broadcasts COMMIT. If even one was NO (or never arrived), it broadcasts ABORT. Either way the decision is unanimous.

2PC delivers atomicity, but it pays a famous price. If the coordinator dies after collecting YES votes but before broadcasting the decision, the participants are stuck. They can't unilaterally abort (the coordinator might have decided COMMIT and already told someone else), and they can't unilaterally commit (another participant might have voted NO, in which case the decision is ABORT). They block, holding locks, until the coordinator comes back. That blocking is what later protocols such as 3PC set out to fix.

Mechanics

How it works

Phase 1 — Prepare (the vote)

Coordinator writes BEGIN to its log, then sends PREPARE to every participant.
Each participant tentatively performs the work: locks rows, writes redo/undo records, fsyncs a PREPARED log entry. From this point on, the participant has promised it can commit if asked.
The participant replies YES if it managed to prepare, or NO if it ran out of disk, hit a constraint, or otherwise can't commit.
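
The participant's side of the vote can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `participant` type, its in-memory `log` slice (standing in for an fsynced write-ahead log), the `canCommit` callback, and `errDiskFull` are all hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// state is the participant's view of one transaction.
type state string

const (
	statePrepared state = "PREPARED"
	stateAborted  state = "ABORTED"
)

// errDiskFull is a sample failure used in the demo below.
var errDiskFull = errors.New("disk full")

// participant is a hypothetical resource manager. The log slice stands in
// for a durable, fsynced write-ahead log.
type participant struct {
	log    []string
	states map[string]state
}

func newParticipant() *participant {
	return &participant{states: map[string]state{}}
}

// prepare runs the tentative work via canCommit (locks, constraint checks),
// then records PREPARED before voting YES. On failure it votes NO and may
// abort on its own, because the coordinator can no longer decide COMMIT.
func (p *participant) prepare(txID string, canCommit func() error) string {
	if err := canCommit(); err != nil {
		p.log = append(p.log, "ABORTED "+txID)
		p.states[txID] = stateAborted
		return "NO"
	}
	p.log = append(p.log, "PREPARED "+txID) // must be durable before replying
	p.states[txID] = statePrepared
	return "YES"
}

func main() {
	p := newParticipant()
	fmt.Println(p.prepare("tx1", func() error { return nil }))
	fmt.Println(p.prepare("tx2", func() error { return errDiskFull }))
}
```

The ordering is the point of the sketch: the PREPARED record is appended before the YES reply is returned, so a participant that crashes after voting can find its promise in the log when it restarts.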

Phase 2 — Decide (the broadcast)

If every vote is YES, the coordinator writes COMMIT to its log (this is the moment of decision) and sends COMMIT to every participant.
If any vote is NO, the coordinator writes ABORT and sends ABORT to every participant.
Each participant applies the decision permanently, releases locks, fsyncs the outcome, and replies ACK.
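
The decision rule itself is small enough to write as a pure function. The sketch below assumes votes arrive as a map from participant ID to "YES" or "NO"; the function name `decide` and the ID strings are illustrative only. A participant that never replied simply has no entry in the map and is treated like a NO.

```go
package main

import "fmt"

// decide applies the Phase 2 rule: COMMIT only if every expected
// participant voted YES. A NO vote or a missing vote means ABORT.
func decide(expected []string, votes map[string]string) string {
	for _, id := range expected {
		if votes[id] != "YES" { // a missing key yields "", which is not YES
			return "ABORT"
		}
	}
	return "COMMIT"
}

func main() {
	ps := []string{"P1", "P2", "P3"}
	fmt.Println(decide(ps, map[string]string{"P1": "YES", "P2": "YES", "P3": "YES"}))
	fmt.Println(decide(ps, map[string]string{"P1": "YES", "P2": "NO", "P3": "YES"}))
	fmt.Println(decide(ps, map[string]string{"P1": "YES", "P3": "YES"})) // P2 never replied
}
```

Iterating over the expected list rather than over the received votes is deliberate: it makes a silent participant count against the commit instead of being ignored.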

The blocking failure

If the coordinator crashes after collecting YES votes but before any participant has received the decision, the survivors are stuck. They can ask their peers, but the peers don't know the decision either: the coordinator's log is the single source of truth. (Had even one participant received COMMIT or ABORT, the others could learn the outcome from it.) They wait, holding locks, until the coordinator recovers (it reads its log and resumes). This is the famous blocking case, and a large part of the reason 3PC, Paxos-replicated coordinators, and saga patterns exist.
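
The peer query described above can be sketched as a function over the states the peers report. This is a simplified illustration: the name `resolveInDoubt` and the state strings are assumptions, and it covers only the two situations discussed here (a peer knows the outcome, or nobody does).

```go
package main

import "fmt"

// resolveInDoubt is what a PREPARED participant can conclude from its
// peers' reported states while the coordinator is unreachable. It returns
// the outcome to apply, or "BLOCKED" if no peer knows the decision.
func resolveInDoubt(peerStates []string) string {
	for _, s := range peerStates {
		if s == "COMMITTED" || s == "ABORTED" {
			return s // a peer already heard the decision, so adopt it
		}
	}
	return "BLOCKED" // nobody knows: keep the locks and wait for recovery
}

func main() {
	fmt.Println(resolveInDoubt([]string{"PREPARED", "COMMITTED"}))
	fmt.Println(resolveInDoubt([]string{"PREPARED", "PREPARED"}))
}
```

The second call is the blocking case: every reachable node has promised to commit, none has been told the outcome, and guessing in either direction could contradict the coordinator's log.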

The atomicity guarantee in one sentence

Once the coordinator writes COMMIT, every participant will eventually commit (even if they have to be reminded after recovery); once the coordinator writes ABORT, every participant will eventually abort. The protocol never leaves participants in disagreement once the coordinator has written its decision.

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.


About this simulation

Three participants and one coordinator running a 2PC transaction. Pick a scenario (Happy path, A participant votes NO, or Coordinator crashes), or jump into Free play and toggle votes yourself. Step through with Prev / Next / Auto / Restart; the message log below the prototype keeps only the last two lines so it never grows.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Walk the Happy path

Open the Happy path scenario and step through with Next. Watch the coordinator send PREPARE, all three participants reply YES, the coordinator write COMMIT, and every participant move to COMMITTED. Notice that the decision (COMMIT) is written before it's broadcast — that is the durable atomicity anchor.

try 02

Make a participant vote NO

Switch to the A participant votes NO scenario. One participant — say P2 — replies NO during the vote. The coordinator immediately writes ABORT and tells everyone to roll back. Notice that even the participants who voted YES still abort: the rule is unanimous YES required, not majority.

try 03

Crash the coordinator mid-decision

Run the Coordinator crashes scenario. Every participant has voted YES, then the coordinator dies before sending COMMIT. The participants are stuck — they have promised to commit if asked but cannot proceed without the coordinator's word. This is the blocking case, and the whole motivation for 3PC.

try 04

Free play — break it yourself

Open Free play and toggle the per-participant vote checkboxes before stepping through. Try every combination: one NO, two NOs, three YESes. Try crashing the coordinator at different points (after Prepare, after votes, after Commit). The protocol holds the same invariant every time — and the same blocking weakness every time.

In practice

When to use it — and what you give up

When it's the right tool

Cross-shard / cross-service transactions where you need true atomicity and the participants are known ahead of time — XA distributed transactions, database-internal cross-partition writes, SQL across multiple PostgreSQL shards.
Short transactions where the blocking case is rare and acceptable — milliseconds-long writes, intra-datacenter calls with reliable nodes.
You can replicate the coordinator with a real consensus protocol (Raft / Paxos) so its decision survives its crash. Spanner takes this approach, running 2PC across Paxos-replicated groups.
The simplest possible mental model is your priority — 2PC is what every operator already understands, and that has real value.

When to reach for something else

Long-running workflows across services (order → payment → shipping) — use a [saga](https://microservices.io/patterns/data/saga.html) with compensating actions instead. 2PC's lock-holding hurts.
You cannot tolerate the blocking case: replicate the coordinator (Raft), use 3PC (which avoids blocking only when failures can be detected reliably, so not under network partitions), or restructure as eventual consistency with reconciliation.
Replicated state machines and log replication — that's Paxos and Raft territory, not 2PC.
Byzantine fault model — 2PC assumes participants only crash, not lie. Use PBFT or similar.

Pros

Simplest possible atomic-commit protocol: two phases and four kinds of message (PREPARE, vote, decision, ACK); fits on a whiteboard.
Strong atomicity guarantee — every participant commits, or every participant aborts. No half-states once the coordinator decides.
Well-understood — XA, JTA, MS DTC, all the textbooks; battle-tested across decades of databases.
Cheap in the happy path — just two round-trips and no consensus quorum machinery.
Compositional — you can wrap any resource manager (DB, queue, file system) that exposes prepare/commit/abort.

Cons

Blocks on coordinator failure — the famous case: coordinator dies after collecting votes, participants hang holding locks.
Holds locks across both phases — long latency multiplies contention.
All-or-nothing fragility — one slow or dead participant stalls every other one.
Synchronous and chatty — every participant must be reachable for every transaction.
No fault tolerance on the decision itself — without external replication, the coordinator is a single point of failure.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

two_phase_commit.go

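// Coordinator-side 2PC. Each participant exposes Prepare/Commit/Abort
// and persists its decision durably before replying.
package twopc

import (
	"time"

	"golang.org/x/sync/errgroup"
)

type Vote string

const (
	VoteYes Vote = "YES"
	VoteNo  Vote = "NO"
)

type Participant interface {
	Prepare(txID string) (Vote, error)
	Commit(txID string) error
	Abort(txID string) error
}

// Log is the coordinator's durable log. Write must not return until the
// line is on stable storage.
type Log interface {
	Write(line string) error
}

func twoPhaseCommit(txID string, participants []Participant, log Log) string {
	if err := log.Write("BEGIN " + txID); err != nil {
		return "ABORTED" // nothing has been sent yet, so giving up is safe
	}

	// Phase 1: Prepare. Collect votes in parallel.
	votes := make([]Vote, len(participants))
	var g errgroup.Group
	for i, p := range participants {
		i, p := i, p // per-iteration copies (needed before Go 1.22)
		g.Go(func() error {
			v, err := p.Prepare(txID)
			if err != nil {
				return err
			}
			votes[i] = v // each goroutine writes only its own slot
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		// A participant failed to prepare: treat it as a NO vote.
		// A lost ABORT record is harmless, because recovery reads
		// "BEGIN without a decision" as abort anyway.
		_ = log.Write("ABORT " + txID)
		abortAll(participants, txID) // best-effort, errors ignored
		return "ABORTED"
	}

	// Phase 2: Decide.
	allYes := true
	for _, v := range votes {
		if v != VoteYes {
			allYes = false
			break
		}
	}
	if allYes {
		// The durable decision. It must be on disk before any COMMIT is sent.
		if err := log.Write("COMMIT " + txID); err != nil {
			// The write may or may not have reached disk, so neither outcome
			// is safe to send. Leave it to recovery to read the log.
			return "UNKNOWN"
		}
		// Retry forever; participants must eventually apply the commit.
		retryAll(participants, func(p Participant) error { return p.Commit(txID) })
		return "COMMITTED"
	}
	_ = log.Write("ABORT " + txID)
	retryAll(participants, func(p Participant) error { return p.Abort(txID) })
	return "ABORTED"
}

// abortAll sends Abort to every participant once and ignores errors.
// A participant that misses it learns the outcome when it asks later.
func abortAll(participants []Participant, txID string) {
	for _, p := range participants {
		_ = p.Abort(txID)
	}
}

// retryAll calls op on every participant until each call has succeeded.
// It assumes Commit and Abort are idempotent on the participant side.
func retryAll(participants []Participant, op func(Participant) error) {
	pending := participants
	for len(pending) > 0 {
		var failed []Participant
		for _, p := range pending {
			if err := op(p); err != nil {
				failed = append(failed, p)
			}
		}
		pending = failed
		if len(pending) > 0 {
			time.Sleep(100 * time.Millisecond) // fixed pause, illustrative only
		}
	}
}

// On recovery, the coordinator reads its log: if it sees COMMIT/ABORT
// for txID, it replays the broadcast. If it sees BEGIN but no decision,
// it aborts. That single-source-of-truth log is what makes 2PC atomic.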
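
The recovery rule in the closing comment can be sketched as a scan over the coordinator's log. This assumes the log is a sequence of lines in the "BEGIN tx", "COMMIT tx", "ABORT tx" format written by the listing; the function name `recoverOutcomes` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// recoverOutcomes scans the coordinator's log lines in order and returns,
// for each transaction, the decision to re-broadcast after a restart.
// A transaction with BEGIN but no logged decision is treated as aborted.
func recoverOutcomes(lines []string) map[string]string {
	outcomes := map[string]string{}
	for _, line := range lines {
		kind, txID, ok := strings.Cut(line, " ")
		if !ok {
			continue // skip malformed lines
		}
		switch kind {
		case "BEGIN":
			if _, seen := outcomes[txID]; !seen {
				outcomes[txID] = "ABORT" // default until a decision is found
			}
		case "COMMIT", "ABORT":
			outcomes[txID] = kind
		}
	}
	return outcomes
}

func main() {
	log := []string{"BEGIN t1", "COMMIT t1", "BEGIN t2"}
	out := recoverOutcomes(log)
	fmt.Println(out["t1"], out["t2"])
}
```

Here t1 is replayed as a commit because its decision reached the log, while t2 crashed between BEGIN and the decision and is therefore aborted.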


Knowledge check

Did the prototype land?


question 01 / 03

In 2PC, what happens if the coordinator crashes immediately after writing `COMMIT` to its own log but before sending any COMMIT message?

question 02 / 03

How many YES votes does the coordinator need before it can decide to commit?

question 03 / 03

Which production system avoids the 2PC blocking problem by replicating the coordinator itself?
