Lease-Based Election — Leader Election

Overview

What this concept solves

Lease-based leader election is what you do when you already have a strongly-consistent key-value store and you don't want to run another consensus cluster on top. The store holds one key — call it /leader — with a value (holder, epoch, expiresAt). To lead, a process acquires that key with a TTL (compare-and-swap from "empty/expired" to "me, epoch+1, now+TTL"). To stay leader, it renews the key well before TTL elapses. If renewals stop, the TTL runs out, and the next process to try wins the empty slot.

The store is what makes this safe: every acquire goes through a single linearizable compare-and-swap, so at most one process can ever hold the lease in a given moment of store time. The TTL bounds how long a crashed leader's lease lingers. The epoch (also called a fence token) is the monotonically increasing revision returned by each acquire; every write the leader does carries the epoch, and the data store rejects writes from stale epochs — so a paused-then-resumed old leader cannot quietly resume control after a new one was elected. That's the entire safety argument, in three primitives.

This pattern is everywhere: Kubernetes controller-manager and scheduler use it via the coordination.k8s.io/Lease API; etcd and Consul ship native session/lease primitives that frameworks like Patroni, Vitess, and Knative wrap; Redis users implement it with SET key val NX PX ttl (with Redlock as the multi-master variant); Google's Chubby paper popularised it as a coordination primitive in 2006. If your problem is "exactly one process at a time runs job X," lease-based election is almost always the right answer.

Mechanics

How it works

The three primitives

Acquire — atomic compare-and-swap: if /leader is empty or expired, set it to (me, epoch+1, now+TTL) and return the new revision; else fail.
Renew — periodic compare-and-swap: if /leader.holder == me, extend expiresAt to now+TTL. Done well before expiry — typically at TTL/3.
Release — voluntary: set /leader to empty so a successor can grab it immediately, no waiting for expiry.
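
The three primitives above can be sketched against an in-memory stand-in for the store. This is a minimal illustration, not any real store's API: `memStore`, its method names, and the millisecond timestamps passed in as arguments are all assumptions made for the example, and a mutex plays the role of the store's linearizable compare-and-swap. Two choices go slightly beyond the text and are also assumptions: Renew refuses an already-expired lease, and Release keeps the epoch so it never moves backwards.

```go
package main

import (
	"fmt"
	"sync"
)

// lease is the value stored under the single /leader key.
type lease struct {
	holder    string
	epoch     int64
	expiresAt int64 // store time, unix millis
}

// memStore is a hypothetical in-memory stand-in for a linearizable store.
// The mutex plays the role of the store's atomic compare-and-swap.
type memStore struct {
	mu  sync.Mutex
	cur lease // zero value means "never held"
}

// Acquire succeeds only if the lease is empty or expired; it bumps the epoch.
func (s *memStore) Acquire(me string, now, ttl int64) (int64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur.holder != "" && now < s.cur.expiresAt {
		return 0, false
	}
	s.cur = lease{holder: me, epoch: s.cur.epoch + 1, expiresAt: now + ttl}
	return s.cur.epoch, true
}

// Renew extends the expiry only for the current holder at the same epoch,
// and only while the lease is still live (an assumption of this sketch).
func (s *memStore) Renew(me string, epoch, now, ttl int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur.holder != me || s.cur.epoch != epoch || now >= s.cur.expiresAt {
		return false
	}
	s.cur.expiresAt = now + ttl
	return true
}

// Release clears the holder but keeps the epoch so it never goes backwards.
func (s *memStore) Release(me string, epoch int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur.holder != me || s.cur.epoch != epoch {
		return false
	}
	s.cur.holder = ""
	return true
}

func main() {
	s := &memStore{}
	e1, ok1 := s.Acquire("P1", 0, 15000)
	_, ok2 := s.Acquire("P2", 1000, 15000) // lease still live, so this fails
	okRenew := s.Renew("P1", e1, 5000, 15000)
	okRel := s.Release("P1", e1)
	e3, ok3 := s.Acquire("P2", 5001, 15000) // immediate handoff after release
	fmt.Println(e1, ok1, ok2, okRenew, okRel, e3, ok3)
}
```

Keeping the epoch across a release matters: if releasing reset the key to a blank value, the next acquire would start again at epoch 1 and the fence token would no longer be monotonic.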

Lifecycle

A process tries to acquire. If the store's /leader is empty or expired, the CAS succeeds; the process becomes leader and starts the renew loop.
Renewals fire every TTL/3. As long as the leader is alive and not partitioned, the lease never expires.
If the leader crashes, gets partitioned, or its host pauses, renewals stop. The lease runs down toward expiresAt. Once now >= expiresAt, the lease is treated as empty.
Other processes watch /leader (an etcd watch, Redis keyspace notifications, or a ZooKeeper watcher). Each tries to acquire once it sees the lease free; one wins the CAS and becomes leader with epoch + 1.
All subsequent writes by the leader carry the epoch. The data store checks the token on every write: stale epoch → reject. This is what stops a resurrected old leader from corrupting anything.

The clock-skew gotcha — and why fence tokens are non-negotiable

Martin Kleppmann's well-known 2016 critique of naive distributed locks is that a TTL alone is not safe. If the leader's process pauses (long GC, VM stall, suspended container) for longer than the TTL, a new leader is elected, but when the old leader resumes it knows none of this. It keeps writing to the data store as if it were still leader, and now there are two writers. Clock skew has the same effect: a leader whose clock runs slow believes its lease is still live after the store has expired it. The fix is a fence token (epoch / revision) carried on every write; the data store tracks the highest epoch it has seen and rejects writes with stale tokens. etcd's key revisions (create_revision / mod_revision) and ZooKeeper's zxid can serve this role; Spanner instead relies on TrueTime to keep leader leases disjoint. Don't ship lease election without fence-token enforcement on the data path.
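
The data-path check described above might look like the following. This is a minimal sketch under stated assumptions: `fencedStore`, `Write`, and `errStaleEpoch` are hypothetical names, and the store is an in-memory map rather than a real database. What it shows is the rule itself: remember the highest epoch accepted so far and refuse anything older.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errStaleEpoch = errors.New("stale fence token")

// fencedStore is a hypothetical data store that remembers the highest
// epoch it has accepted and rejects writes carrying an older one.
type fencedStore struct {
	mu       sync.Mutex
	maxEpoch int64
	data     map[string]string
}

func newFencedStore() *fencedStore {
	return &fencedStore{data: make(map[string]string)}
}

// Write applies the update only if epoch is at least the highest epoch seen.
func (s *fencedStore) Write(epoch int64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if epoch < s.maxEpoch {
		return fmt.Errorf("%w: got %d, current is %d", errStaleEpoch, epoch, s.maxEpoch)
	}
	s.maxEpoch = epoch
	s.data[key] = value
	return nil
}

func main() {
	db := newFencedStore()
	fmt.Println(db.Write(1, "job", "owned by P1")) // P1 leads at epoch 1
	fmt.Println(db.Write(2, "job", "owned by P3")) // P3 is elected at epoch 2
	fmt.Println(db.Write(1, "job", "late write"))  // P1 resumes after its pause
}
```

The third write returns an error wrapping `errStaleEpoch` and leaves the data untouched. The comparison and the write sit under one lock so they are atomic; a real store would need the same guarantee, for example a conditional update keyed on the stored epoch.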

What TTL should I pick?

Three quantities interact: TTL (how long a stale lease lingers), renew interval (how often the leader pings the store), and the failover time you can tolerate. A common rule of thumb: TTL = 10–30 s, renew at TTL/3 ≈ 3–10 s, and expect failover within TTL plus one renew interval, so up to ~40 s at the high end. A shorter TTL means faster failover but more store traffic and more false expirations under load. Kubernetes control-plane components default to a 15 s lease duration with a 10 s renew deadline and a 2 s retry period.
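
The arithmetic behind that rule of thumb can be written down in a few lines. This is only a sketch: the `timing` type and its field names are made up for illustration, and it assumes followers retry acquisition at the same period as the renew interval, which is what "TTL plus one renew interval" implies.

```go
package main

import "fmt"

// timing groups the values discussed above, in milliseconds.
// retryMs is the follower's acquire-retry period, assumed here
// to equal the renew interval.
type timing struct {
	ttlMs   int64
	retryMs int64
}

// renewIntervalMs follows the TTL/3 rule of thumb.
func (t timing) renewIntervalMs() int64 { return t.ttlMs / 3 }

// worstCaseFailoverMs covers the case where the leader crashes right after
// a renewal, so the lease lives a full TTL, and a follower notices up to one
// retry period later.
func (t timing) worstCaseFailoverMs() int64 { return t.ttlMs + t.retryMs }

func main() {
	for _, ttl := range []int64{10000, 15000, 30000} {
		cfg := timing{ttlMs: ttl, retryMs: ttl / 3}
		fmt.Printf("ttl=%dms renew=%dms worst-case failover=%dms\n",
			cfg.ttlMs, cfg.renewIntervalMs(), cfg.worstCaseFailoverMs())
	}
}
```

With a 30 s TTL this gives a 10 s renew interval and a 40 s worst case, matching the upper end of the range quoted above.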

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

About this simulation

Five processes contending for a single TTL-bounded lease in a shared store. Pick a scenario — Acquire & renew (the holder renews before each expiry to stay leader), Leader crash (renewals stop, the TTL drains to zero, another node takes over), or GC pause & fence (a paused leader resumes after the lease was reissued and gets rejected by the epoch token). Free play lets you crash any node and force the lease to expire; the log card below holds only the last two messages.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Walk Acquire & renew

Open Acquire & renew and step through. P1 grabs the empty lease (revision becomes 1). The TTL bar drains as time passes; before it empties, P1 renews and the bar refills. Followers see /leader = P1 and stay idle. Notice the cost: a constant heartbeat to the store regardless of activity. That's the price of using a single key as the election point.

try 02

Walk Leader crash

Run Leader crash. P1 dies. Renewals stop. The TTL bar drains to zero unattended; followers see that the lease is still technically held by P1, so they wait. The moment now >= expiresAt, the lease counts as empty and a follower (say P3) wins the next compare-and-swap with revision 2. Failover took up to one full TTL. This is the lease's defining latency cost.

try 03

Walk GC pause & fence

Run GC pause & fence. P1 holds revision 1; its process freezes (long GC pause). The TTL expires; P3 acquires with revision 2 and starts writing. P1 unfreezes and tries to write — with revision 1. The data store rejects the write (stale fence token, current is 2). P1 sees the error, steps down, and resets. Without the fence token, both processes would have written and corrupted state. This is the safety story.

try 04

Free play — break it yourself

Open Free play. Crash the leader at any TTL phase and watch the lease drain. Try voluntarily releasing the lease (instant handoff). Try crashing a follower — nothing happens, because only the leader's death matters. Try shortening the TTL slider: faster failover, more store traffic, more vulnerability to a brief pause being mistaken for a crash. Find the regime where your application would feel safe.

In practice

When to use it — and what you give up

When it's the right tool

Singleton job pattern — exactly-one cron, leader-only background worker, exactly-one controller. Kubernetes controllers, Spark drivers, scheduled compactions.
You already run a coordination service — etcd, Consul, ZooKeeper, or Redis. Lease election piggybacks on it for free; no new infrastructure.
You want simple application code — acquire returns a leadership token; the framework hides the renew loop and watch logic.
Multi-language clusters — your election logic lives in one shared service, not in every language's Raft port.

When to reach for something else

You don't yet have a strongly-consistent store — bootstrapping etcd just to host a lease is heavier than running Raft directly in your service.
Sub-second failover required — TTL has to outlive a brief pause, so failover is bounded by TTL. If you need <100 ms failover, Raft heartbeats are faster.
No fence-token enforcement on the data store — without it, lease election is not safe against GC pauses. Use a different scheme or add the fence token first.
Geo-replicated cluster on a flaky WAN — frequent partitions cause leadership thrash; consider per-region leaders or a more partition-aware design.

Pros

Trivial application code — three calls (acquire, renew, release) plus a fence token on writes.
Reuses an existing store — no new consensus cluster, no new operational surface.
Cross-language by construction — the store is the source of truth, application language is irrelevant.
Voluntary release is instant — graceful shutdown hands the lease off without waiting for TTL.
Fence token comes for free — most stores return a monotonically increasing revision on each acquire; carry it on every write.

Cons

Failover is bounded by TTL — a 15 s lease means up to 15 s of write unavailability after a crash.
Clock skew & process pauses are the failure mode — without a fence token, a paused leader can write after a new one is elected (Kleppmann's two-writers).
Tight coupling to the store — your availability is at most the store's availability.
Renewal traffic is constant — even a healthy leader hits the store every TTL/3.
Thundering-herd on expiry — many followers may simultaneously try to acquire when the lease frees, briefly spiking the store.
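
One common way to soften the thundering herd, not described above and offered only as an assumption about how you might mitigate it, is to add random jitter to each follower's acquire retry so they do not all hit the store at once. In this sketch, `jitteredDelayMs` and `newRNG` are hypothetical helpers, and the fixed seed exists only to make the demo reproducible.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRNG returns a seeded generator so runs of this sketch are reproducible.
// In a real follower, each process would seed differently.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// jitteredDelayMs returns baseMs plus a random extra in [0, spreadMs), so
// followers that saw the same expiry spread their acquire attempts out.
func jitteredDelayMs(baseMs, spreadMs int64, rng *rand.Rand) int64 {
	if spreadMs <= 0 {
		return baseMs
	}
	return baseMs + rng.Int63n(spreadMs)
}

func main() {
	rng := newRNG(42)
	for _, p := range []string{"P2", "P3", "P4", "P5"} {
		fmt.Printf("%s retries acquire in %d ms\n", p, jitteredDelayMs(200, 800, rng))
	}
}
```

The trade-off is that jitter adds to failover time, so the spread should stay small relative to the TTL.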

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

lease_election.go

// Lease-based leader election on top of an etcd-like store.
// The store provides linearizable CAS and returns a monotonically increasing revision.
package election

import (
	"errors"
	"time"
)

type Lease struct {
	Holder    string
	Epoch     int
	ExpiresAt int64 // unix millis
}

type CASResult struct {
	OK    bool
	Epoch int
}

type LeaseStore interface {
	Read() (*Lease, bool)
	// CAS: if current matches expected, write next and return new revision.
	CAS(expected *Lease, next Lease) CASResult
}

type LeaseLeader struct {
	me       string
	store    LeaseStore
	ttlMs    int64
	now      func() int64
	epoch    int
	isLeader bool
}

func NewLeaseLeader(me string, store LeaseStore, ttlMs int64, now func() int64) *LeaseLeader {
	return &LeaseLeader{me: me, store: store, ttlMs: ttlMs, now: now}
}

func (l *LeaseLeader) TryAcquire() bool {
	cur, present := l.store.Read()
	expired := !present || cur.ExpiresAt <= l.now()
	if !expired {
		return false
	}

	prevEpoch := 0
	var expected *Lease
	if present {
		prevEpoch = cur.Epoch
		expected = cur
	}
	r := l.store.CAS(expected, Lease{
		Holder:    l.me,
		Epoch:     prevEpoch + 1,
		ExpiresAt: l.now() + l.ttlMs,
	})
	if r.OK {
		l.epoch = r.Epoch
		l.isLeader = true
	}
	return r.OK
}

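// RenewLoop renews at TTL/3 and returns once leadership is lost: the lease is
// gone, another holder has taken it, or the renewing CAS fails.
// Note: isLeader is not synchronized; guard it (mutex or atomic) if
// LeaderWrite is called from another goroutine.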
func (l *LeaseLeader) RenewLoop() {
	for l.isLeader {
		time.Sleep(time.Duration(l.ttlMs/3) * time.Millisecond)
		cur, present := l.store.Read()
		if !present || cur.Holder != l.me {
			l.isLeader = false
			return
		}
		next := *cur
		next.ExpiresAt = l.now() + l.ttlMs
		if !l.store.CAS(cur, next).OK {
			l.isLeader = false
		}
	}
}

// LeaderWrite: every leader write must carry the epoch so stale leaders can't sneak through.
func (l *LeaseLeader) LeaderWrite(write func(epoch int) error) error {
	if !l.isLeader {
		return errors.New("not leader")
	}
	return write(l.epoch)
}

Knowledge check

Did the prototype land?

question 01 / 03

Why does lease-based election need a TTL on the lease?

question 02 / 03

A leader's process pauses (long GC) for longer than the TTL. The lease expires, a new leader is elected, then the old leader's process resumes. What stops the old leader from corrupting state?

question 03 / 03

What's the most common production setup for lease-based leader election?
