Load Balancing
Run more than one server and something has to decide which one handles each request. Nine algorithms, from a blind counter to capacity-and-load-aware routing — built up one signal at a time.
What is Load Balancing?
The 60-second primer
Load balancing is the practice of spreading incoming requests across a pool of servers so no single one becomes the bottleneck. The moment you run more than one instance of a service — for redundancy, for capacity, or both — something has to decide which instance handles each request. That decider is the load balancer, and the rule it follows is a load-balancing algorithm.
The contract looks trivial: a request arrives, pick a backend, forward it. The depth is entirely in the pick. A good choice keeps every server evenly utilized, tolerates servers of different sizes, and reacts when one slows down. A bad choice piles work onto an already-struggling node while its neighbors sit idle.
The algorithms divide into two families. Stateless strategies — round robin, weighted round robin, random — decide using only a counter or a coin flip, never looking at the servers. State-aware strategies — least connections, weighted least connections, and the response-time methods — track how busy each backend is and route to whoever is least loaded right now. Stateless is cheap and predictable; state-aware adapts but costs bookkeeping.
Why every scaled-out service needs it
- Horizontal scaling — the only way to serve more traffic than one machine can handle is to add machines and split the work. The splitter is the load balancer.
- High availability — when a backend dies, the balancer stops sending it traffic and the service stays up. No single instance is a single point of failure.
- Even utilization — without balancing, hot spots form: one server saturates and starts dropping requests while others idle, wasting the capacity you paid for.
- Heterogeneous fleets — real fleets mix machine sizes. Weighted algorithms send the big boxes proportionally more work so they all reach capacity together.
- Graceful degradation under load — state-aware algorithms notice when a backend slows (GC pause, cold cache, noisy neighbor) and steer around it before it falls over.
Where it lives
Load balancers sit at every tier: an L4/L7 appliance or cloud LB at the edge (AWS ALB/NLB, Envoy, NGINX, HAProxy), a service mesh sidecar between microservices, and a client-side balancer inside RPC libraries (gRPC, Finagle). The algorithm is the same idea at every layer — only the thing being balanced changes.
Side-by-side
How they compare
The same concepts, on the same axes. Use this as a map; the individual pages are the territory.
| Algorithm | Server state used | Handles uneven load | Per-request cost | Best for |
|---|---|---|---|---|
01Round Robin | None (a counter) | Poorly — assumes equal servers & equal requests | O(1) | Uniform fleets, uniform requests |
02Weighted Round Robin | Static weights | Handles uneven capacity, not uneven load | O(1) | Mixed machine sizes, predictable work |
03Random | None (a coin flip) | Like round robin in the limit; lumpier short-term | O(1), no shared state | Many distributed LBs with no coordination |
04Least Connections | Live active-connection count | Well — follows real-time busyness | O(N) counters | Long-lived or variable-duration requests |
05Weighted Least Connections | Active count ÷ weight | Capacity and live load together | O(N) + weights | Mixed fleets with variable request cost |
06Least Response Time | Active count × avg latency | Well — favors the servers answering fastest | O(N) + latency | Latency-sensitive, mixed-speed backends |
07Peak EWMA | Decaying avg latency × active | Very well — tracks recent slowdowns, recovers slowly | O(N) + α decay | Adaptive routing in noisy, real-world fleets |
08IP Hash | Hash of client identity | Doesn't — pins clients, not load | O(1) hash | Sticky sessions without a shared store |
09Power of Two Choices | Load of 2 random servers | Nearly as well as least-connections | O(1), ~stateless | Distributed / client-side balancing at scale |
- Server state used
- None (a counter)
- Handles uneven load
- Poorly — assumes equal servers & equal requests
- Per-request cost
O(1)- Best for
- Uniform fleets, uniform requests
- Server state used
- Static weights
- Handles uneven load
- Handles uneven capacity, not uneven load
- Per-request cost
O(1)- Best for
- Mixed machine sizes, predictable work
- Server state used
- None (a coin flip)
- Handles uneven load
- Like round robin in the limit; lumpier short-term
- Per-request cost
O(1), no shared state- Best for
- Many distributed LBs with no coordination
- Server state used
- Live active-connection count
- Handles uneven load
- Well — follows real-time busyness
- Per-request cost
O(N) counters- Best for
- Long-lived or variable-duration requests
- Server state used
- Active count ÷ weight
- Handles uneven load
- Capacity and live load together
- Per-request cost
O(N) + weights- Best for
- Mixed fleets with variable request cost
- Server state used
- Active count × avg latency
- Handles uneven load
- Well — favors the servers answering fastest
- Per-request cost
O(N) + latency- Best for
- Latency-sensitive, mixed-speed backends
- Server state used
- Decaying avg latency × active
- Handles uneven load
- Very well — tracks recent slowdowns, recovers slowly
- Per-request cost
O(N) + α decay- Best for
- Adaptive routing in noisy, real-world fleets
- Server state used
- Hash of client identity
- Handles uneven load
- Doesn't — pins clients, not load
- Per-request cost
O(1) hash- Best for
- Sticky sessions without a shared store
- Server state used
- Load of 2 random servers
- Handles uneven load
- Nearly as well as least-connections
- Per-request cost
O(1), ~stateless- Best for
- Distributed / client-side balancing at scale
Decision guide
Which one should you use?
A practical tour of when each algorithm wins.
Decision guide
- Identical servers, short and uniform requests → Round Robin. The simplest thing that works, and it works well here.
- Servers of different sizes, but request cost is predictable → Weighted Round Robin. Give each server a weight proportional to its capacity.
- Many independent load balancers that can't share state → Random. No coordination needed, and it matches round robin's fairness as volume grows.
- Requests vary wildly in duration (DB queries, uploads, websockets) → Least Connections. It tracks who is actually busy instead of assuming.
- *Both at once — mixed fleet and variable request cost → Weighted Least Connections*. It divides live load by capacity.
- Backends with genuinely different speeds, and latency matters → Least Response Time. Scores by
(active + 1) × average latency, routing around slow servers. - Adaptive routing in a noisy fleet → Peak EWMA. A decaying average of recent latency, tunable via α — the basis of modern service-mesh and RPC balancers.
- Sticky sessions with no shared store → IP Hash. Pins each client to a server by hashing its identity (reach for consistent hashing so churn doesn't reshuffle everyone).
- Near-least-connections quality with almost no state → Power of Two Choices. Probe two random servers and take the lighter — ideal for distributed or client-side balancers.
Start simple, escalate on evidence
Round robin is the right default. Reach for least-connections only when you can see uneven load in your metrics — long-tailed request durations or one backend running hot. Every step toward state-awareness buys adaptivity at the price of bookkeeping and, in distributed setups, stale-state hazards.
Concepts in this track
9 concepts, in order
Each links to a concept page with its own explanation, prototype, and quiz.
Round Robin
Deal requests out like cards — server 1, 2, 3, 4, repeat. Perfectly even counts, blind to everything else.
Weighted Round Robin
Round robin with capacity: give the big servers a higher weight and they take proportionally more turns.
Random
Flip a coin per request. Lumpy up close, perfectly even in the limit — and it needs no shared state at all.
Least Connections
Route to the shortest queue. The first state-aware method — it follows real-time busyness, not request counts.
Weighted Least Connections
Route by active ÷ weight. Capacity and live load in one number — the most adaptive of the classic set.
Least Response Time
Pick the server that's been answering fastest — connection count weighted by measured latency.
EWMA — Exponentially Weighted Moving Average
Smooth each server's recent latency into a decaying average and route to the lowest. The basis of modern adaptive LBs.
IP Hash
Hash the client's address to a server so the same client always lands on the same backend — sticky sessions, no shared store.
Power of Two Choices
Probe two servers at random, route to the less busy one. Almost stateless, almost as good as least connections.
Related tracks
If this one clicks, try these next.
Rate Limiting
Control request throughput so a noisy client cannot starve everyone else. Compare the five canonical algorithms side-by-side.
Cache Write Policies
Three ways to handle a write when you have a cache in front of the store. Each policy is a different bet about durability, throughput, and how stale your data is allowed to get.
Cache Eviction
When the cache fills up, something has to go — and which one you pick decides your hit rate. Ten classic policies, side-by-side.