Intermediate

Load Balancing

Run more than one server and something has to decide which one handles each request. Nine algorithms, from a blind counter to capacity-and-load-aware routing — built up one signal at a time.

distributedscalingthroughput

Start with Round Robin

What is Load Balancing?

The 60-second primer

Load balancing is the practice of spreading incoming requests across a pool of servers so no single one becomes the bottleneck. The moment you run more than one instance of a service — for redundancy, for capacity, or both — something has to decide which instance handles each request. That decider is the load balancer, and the rule it follows is a load-balancing algorithm.

The contract looks trivial: a request arrives, pick a backend, forward it. The depth is entirely in the pick. A good choice keeps every server evenly utilized, tolerates servers of different sizes, and reacts when one slows down. A bad choice piles work onto an already-struggling node while its neighbors sit idle.

The algorithms divide into two families. Stateless strategies — round robin, weighted round robin, random — decide using only a counter or a coin flip, never looking at the servers. State-aware strategies — least connections, weighted least connections, and the response-time methods — track how busy each backend is and route to whoever is least loaded right now. Stateless is cheap and predictable; state-aware adapts but costs bookkeeping.

Why every scaled-out service needs it

Horizontal scaling — the only way to serve more traffic than one machine can handle is to add machines and split the work. The splitter is the load balancer.
High availability — when a backend dies, the balancer stops sending it traffic and the service stays up. No single instance is a single point of failure.
Even utilization — without balancing, hot spots form: one server saturates and starts dropping requests while others idle, wasting the capacity you paid for.
Heterogeneous fleets — real fleets mix machine sizes. Weighted algorithms send the big boxes proportionally more work so they all reach capacity together.
Graceful degradation under load — state-aware algorithms notice when a backend slows (GC pause, cold cache, noisy neighbor) and steer around it before it falls over.

Where it lives

Load balancers sit at every tier: an L4/L7 appliance or cloud LB at the edge (AWS ALB/NLB, Envoy, NGINX, HAProxy), a service mesh sidecar between microservices, and a client-side balancer inside RPC libraries (gRPC, Finagle). The algorithm is the same idea at every layer — only the thing being balanced changes.

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

Algorithm	Server state used	Handles uneven load	Per-request cost	Best for
01Round Robin	None (a counter)	Poorly — assumes equal servers & equal requests	`O(1)`	Uniform fleets, uniform requests
02Weighted Round Robin	Static weights	Handles uneven capacity, not uneven load	`O(1)`	Mixed machine sizes, predictable work
03Random	None (a coin flip)	Like round robin in the limit; lumpier short-term	`O(1), no shared state`	Many distributed LBs with no coordination
04Least Connections	Live active-connection count	Well — follows real-time busyness	`O(N) counters`	Long-lived or variable-duration requests
05Weighted Least Connections	Active count ÷ weight	Capacity and live load together	`O(N) + weights`	Mixed fleets with variable request cost
06Least Response Time	Active count × avg latency	Well — favors the servers answering fastest	`O(N) + latency`	Latency-sensitive, mixed-speed backends
07Peak EWMA	Decaying avg latency × active	Very well — tracks recent slowdowns, recovers slowly	`O(N) + α decay`	Adaptive routing in noisy, real-world fleets
08IP Hash	Hash of client identity	Doesn't — pins clients, not load	`O(1) hash`	Sticky sessions without a shared store
09Power of Two Choices	Load of 2 random servers	Nearly as well as least-connections	`O(1), ~stateless`	Distributed / client-side balancing at scale

01Round Robin

Server state used: None (a counter)
Handles uneven load: Poorly — assumes equal servers & equal requests
Per-request cost: O(1)
Best for: Uniform fleets, uniform requests

02Weighted Round Robin

Server state used: Static weights
Handles uneven load: Handles uneven capacity, not uneven load
Per-request cost: O(1)
Best for: Mixed machine sizes, predictable work

03Random

Server state used: None (a coin flip)
Handles uneven load: Like round robin in the limit; lumpier short-term
Per-request cost: O(1), no shared state
Best for: Many distributed LBs with no coordination

04Least Connections

Server state used: Live active-connection count
Handles uneven load: Well — follows real-time busyness
Per-request cost: O(N) counters
Best for: Long-lived or variable-duration requests

05Weighted Least Connections

Server state used: Active count ÷ weight
Handles uneven load: Capacity and live load together
Per-request cost: O(N) + weights
Best for: Mixed fleets with variable request cost

06Least Response Time

Server state used: Active count × avg latency
Handles uneven load: Well — favors the servers answering fastest
Per-request cost: O(N) + latency
Best for: Latency-sensitive, mixed-speed backends

07Peak EWMA

Server state used: Decaying avg latency × active
Handles uneven load: Very well — tracks recent slowdowns, recovers slowly
Per-request cost: O(N) + α decay
Best for: Adaptive routing in noisy, real-world fleets

08IP Hash

Server state used: Hash of client identity
Handles uneven load: Doesn't — pins clients, not load
Per-request cost: O(1) hash
Best for: Sticky sessions without a shared store

09Power of Two Choices

Server state used: Load of 2 random servers
Handles uneven load: Nearly as well as least-connections
Per-request cost: O(1), ~stateless
Best for: Distributed / client-side balancing at scale

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.

Decision guide

Identical servers, short and uniform requests → Round Robin. The simplest thing that works, and it works well here.
Servers of different sizes, but request cost is predictable → Weighted Round Robin. Give each server a weight proportional to its capacity.
Many independent load balancers that can't share state → Random. No coordination needed, and it matches round robin's fairness as volume grows.
Requests vary wildly in duration (DB queries, uploads, websockets) → Least Connections. It tracks who is actually busy instead of assuming.
*Both at once — mixed fleet and variable request cost → Weighted Least Connections*. It divides live load by capacity.
Backends with genuinely different speeds, and latency matters → Least Response Time. Scores by (active + 1) × average latency, routing around slow servers.
Adaptive routing in a noisy fleet → Peak EWMA. A decaying average of recent latency, tunable via α — the basis of modern service-mesh and RPC balancers.
Sticky sessions with no shared store → IP Hash. Pins each client to a server by hashing its identity (reach for consistent hashing so churn doesn't reshuffle everyone).
Near-least-connections quality with almost no state → Power of Two Choices. Probe two random servers and take the lighter — ideal for distributed or client-side balancers.

Start simple, escalate on evidence

Round robin is the right default. Reach for least-connections only when you can see uneven load in your metrics — long-tailed request durations or one backend running hot. Every step toward state-awareness buys adaptivity at the price of bookkeeping and, in distributed setups, stale-state hazards.

Concepts in this track

9 concepts, in order

Each links to a concept page with its own explanation, prototype, and quiz.

Round Robin

Deal requests out like cards — server 1, 2, 3, 4, repeat. Perfectly even counts, blind to everything else.

Beginner8 mintry it

Weighted Round Robin

Round robin with capacity: give the big servers a higher weight and they take proportionally more turns.

Beginner9 mintry it

Random

Flip a coin per request. Lumpy up close, perfectly even in the limit — and it needs no shared state at all.

Beginner8 mintry it

Least Connections

Route to the shortest queue. The first state-aware method — it follows real-time busyness, not request counts.

Intermediate10 mintry it

Weighted Least Connections

Route by active ÷ weight. Capacity and live load in one number — the most adaptive of the classic set.

Intermediate10 mintry it

Least Response Time

Pick the server that's been answering fastest — connection count weighted by measured latency.

Intermediate10 mintry it

EWMA — Exponentially Weighted Moving Average

Smooth each server's recent latency into a decaying average and route to the lowest. The basis of modern adaptive LBs.

Advanced11 mintry it

IP Hash

Hash the client's address to a server so the same client always lands on the same backend — sticky sessions, no shared store.

Intermediate9 mintry it

Power of Two Choices

Probe two servers at random, route to the less busy one. Almost stateless, almost as good as least connections.

Advanced11 mintry it

Related tracks

If this one clicks, try these next.

Rate Limiting

Control request throughput so a noisy client cannot starve everyone else. Compare the five canonical algorithms side-by-side.

Intermediate5 concepts · 45 min

distributedapithroughput

Consistent Hashing

Map keys to servers so that adding or removing a server moves as few keys as possible. Five methods, from the classic hash ring to the table-based hashing inside modern network load balancers.

Intermediate5 concepts · 55 min

distributedscalingpatterns

Circuit Breaker

When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.

Intermediate6 concepts · 70 min

distributedresiliencepatterns