Intermediate

Load Balancing

Run more than one server and something has to decide which one handles each request. Nine algorithms, from a blind counter to capacity-and-load-aware routing — built up one signal at a time.


What is Load Balancing?

The 60-second primer

Load balancing is the practice of spreading incoming requests across a pool of servers so no single one becomes the bottleneck. The moment you run more than one instance of a service — for redundancy, for capacity, or both — something has to decide which instance handles each request. That decider is the load balancer, and the rule it follows is a load-balancing algorithm.

The contract looks trivial: a request arrives, pick a backend, forward it. The depth is entirely in the pick. A good choice keeps every server evenly utilized, tolerates servers of different sizes, and reacts when one slows down. A bad choice piles work onto an already-struggling node while its neighbors sit idle.

The algorithms divide into two families. Stateless strategies (round robin, weighted round robin, random, and IP hash) decide using only a counter, a coin flip, or a hash of the client, never looking at how busy the servers are. State-aware strategies (least connections, weighted least connections, the response-time methods, and power of two choices) track how busy each backend is and route toward whichever is least loaded right now. Stateless is cheap and predictable; state-aware adapts but costs bookkeeping.
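A minimal Python sketch of the counter-or-coin-flip strategies, assuming backends are identified by name strings and each balancer instance keeps its own state. The class names are hypothetical. For the weighted variant it uses smooth weighted round robin, the interleaving scheme NGINX uses for upstream weights; that is one way to implement weighting, not the only one.

```python
from __future__ import annotations

import itertools
import random
from typing import Sequence


class RoundRobin:
    """Cycle through backends in order; the only state is a counter."""

    def __init__(self, backends: Sequence[str]) -> None:
        self._backends = list(backends)
        self._counter = itertools.count()

    def pick(self) -> str:
        return self._backends[next(self._counter) % len(self._backends)]


class SmoothWeightedRoundRobin:
    """Smooth weighted round robin: every pick adds each backend's static
    weight to its running score, chooses the highest score, and subtracts
    the total weight from the winner so heavy backends are interleaved."""

    def __init__(self, weights: dict[str, int]) -> None:
        self._weights = dict(weights)
        self._current = {name: 0 for name in weights}
        self._total = sum(weights.values())

    def pick(self) -> str:
        for name, weight in self._weights.items():
            self._current[name] += weight
        winner = max(self._current, key=self._current.__getitem__)
        self._current[winner] -= self._total
        return winner


class RandomPicker:
    """Uniform random choice: nothing to share between balancer instances."""

    def __init__(self, backends: Sequence[str], rng: random.Random | None = None) -> None:
        self._backends = list(backends)
        self._rng = rng or random.Random()

    def pick(self) -> str:
        return self._rng.choice(self._backends)


rr = RoundRobin(["a", "b", "c"])
print([rr.pick() for _ in range(5)])   # ['a', 'b', 'c', 'a', 'b']

wrr = SmoothWeightedRoundRobin({"a": 5, "b": 1, "c": 1})
print([wrr.pick() for _ in range(7)])  # ['a', 'a', 'b', 'a', 'c', 'a', 'a']
```

The smooth variant spreads the heavy backend's five picks across the cycle instead of sending them back to back. It does an O(N) pass per pick; a precomputed weighted schedule can get back to O(1) at the cost of memory.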

Why every scaled-out service needs it

  • Horizontal scaling — the only way to serve more traffic than one machine can handle is to add machines and split the work. The splitter is the load balancer.
  • High availability — when a backend dies, the balancer stops sending it traffic and the service stays up. No single instance is a single point of failure.
  • Even utilization — without balancing, hot spots form: one server saturates and starts dropping requests while others idle, wasting the capacity you paid for.
  • Heterogeneous fleets — real fleets mix machine sizes. Weighted algorithms send the big boxes proportionally more work so they all reach capacity together.
  • Graceful degradation under load — state-aware algorithms notice when a backend slows (GC pause, cold cache, noisy neighbor) and steer around it before it falls over.
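The graceful-degradation bullet describes what Peak EWMA (algorithm 07 in the comparison below) is built for. The following is a simplified Python sketch, not any particular proxy's implementation. It assumes a fixed per-sample smoothing weight `alpha` (production balancers commonly decay by elapsed time instead), starts every estimate at 0 so untried backends get probed first, and uses hypothetical class and method names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    ewma_ms: float = 0.0  # peak-sensitive average latency; 0 = not yet observed
    active: int = 0       # requests currently in flight


class PeakEwma:
    """Simplified Peak EWMA: a slower sample replaces the estimate at once,
    faster samples are blended in with weight alpha (assumed per-sample)."""

    def __init__(self, names: list[str], alpha: float = 0.3) -> None:
        self.backends = {name: Backend(name) for name in names}
        self.alpha = alpha

    def observe(self, name: str, latency_ms: float) -> None:
        b = self.backends[name]
        if latency_ms > b.ewma_ms:
            b.ewma_ms = latency_ms  # peak: adopt the spike immediately
        else:
            b.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * b.ewma_ms

    def start(self, name: str) -> None:
        self.backends[name].active += 1

    def finish(self, name: str, latency_ms: float) -> None:
        self.backends[name].active -= 1
        self.observe(name, latency_ms)

    def pick(self) -> str:
        # Lowest (active + 1) * latency estimate wins.
        return min(self.backends.values(), key=lambda b: b.ewma_ms * (b.active + 1)).name


lb = PeakEwma(["a", "b"])
for _ in range(5):
    lb.observe("a", 20.0)
    lb.observe("b", 20.0)
lb.observe("b", 400.0)     # b stalls (say, a GC pause)
print(lb.pick())           # a
for _ in range(10):
    lb.observe("b", 20.0)  # b is fast again; its estimate decays toward 20
lb.start("a")              # a now has one request in flight
print(lb.pick())           # b
```

The spike pushes traffic away from `b` immediately, but `b` earns it back only gradually: after ten fast samples its estimate is still around 31 ms, so it wins again only once `a` picks up an in-flight request.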

Where it lives

Load balancers sit at every tier: at the edge as an L4 or L7 appliance, software proxy, or cloud load balancer (AWS ALB/NLB, Envoy, NGINX, HAProxy); between microservices as a service-mesh sidecar; and inside RPC libraries (gRPC, Finagle) as a client-side balancer. The algorithm is the same idea at every layer; only the thing being balanced changes.
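A client-side balancer usually knows only about the traffic it sends itself. Here is a minimal sketch of Power of Two Choices (algorithm 09 in the comparison below) under that assumption: the load signal is this client's own count of in-flight requests per backend. The class and method names are hypothetical.

```python
from __future__ import annotations

import random


class PowerOfTwoChoices:
    """P2C over in-flight request counts, as seen by one client."""

    def __init__(self, backends: list[str], rng: random.Random | None = None) -> None:
        if len(backends) < 2:
            raise ValueError("need at least two backends to compare")
        self.names = list(backends)
        self.active = {name: 0 for name in self.names}  # this client's in-flight requests
        self.rng = rng or random.Random()

    def pick(self) -> str:
        # Draw two distinct indices in O(1): i uniformly, then j from the rest.
        n = len(self.names)
        i = self.rng.randrange(n)
        j = self.rng.randrange(n - 1)
        if j >= i:
            j += 1
        a, b = self.names[i], self.names[j]
        return a if self.active[a] <= self.active[b] else b

    def start(self, name: str) -> None:
        self.active[name] += 1

    def finish(self, name: str) -> None:
        self.active[name] -= 1


lb = PowerOfTwoChoices(["s1", "s2", "s3", "s4"], random.Random(42))
for _ in range(8):
    lb.start(lb.pick())  # requests pile up without finishing
print(lb.active)
```

A backend that is strictly busier than every other always loses its comparison, so it cannot receive the next request, and each pick reads just two counters no matter how large the pool is.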

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

| # | Algorithm | Server state used | Handles uneven load | Per-request cost | Best for |
|---|---|---|---|---|---|
| 01 | Round Robin | None (a counter) | Poorly: assumes equal servers and equal requests | O(1) | Uniform fleets, uniform requests |
| 02 | Weighted Round Robin | Static weights | Handles uneven capacity, not uneven load | O(1) | Mixed machine sizes, predictable work |
| 03 | Random | None (a coin flip) | Like round robin in the limit; lumpier short-term | O(1), no shared state | Many distributed LBs with no coordination |
| 04 | Least Connections | Live active-connection count | Well: follows real-time busyness | O(N) scan of counters | Long-lived or variable-duration requests |
| 05 | Weighted Least Connections | Active count ÷ weight | Capacity and live load together | O(N) scan + weights | Mixed fleets with variable request cost |
| 06 | Least Response Time | (Active + 1) × avg latency | Well: favors the servers answering fastest | O(N) scan + latency tracking | Latency-sensitive, mixed-speed backends |
| 07 | Peak EWMA | Decaying avg latency × (active + 1) | Very well: reacts to slowdowns at once, recovers gradually | O(N) scan + α decay | Adaptive routing in noisy, real-world fleets |
| 08 | IP Hash | Hash of client identity | Doesn't: pins clients, not load | O(1) hash | Sticky sessions without a shared store |
| 09 | Power of Two Choices | Load of 2 random servers | Nearly as well as least connections | O(1): reads only two counters | Distributed / client-side balancing at scale |
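Rows 04 to 06 share one shape: scan every backend and take the lowest score; only the score changes. Here is a minimal sketch, assuming in-flight counts and average latencies are maintained elsewhere (for example, incremented on dispatch and decremented on completion). `Server`, `pick`, and the score function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Server:
    name: str
    weight: int = 1               # static capacity weight (illustrative units)
    active: int = 0               # requests currently in flight
    avg_latency_ms: float = 1.0   # running average response time, updated elsewhere


def least_connections(s: Server) -> float:
    return s.active


def weighted_least_connections(s: Server) -> float:
    return s.active / s.weight    # live load relative to capacity


def least_response_time(s: Server) -> float:
    # The +1 means idle servers are still ranked by how fast they answer.
    return (s.active + 1) * s.avg_latency_ms


def pick(servers: Iterable[Server], score: Callable[[Server], float]) -> Server:
    """O(N) scan: return the server with the lowest score (first one wins ties)."""
    return min(servers, key=score)


fleet = [
    Server("big", weight=4, active=6, avg_latency_ms=20.0),
    Server("small", weight=1, active=2, avg_latency_ms=50.0),
]
for score in (least_connections, weighted_least_connections, least_response_time):
    print(score.__name__, "->", pick(fleet, score).name)
# least_connections -> small
# weighted_least_connections -> big
# least_response_time -> big
```

On the same snapshot the three scores disagree: raw connection counts favor the small server, while dividing by weight or multiplying by latency credits the bigger, faster one.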

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.


  • Identical servers, short and uniform requests: Round Robin. The simplest thing that works, and it works well here.
  • Servers of different sizes, but predictable request cost: Weighted Round Robin. Give each server a weight proportional to its capacity.
  • Many independent load balancers that can't share state: Random. No coordination needed, and it matches round robin's fairness as volume grows.
  • Requests that vary wildly in duration (DB queries, uploads, websockets): Least Connections. It tracks who is actually busy instead of assuming.
  • Both at once, a mixed fleet and variable request cost: Weighted Least Connections. It divides live load by capacity.
  • Backends with genuinely different speeds, where latency matters: Least Response Time. It scores each server by (active + 1) × average latency, routing around slow ones.
  • Adaptive routing in a noisy fleet: Peak EWMA. A decaying average of recent latency that jumps immediately on a spike and decays at a rate set by α; variants of it are used in service-mesh and RPC balancers.
  • Sticky sessions with no shared store: IP Hash. It pins each client to a server by hashing its identity; use consistent hashing so adding or removing a server doesn't reshuffle every client.
  • Near-least-connections quality with almost no state: Power of Two Choices. Probe two random servers and take the lighter one, which makes it a good fit for distributed or client-side balancers.
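The consistent-hashing advice in the IP Hash entry is easiest to see by counting how many clients change servers when a fourth server joins. Here is a minimal sketch, assuming the client IP string is the hash key, SHA-256 as an arbitrary stable hash, and 100 virtual nodes per server as an illustrative value. All function and class names are hypothetical.

```python
from __future__ import annotations

import bisect
import hashlib


def stable_hash(key: str) -> int:
    # Built-in hash() of a str is randomized per process, so use a fixed digest.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_pick(client_ip: str, servers: list[str]) -> str:
    """Plain IP hash: hash mod N. Changing N remaps most clients."""
    return servers[stable_hash(client_ip) % len(servers)]


class HashRing:
    """Consistent hashing: each server owns `vnodes` points on a ring, and a
    client goes to the first server point clockwise from its own hash."""

    def __init__(self, servers: list[str], vnodes: int = 100) -> None:
        self._ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, client_ip: str) -> str:
        # Wrap around to the first point when the hash is past the last one.
        idx = bisect.bisect(self._points, stable_hash(client_ip)) % len(self._ring)
        return self._ring[idx][1]


clients = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
before, after = ["s1", "s2", "s3"], ["s1", "s2", "s3", "s4"]

moved_mod = sum(modulo_pick(c, before) != modulo_pick(c, after) for c in clients)
ring_before, ring_after = HashRing(before), HashRing(after)
moved_ring = [c for c in clients if ring_before.pick(c) != ring_after.pick(c)]

print(f"modulo: {moved_mod / len(clients):.0%} of clients moved")
print(f"ring:   {len(moved_ring) / len(clients):.0%} of clients moved")
```

With a uniform hash, modulo keeps a client's server only when h mod 3 equals h mod 4. That holds for 3 of every 12 residues, so about three quarters of clients move. On the ring, a client moves only if one of s4's new points now sits between it and its old server, so every moved client lands on s4, roughly a quarter of them here.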

Start simple, escalate on evidence

Round robin is the right default. Reach for least-connections only when you can see uneven load in your metrics — long-tailed request durations or one backend running hot. Every step toward state-awareness buys adaptivity at the price of bookkeeping and, in distributed setups, stale-state hazards.
