Least Response Time — Load Balancing

Overview

What this concept solves

Least connections assumes that one active connection is as costly as any other. But a connection to a server that answers in 50ms is nothing like a connection to one limping along at 2 seconds. Least response time fixes this by folding measured latency into the decision: it scores each server by how busy it is and how fast it has been answering, then routes to the lowest score.

The score the prototype shows is a common form of the idea: (active + 1) × average_response_time. NGINX Plus's least_time method draws on the same two signals, average response time and active connections. The +1 accounts for the request you're about to add. Multiply by the server's recent average latency and you get a rough estimate of how long a new request would wait there. Pick the smallest estimate: that's least response time.

Mechanics

How it works

Score = (active + 1) × avg_rt

Two signals combine into one number per server:

Active connections — the live queue depth, exactly like least connections. The +1 models the incoming request.
Average response time — a running average of how long this server's recent requests took to complete.

Their product approximates the expected latency a new request would see: a short queue in front of a fast server scores low and wins; a short queue in front of a slow server scores higher and gets skipped. The prototype renders this as a live equation per server and flags the winner (← lowest, will be picked) and the laggard (avoided).
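
To make the arithmetic concrete, here is a small Go sketch that scores three servers and picks the lowest. The server struct, the score and pick helpers, and the sample numbers are illustrative assumptions, not code or data from the prototype.

```go
package main

import "fmt"

// server is a hypothetical snapshot of one backend's state.
type server struct {
	name    string
	active  int     // requests currently in flight
	avgRtMs float64 // recent average response time, in milliseconds
}

// score estimates the wait a new request would see: (active + 1) × avg_rt.
func score(s server) float64 {
	return float64(s.active+1) * s.avgRtMs
}

// pick returns the server with the lowest score; the first one wins ties.
// It assumes servers is non-empty.
func pick(servers []server) server {
	best := servers[0]
	for _, s := range servers[1:] {
		if score(s) < score(best) {
			best = s
		}
	}
	return best
}

func main() {
	fleet := []server{
		{"S1", 3, 50},  // busy but fast: (3 + 1) × 50 = 200
		{"S2", 1, 80},  // short queue, medium speed: (1 + 1) × 80 = 160
		{"S3", 0, 400}, // idle but slow: (0 + 1) × 400 = 400
	}
	for _, s := range fleet {
		fmt.Printf("%s: (%d + 1) × %.0f = %.0f\n", s.name, s.active, s.avgRtMs, score(s))
	}
	fmt.Println("picked:", pick(fleet).name)
}
```

With these numbers S2 wins. Plain least connections would have sent the request to S3, the idle server; the latency term is what steers it away.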

Where the response time comes from

The average is updated each time a request completes. A plain running mean works, but it reacts slowly and never forgets — a server that was briefly slow an hour ago still carries that weight. That lag is the motivation for the next concept, EWMA, which decays old samples exponentially so the score tracks recent behavior. (This prototype already uses a light exponential update under the hood; EWMA makes the decay rate an explicit, tunable knob.)
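
The gap between "never forgets" and "decays" shows up in a few lines of Go. This is only a sketch: runningMean, ewma, the smoothing factor alpha = 0.3, and the sample latencies are assumed for illustration and are not the prototype's actual update rule.

```go
package main

import "fmt"

// runningMean folds one new sample into a plain mean over n earlier samples.
func runningMean(avg float64, n int, sample float64) float64 {
	return (avg*float64(n) + sample) / float64(n+1)
}

// ewma blends the new sample with the old average.
// alpha in (0, 1] controls how quickly old samples fade.
func ewma(avg, sample, alpha float64) float64 {
	return alpha*sample + (1-alpha)*avg
}

func main() {
	const alpha = 0.3 // assumed smoothing factor, chosen for illustration

	// A server that is slow (2000 ms) for 10 requests, then recovers to 50 ms.
	samples := make([]float64, 0, 30)
	for i := 0; i < 10; i++ {
		samples = append(samples, 2000)
	}
	for i := 0; i < 20; i++ {
		samples = append(samples, 50)
	}

	mean := 0.0
	exp := samples[0] // seed the EWMA with the first sample
	for i, s := range samples {
		mean = runningMean(mean, i, s)
		if i > 0 {
			exp = ewma(exp, s, alpha)
		}
	}
	fmt.Printf("running mean: %.0f ms\n", mean)
	fmt.Printf("EWMA:         %.0f ms\n", exp)
}
```

After the recovery, the running mean still sits around 700 ms because the ten slow samples carry full weight, while the exponentially weighted value has settled within a couple of milliseconds of 50 ms.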

Latency is a lagging signal

Response time only updates when requests finish. A server that just fell off a cliff still looks fast until its slow responses start landing — so least response time reacts a beat late to sudden degradation. The active-connection term helps: a stalling server accumulates active connections immediately, nudging its score up before the latency catches up.
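
The sketch below illustrates that effect with made-up numbers. It assumes a hypothetical server whose measured average stays at 50 ms because nothing completes during the stall, and a healthy peer whose score is held fixed at 300 for simplicity.

```go
package main

import "fmt"

// score is the formula from this section: (active + 1) × avg_rt.
func score(active int, avgRtMs float64) float64 {
	return float64(active+1) * avgRtMs
}

func main() {
	// The stalled server's average is frozen: no request finishes,
	// so no new latency sample ever arrives.
	stalledAvgRtMs := 50.0

	// A slower but healthy peer: (2 + 1) × 100 = 300, held constant here.
	peer := score(2, 100)

	// Requests keep piling up on the stalled server two at a time.
	for active := 0; active <= 8; active += 2 {
		s := score(active, stalledAvgRtMs)
		verdict := "still picked"
		if s > peer {
			verdict = "now avoided"
		}
		fmt.Printf("stalled: active=%d score=%.0f peer=%.0f -> %s\n", active, s, peer, verdict)
	}
}
```

The stale 50 ms average never changes, yet the score crosses the peer's once six requests are stuck in flight. The connection count does the work until real latency samples arrive.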

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

About this simulation

Least connections, upgraded to also weigh how fast each server answers. The score row spells out the math live — (active + 1) × avg_rt — and the lowest score wins. Crank Server 3's latency multiplier up and watch the LB route around it even when its connection count is low.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Read the score equation

Hit 'Auto' and watch the score grid: each row shows (active + 1) × avg_rt = score, and the lowest is tagged ← lowest, will be picked. Note Server 3 starts with a ×3 latency multiplier, so its avg_rt climbs higher than the others — and its row is the one most often marked avoided.

try 02

Make a server slow on purpose

Use the 'S1 latency' stepper to push Server 1 up to ×4. Its connection line thins (it's getting less work) and its score climbs. Even when S1 has the fewest active connections, the LB skips it because the latency term dominates — that's the whole difference from plain least connections.

try 03

Flatten the fleet

Set every server's latency multiplier back to ×1. Now avg_rt is roughly equal everywhere, so the score is roughly proportional to (active + 1), and the algorithm ranks servers almost exactly the way Least Connections does. Watch the 'RT spread' stat shrink toward zero as the fleet evens out.

In practice

When to use it — and what you give up

When to reach for it

Backends with genuinely different speeds — mixed hardware, mixed locations, or services where some instances are cold/warm.
Latency-sensitive traffic — user-facing request paths where p99 matters more than raw even distribution.
When connection count alone misleads — a server can hold few connections yet be slow; latency-awareness catches that.
A balancer that can observe response times — you need completion timing, which an L7 proxy or smart client has but a dumb L4 forwarder may not.

Real-world example

NGINX Plus offers least_time (measured to the response header or to the last byte), and some service meshes ship a latency-aware policy by default; Linkerd, for example, balances on an EWMA of response latency. It's the natural next step once you have response-time telemetry and least-connections is no longer enough.

Pros

Accounts for server speed, not just queue depth — routes around slow backends automatically.
Directly targets the thing users feel: expected latency.
Degrades gracefully — the active-connection term reacts instantly even before latency updates.
A small, cheap formula on top of the counters least connections already keeps.

Cons

Response time is a lagging signal; reaction to sudden slowdowns is delayed.
A plain running average never forgets — stale slow samples linger (EWMA fixes this).
Requires measuring per-request completion time, which dumb L4 balancers can't do.
All the distributed-state hazards of least connections still apply.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

least_response_time.go

package lb

// Least response time: score = (active + 1) * avgRtSeconds, lowest wins.
// Not safe for concurrent use; callers must serialize Acquire and Release.
type Backend struct {
	id      string
	active  int     // requests currently in flight
	avgRtMs float64 // running mean of observed response times, in ms
	samples int     // completed requests folded into avgRtMs
}

const defaultRtMs = 1500.0 // used until a server has answered once

type LeastResponseTimeBalancer struct {
	backends []*Backend
}

func NewLeastResponseTimeBalancer(ids []string) *LeastResponseTimeBalancer {
	bs := make([]*Backend, len(ids))
	for i, id := range ids {
		bs[i] = &Backend{id: id}
	}
	return &LeastResponseTimeBalancer{backends: bs}
}

// score estimates how long a new request would wait on b, in seconds.
func (lb *LeastResponseTimeBalancer) score(b *Backend) float64 {
	rt := b.avgRtMs
	if b.samples == 0 {
		rt = defaultRtMs // no measurements yet
	}
	return float64(b.active+1) * (rt / 1000)
}

// Acquire picks the lowest-scoring backend and reserves a slot on it.
// It returns nil if the balancer has no backends.
func (lb *LeastResponseTimeBalancer) Acquire() *Backend {
	if len(lb.backends) == 0 {
		return nil
	}
	best, bestScore := lb.backends[0], lb.score(lb.backends[0])
	for _, b := range lb.backends[1:] {
		if s := lb.score(b); s < bestScore {
			best, bestScore = b, s
		}
	}
	best.active++ // reserve immediately, like least-conn
	return best
}

// Release is called on completion with the observed round-trip time.
func (lb *LeastResponseTimeBalancer) Release(b *Backend, rtMs float64) {
	if b.active > 0 { // guard against a double Release
		b.active--
	}
	// running mean: simple, but slow to forget (EWMA improves on this)
	b.avgRtMs = (b.avgRtMs*float64(b.samples) + rtMs) / float64(b.samples+1)
	b.samples++
}

Knowledge check

Did the prototype land?

question 01 / 03

How does least response time score each server?

question 02 / 03

A server has the fewest active connections but the highest average response time. What does least response time do?

question 03 / 03

Why is a plain running average of response time a weakness, and what addresses it?
