Overview
What this concept solves
Least connections assumes that one active connection is as costly as any other. But a connection to a server that answers in 50ms is nothing like a connection to one limping along at 2 seconds. Least response time fixes this by folding measured latency into the decision: it scores each server by how busy it is and how fast it has been answering, then routes to the lowest score.
The score the prototype shows is a common form of the idea behind NGINX Plus's least_time and similar methods: (active + 1) × average_response_time. The +1 accounts for the request you're about to add. Multiply by the server's recent average latency and you get an estimate of how long a new request would actually wait there. Pick the smallest estimate: that's least response time.
Mechanics
How it works
Score = (active + 1) × avg_rt
Two signals combine into one number per server:
- Active connections: the live queue depth, exactly like least connections. The +1 models the incoming request.
- Average response time: a running average of how long this server's recent requests took to complete.
Their product approximates the expected latency a new request would see: a short queue in front of a fast server scores low and wins; a short queue in front of a slow server scores higher and gets skipped. The prototype renders this as a live equation per server and flags the winner (← lowest, will be picked) and the laggard (avoided).
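To make the product concrete, here is a minimal sketch that scores three hypothetical servers and picks the lowest. The `ServerStats` type, the `score` and `pickLowest` helpers, and every number are illustrative assumptions, not the prototype's own code.

```typescript
// Hypothetical per-server snapshot; names and numbers are made up for illustration.
type ServerStats = { id: string; active: number; avgRtMs: number };

// score = (active + 1) × avg_rt, expressed here in milliseconds.
function score(s: ServerStats): number {
  return (s.active + 1) * s.avgRtMs;
}

// Return the server with the lowest score (the first one wins ties).
function pickLowest(servers: ServerStats[]): ServerStats {
  if (servers.length === 0) throw new Error("no servers to choose from");
  return servers.reduce((best, s) => (score(s) < score(best) ? s : best));
}

const fleet: ServerStats[] = [
  { id: "s1", active: 3, avgRtMs: 50 },   // busy but fast: (3 + 1) × 50   = 200
  { id: "s2", active: 1, avgRtMs: 120 },  // moderate:      (1 + 1) × 120  = 240
  { id: "s3", active: 0, avgRtMs: 2000 }, // idle but slow: (0 + 1) × 2000 = 2000
];

const winner = pickLowest(fleet);
console.log(winner.id, score(winner)); // s1 200
```

Plain least connections would send this request to s3, which has zero active connections. The latency term sends it to s1 instead.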
Where the response time comes from
The average is updated each time a request completes. A plain running mean works, but it reacts slowly and never forgets — a server that was briefly slow an hour ago still carries that weight. That lag is the motivation for the next concept, EWMA, which decays old samples exponentially so the score tracks recent behavior. (This prototype already uses a light exponential update under the hood; EWMA makes the decay rate an explicit, tunable knob.)
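The difference between the two update rules can be sketched as follows. This is an illustration under assumed values: the helper names, the sample history, and the decay factor `alpha = 0.2` are hypothetical choices, not what the prototype uses.

```typescript
// Plain running mean: every sample ever seen carries equal weight.
function runningMean(prevAvg: number, count: number, sample: number): number {
  return (prevAvg * count + sample) / (count + 1);
}

// Exponential update: the new sample gets weight alpha, older history decays.
function exponentialUpdate(prevAvg: number, sample: number, alpha: number): number {
  return alpha * sample + (1 - alpha) * prevAvg;
}

// Hypothetical history: 20 slow responses (2000 ms), then 20 fast ones (50 ms).
const samples: number[] = [
  ...new Array<number>(20).fill(2000),
  ...new Array<number>(20).fill(50),
];

let mean = 0;
let ewma = 0;
samples.forEach((rt, i) => {
  mean = runningMean(mean, i, rt);
  // Seed the exponential average with the first sample, then decay.
  ewma = i === 0 ? rt : exponentialUpdate(ewma, rt, 0.2);
});

console.log(mean.toFixed(0), ewma.toFixed(0));
```

After the server recovers, the running mean still sits near the midpoint of the whole history (about 1025 ms), while the exponential average has moved most of the way back toward 50 ms.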
Latency is a lagging signal
Response time only updates when requests finish. A server that just fell off a cliff still looks fast until its slow responses start landing — so least response time reacts a beat late to sudden degradation. The active-connection term helps: a stalling server accumulates active connections immediately, nudging its score up before the latency catches up.
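A small sketch of that effect, using made-up numbers: a hypothetical server stalls, so no request completes and its average never receives a new sample, yet its score still climbs as connections pile up. The `Snapshot` type and `scoreOf` helper are assumptions for illustration.

```typescript
type Snapshot = { active: number; avgRtMs: number };

const scoreOf = (s: Snapshot): number => (s.active + 1) * s.avgRtMs;

// A server that was healthy (avg 50 ms) and has just stalled: requests keep
// being assigned, none finish, so avgRtMs stays frozen at its old value.
const stalled: Snapshot = { active: 0, avgRtMs: 50 };
const healthyPeer: Snapshot = { active: 2, avgRtMs: 60 }; // steady score: 3 × 60 = 180

const history: number[] = [];
for (let sent = 0; sent < 5; sent++) {
  history.push(scoreOf(stalled)); // score seen just before each assignment
  stalled.active++;               // request assigned, but it never completes
}

console.log(history);              // [ 50, 100, 150, 200, 250 ]
console.log(scoreOf(healthyPeer)); // 180
```

With three requests stuck, the stalled server's score (200) already exceeds its peer's (180), even though its recorded average is still 50 ms. The balancer still sent it three requests before noticing, which is the "beat late" cost.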
Interactive prototype
Run it. Break it. Tune it.
About this simulation
Least connections, upgraded to also weigh how fast each server answers. The score row spells out the math live — (active + 1) × avg_rt — and the lowest score wins. Crank Server 3's latency multiplier up and watch the LB route around it even when its connection count is low.
Hands-on
Try these on your own
Open the prototype above, run each experiment, predict the answer, then verify.
Read the score equation
Hit 'Auto' and watch the score grid: each row shows (active + 1) × avg_rt = score, and the lowest is tagged ← lowest, will be picked. Note Server 3 starts with a ×3 latency multiplier, so its avg_rt climbs higher than the others — and its row is the one most often marked avoided.
Make a server slow on purpose
Use the 'S1 latency' stepper to push Server 1 up to ×4. Its connection line thins (it's getting less work) and its score climbs. Even when S1 has the fewest active connections, the LB skips it because the latency term dominates — that's the whole difference from plain least connections.
Flatten the fleet
Set every server's latency multiplier back to ×1. Now avg_rt is roughly equal everywhere, so the score collapses to (active + 1) — and the algorithm behaves exactly like Least Connections. Watch the 'RT spread' stat shrink toward zero as the fleet evens out.
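The collapse to least connections can be checked with a short sketch. The `Srv` type, the `minBy` helper, and the fleet values are hypothetical; the point is only that with identical avg_rt the two orderings agree.

```typescript
type Srv = { id: string; active: number; avgRtMs: number };

// Id of the server with the smallest key (first wins ties).
// Assumes a non-empty list.
function minBy(servers: Srv[], key: (s: Srv) => number): string {
  let best = servers[0];
  for (const s of servers) {
    if (key(s) < key(best)) best = s;
  }
  return best.id;
}

const byScore = (s: Srv): number => (s.active + 1) * s.avgRtMs;
const byConnections = (s: Srv): number => s.active;

// Hypothetical fleet with identical latency (every multiplier back at ×1).
const flat: Srv[] = [
  { id: "s1", active: 4, avgRtMs: 100 },
  { id: "s2", active: 1, avgRtMs: 100 },
  { id: "s3", active: 2, avgRtMs: 100 },
];

console.log(minBy(flat, byScore), minBy(flat, byConnections)); // s2 s2
```

Multiplying every (active + 1) by the same constant does not change which one is smallest, so the pick matches least connections.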
In practice
When to use it — and what you give up
When to reach for it
- Backends with genuinely different speeds — mixed hardware, mixed locations, or services where some instances are cold/warm.
- Latency-sensitive traffic — user-facing request paths where p99 matters more than raw even distribution.
- When connection count alone misleads — a server can hold few connections yet be slow; latency-awareness catches that.
- A balancer that can observe response times — you need completion timing, which an L7 proxy or smart client has but a dumb L4 forwarder may not.
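The last point, observing completion time, might look like the sketch below in a proxy or smart client. `TimedBalancer` and `withTiming` are hypothetical names introduced here, and using `Date.now()` for timing is an assumption; a monotonic clock would be preferable where one is available.

```typescript
// Minimal hypothetical interface: anything that hands out a backend and
// accepts the observed response time when the request completes.
interface TimedBalancer<B> {
  acquire(): B;
  release(backend: B, rtMs: number): void;
}

// Run one request against the chosen backend and report how long it took.
// The timing is reported on failure as well, so the connection count is
// always released and slow errors still raise the average.
async function withTiming<B, R>(
  lb: TimedBalancer<B>,
  send: (backend: B) => Promise<R>,
): Promise<R> {
  const backend = lb.acquire();
  const start = Date.now();
  try {
    return await send(backend);
  } finally {
    lb.release(backend, Date.now() - start);
  }
}
```

Whether failed requests should feed the average is a design choice: counting them keeps the bookkeeping simple, but a backend that fails fast would then look attractive, so some implementations may prefer to exclude or penalize errors.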
Real-world example
NGINX Plus offers least_time (measured to the response header or to the last byte), and some service meshes, such as Linkerd, default to a latency-aware policy. It's the natural next step once you have response-time telemetry and least connections is no longer enough.
Pros
- Accounts for server speed, not just queue depth — routes around slow backends automatically.
- Directly targets the thing users feel: expected latency.
- Degrades gracefully — the active-connection term reacts instantly even before latency updates.
- A small, cheap formula on top of the counters least connections already keeps.
Cons
- Response time is a lagging signal; reaction to sudden slowdowns is delayed.
- A plain running average never forgets — stale slow samples linger (EWMA fixes this).
- Requires measuring per-request completion time, which dumb L4 balancers can't do.
- All the distributed-state hazards of least connections still apply.
Reference
Code & further reading
A minimal reference implementation and pointers worth bookmarking.
// Least response time: score = (active + 1) * avgRtSeconds, lowest wins.
type Backend = { id: string; active: number; avgRtMs: number; samples: number };

const DEFAULT_RT_MS = 1500; // assumed latency until a server has answered once

class LeastResponseTimeBalancer {
  private backends: Backend[];

  constructor(ids: string[]) {
    if (ids.length === 0) throw new Error("at least one backend is required");
    this.backends = ids.map((id) => ({ id, active: 0, avgRtMs: 0, samples: 0 }));
  }

  private score(b: Backend): number {
    // Fall back to the default only while there are no samples, so a genuine
    // 0 ms average is not mistaken for "unknown".
    const rt = b.samples > 0 ? b.avgRtMs : DEFAULT_RT_MS;
    return (b.active + 1) * (rt / 1000);
  }

  acquire(): Backend {
    let best = this.backends[0];
    for (const b of this.backends) {
      if (this.score(b) < this.score(best)) best = b;
    }
    best.active++; // reserve immediately, like least-conn
    return best;
  }

  // Call on completion with the observed response time in milliseconds.
  release(b: Backend, rtMs: number): void {
    b.active = Math.max(0, b.active - 1); // guard against a double release
    // running mean: simple, but slow to forget (EWMA improves on this)
    b.avgRtMs = (b.avgRtMs * b.samples + rtMs) / (b.samples + 1);
    b.samples++;
  }
}

References & further reading
- Docs (docs.nginx.com): NGINX: Load-balancing methods (least_time). NGINX Plus's least_time scores by header or last-byte latency, the production form of this algorithm.
- Docs (envoyproxy.io): Envoy: Load balancers (least request). Envoy biases least-request by active requests and host weights, a close cousin that folds in load and capacity.
- Docs (twitter.github.io): Finagle: Clients, load balancing. Finagle's 'least loaded' and Peak EWMA balancers, with a clear discussion of why latency-awareness beats pure connection count.
- Book (sre.google): Google SRE Book, Ch. 20: Load Balancing in the Datacenter. Why latency and queueing, not request counts, are the signals that actually protect tail latency.
- Talk (infoq.com): Tyler McMullen, Load Balancing is Impossible. Why latency-based scoring helps, and the edge cases (herding onto a 'fast' server) it can still hit.
- Article (cloudflare.com): Cloudflare: Types of load balancing algorithms. Plain-English summary of least-response-time among the dynamic methods.
Knowledge check
Did the prototype land?
question 01 / 03
How does least response time score each server?
question 02 / 03
A server has the fewest active connections but the highest average response time. What does least response time do?
question 03 / 03
Why is a plain running average of response time a weakness, and what addresses it?