Overview
What this concept solves
An exponentially weighted moving average (EWMA) gives a load balancer a memory with a fading edge. Each new latency sample is blended into a running average with weight α, while everything before it decays by a factor of (1 − α). Recent behavior dominates; older history fades out. This is the statistical core of many adaptive balancers: Twitter's Finagle, Linkerd, and latency-aware policies in other RPC stacks such as gRPC route on an EWMA of response time.
The 'Peak' variant used here (from Finagle) keeps the same EWMA × (active + 1) score as least response time, but makes the latency term peak-sensitive: it reacts quickly to spikes and recovers only cautiously. A server that gets slow is penalized fast; once it recovers, its score eases back down gradually rather than snapping, so you don't immediately stampede back onto a server that just had a bad moment.
Mechanics
How it works
The one-line recurrence
On every completed request, fold the new sample into the average:
ewma = α · sample + (1 − α) · ewma
- α (alpha) is the smoothing factor, 0 < α < 1. High α → the latest sample matters a lot → reactive but jumpy. Low α → heavy smoothing → stable but slow to notice change.
- (1 − α) is how much of the old average survives each step. Apply it repeatedly and old samples decay geometrically — that's the 'exponential' in EWMA.
- Half-life ≈ how many samples until an old observation's weight halves: ln(0.5) / ln(1 − α). The prototype shows this live as you move the slider.
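The recurrence and the half-life formula above can be sketched in a few lines. This is a minimal illustration, not the prototype's code; the function names `ewmaUpdate`, `weightAtAge`, and `halfLifeSamples` are made up for this example.

```typescript
// Fold one latency sample into a running EWMA (the one-line recurrence).
function ewmaUpdate(prev: number, sample: number, alpha: number): number {
  return alpha * sample + (1 - alpha) * prev;
}

// Weight that a sample folded in `age` updates ago still carries in the average.
// age = 0 is the newest sample, which carries weight alpha.
function weightAtAge(alpha: number, age: number): number {
  return alpha * Math.pow(1 - alpha, age);
}

// Number of updates after which a sample's weight has halved.
function halfLifeSamples(alpha: number): number {
  return Math.log(0.5) / Math.log(1 - alpha);
}

console.log(ewmaUpdate(100, 200, 0.3));        // 130: 30% of the new sample, 70% of the old average
console.log(halfLifeSamples(0.05).toFixed(1)); // "13.5"
console.log(halfLifeSamples(0.9).toFixed(1));  // "0.3"
```

With α = 0.5, `weightAtAge` gives 0.5, 0.25, 0.125 for ages 0, 1, 2, which is the geometric decay the bullets describe.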
Why multiply by (active + 1)?
EWMA alone tells you how fast a server has been. Multiplying by the current outstanding-request count (plus one for the new request) turns it into an estimate of how long the next request will take given the queue in front of it. This is the same scoring shape as least response time — EWMA just supplies a much better latency number: one that forgets stale data at a rate you control.
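A small worked example shows why the queue term matters. The `Server` type and the numbers below are hypothetical, chosen only to show a case where the server with the lower EWMA is not the best pick.

```typescript
type Server = { id: string; ewmaMs: number; active: number };

// Estimated cost of one more request: smoothed latency times queue depth,
// where queue depth counts the outstanding requests plus the new one.
function score(s: Server): number {
  return s.ewmaMs * (s.active + 1);
}

const fastButBusy: Server = { id: "a", ewmaMs: 50, active: 3 };  // 50 * 4 = 200
const slowButIdle: Server = { id: "b", ewmaMs: 120, active: 0 }; // 120 * 1 = 120

const pick = score(fastButBusy) <= score(slowButIdle) ? fastButBusy : slowButIdle;
console.log(pick.id); // "b"
```

Ranking by EWMA alone would choose server a; once the three requests already queued on it are counted, the idle server b is the cheaper destination.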
Tuning α is the whole game
Too high and the balancer chases noise, flapping traffic on every random slow request. Too low and it's sluggish, sending requests to a server that went bad seconds ago. The right α depends on your request rate and how spiky your latencies are — which is exactly what the slider lets you feel.
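The two failure modes can be put into numbers with a toy calculation. The sketch below assumes a noiseless step from 100 ms to 400 ms to measure sluggishness, and a single 1000 ms outlier on a 100 ms baseline to measure jumpiness; the helper names are illustrative.

```typescript
function ewmaUpdate(prev: number, sample: number, alpha: number): number {
  return alpha * sample + (1 - alpha) * prev;
}

// Sluggishness: completed requests needed before the average gets halfway
// to a new latency level after a step change from `from` to `to` (to > from).
function samplesToHalfway(alpha: number, from: number, to: number): number {
  const halfway = (from + to) / 2;
  let ewma = from;
  let n = 0;
  while (ewma < halfway) {
    ewma = ewmaUpdate(ewma, to, alpha);
    n++;
  }
  return n;
}

// Jumpiness: how far a single outlier drags the average off its baseline.
function outlierShift(alpha: number, baseline: number, outlier: number): number {
  return ewmaUpdate(baseline, outlier, alpha) - baseline;
}

for (const alpha of [0.05, 0.3, 0.9]) {
  console.log(alpha, samplesToHalfway(alpha, 100, 400), outlierShift(alpha, 100, 1000));
}
```

Under these assumptions α = 0.05 needs 14 samples to get halfway but moves only about 45 ms on the outlier, while α = 0.9 gets there in 1 sample and moves about 810 ms. α = 0.3 sits in between at 2 samples and about 270 ms.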
Interactive prototype
Run it. Break it. Tune it.
Sandboxed simulation embedded right in the page. No setup, no install.
About this simulation
Peak EWMA: each server's response time is smoothed into an exponentially-decaying average, plotted live per server. The pick score is EWMA × (active + 1). Drag the α slider to watch the curves go from slow-and-smooth (low α) to twitchy-and-reactive (high α), and see the half-life stat change with it.
Hands-on
Try these on your own
Open the prototype above, run each experiment, predict the answer, then verify.
Watch the curves settle
Hit 'Auto' and watch the live chart. Server 3 starts at ×3 latency, so its line (red) climbs well above the others and the LB sends it less work. Each server's EWMA is exactly the number the score row multiplies by (active + 1) — the chart is what the algorithm sees.
Slide α from smooth to jumpy
Drag the α slider to 0.05: the lines go glassy-smooth and the 'EWMA half-life' stat jumps to ~13 samples. Old samples linger, so the balancer reacts slowly. Now push α to 0.90: the lines get jagged, half-life drops to ~0.3, and routing flips quickly on every spike. Find the α where it tracks change without flapping.
Create and heal a hotspot
While it's running, raise Server 1's latency to ×4. Its EWMA line climbs and the LB drains traffic off it within a few samples. Drop it back to ×1 and watch how gently the line — and the traffic — return: Peak EWMA recovers cautiously, not instantly. Compare that lag at low vs. high α.
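The fast-up, slow-down behavior in this experiment can be sketched as an asymmetric update: jump straight to a sample that is worse than the current average, and otherwise decay toward it. Finagle describes its decay in terms of elapsed time rather than a per-sample α, so the per-sample form below is an assumption made to match the slider, and `peakEwmaUpdate` is a hypothetical name.

```typescript
// Peak-sensitive update: a sample worse than the average replaces it outright;
// a better sample only pulls the average down by the usual EWMA step.
function peakEwmaUpdate(prev: number, sample: number, alpha: number): number {
  if (sample > prev) return sample;
  return alpha * sample + (1 - alpha) * prev;
}

// A healthy server at 100 ms has one 400 ms request, then recovers.
let ewma = 100;
const trace: number[] = [];
for (const sample of [400, 100, 100, 100]) {
  ewma = peakEwmaUpdate(ewma, sample, 0.3);
  trace.push(Math.round(ewma));
}
console.log(trace); // [400, 310, 247, 203]
```

The penalty lands in a single sample, while three healthy samples later the average is still about twice the true latency. Lowering α stretches that recovery further.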
In practice
When to use it — and what you give up
When to reach for it
- Adaptive, latency-aware balancing at scale: offered in RPC stacks and service meshes such as Finagle and Linkerd.
- Noisy environments — shared hardware, GC pauses, JIT warmup; EWMA smooths transient blips instead of overreacting to each one.
- Paired with power-of-two-choices — score two random servers by their EWMA and take the better one; nearly stateless yet sharply latency-aware.
- When you need to react to degradation faster than a plain average allows — Peak EWMA penalizes slowdowns quickly and recovers slowly, protecting tail latency.
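The power-of-two-choices pairing from the list above can be sketched as follows. The `Backend` shape and the injectable `rand` parameter are assumptions for this example; `rand` exists only so the pick can be made deterministic in a test.

```typescript
type Backend = { id: string; ewmaMs: number; active: number };

const cost = (b: Backend): number => b.ewmaMs * (b.active + 1);

// Power-of-two-choices: sample two distinct backends at random and keep the
// one with the lower EWMA-based cost. No scan over the full backend list.
function pickP2C(backends: Backend[], rand: () => number = Math.random): Backend {
  if (backends.length === 0) throw new Error("no backends");
  if (backends.length === 1) return backends[0];
  const i = Math.floor(rand() * backends.length);
  // Draw the second index from the remaining n - 1 slots so it never equals i.
  let j = Math.floor(rand() * (backends.length - 1));
  if (j >= i) j++;
  const a = backends[i];
  const b = backends[j];
  return cost(a) <= cost(b) ? a : b;
}

const pool: Backend[] = [
  { id: "a", ewmaMs: 50, active: 3 },  // cost 200
  { id: "b", ewmaMs: 120, active: 0 }, // cost 120
  { id: "c", ewmaMs: 10, active: 0 },  // cost 10
];
console.log(pickP2C(pool).id); // "b" or "c": "a" is the costliest, so it loses every pairing
```

Each decision reads only two scores, so its cost stays constant as the pool grows, and the randomness keeps many independent balancers from all piling onto the same apparently best server.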
Real-world example
Twitter's Finagle ships p2cPeakEwma: power-of-two-choices over a Peak-EWMA latency score, with a configurable decay time. Linkerd's proxy uses the same idea. It's arguably the most-deployed 'smart' load-balancing policy in microservice infrastructure.
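A decay time replaces the per-sample α with a weight that depends on how much time has passed since the last update. The sketch below shows the general technique with an exponential weight exp(−Δt/τ); it is not Finagle's source, and `decayedUpdate`, `effectiveAlpha`, and the 10-second τ are illustrative assumptions.

```typescript
// Time-decayed update: the old average keeps weight exp(-elapsed / tau),
// so it fades with wall-clock time rather than with the number of samples.
function decayedUpdate(prev: number, sample: number, elapsedMs: number, tauMs: number): number {
  const w = Math.exp(-elapsedMs / tauMs);
  return prev * w + sample * (1 - w);
}

// Equivalent per-sample alpha when requests arrive at a steady interval.
function effectiveAlpha(intervalMs: number, tauMs: number): number {
  return 1 - Math.exp(-intervalMs / tauMs);
}

const tauMs = 10_000; // illustrative decay time
console.log(effectiveAlpha(1000, tauMs).toFixed(4)); // 1 request/s:    "0.0952"
console.log(effectiveAlpha(10, tauMs).toFixed(4));   // 100 requests/s: "0.0010"
```

The same τ yields a very different per-sample α at different request rates, which is one reason a time-based decay is easier to tune than a fixed α when traffic volume varies.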
Pros
- Tracks recent latency while forgetting stale data at a tunable rate, which a plain running average cannot do.
- O(1) memory and compute per server: one number, one update per request.
- Peak variant reacts fast to slowdowns and recovers cautiously, protecting tail latency.
- Composes beautifully with power-of-two-choices for near-stateless adaptive balancing.
Cons
- α must be tuned — wrong values either chase noise or react too slowly.
- Still a lagging signal: EWMA only updates when requests complete.
- More moving parts to reason about and debug than a stateless counter.
- Per-balancer EWMAs can disagree across a distributed fleet, like any local-state scheme.
Reference
Code & further reading
A minimal reference implementation and pointers worth bookmarking.
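// Peak EWMA: smooth recent latency, score by ewma * (active + 1).
type Backend = { id: string; active: number; ewmaMs: number };

class PeakEwmaBalancer {
  private backends: Backend[];

  // alpha: smoothing factor in (0, 1). defaultRtMs: latency assumed for a
  // backend that has no samples yet (ewmaMs === 0 marks "unmeasured").
  constructor(ids: string[], private alpha = 0.3, private defaultRtMs = 1500) {
    if (ids.length === 0) throw new Error("at least one backend is required");
    this.backends = ids.map((id) => ({ id, active: 0, ewmaMs: 0 }));
  }

  // Estimated cost, in seconds, of sending one more request to b.
  private score(b: Backend): number {
    const rt = b.ewmaMs || this.defaultRtMs;
    return (rt / 1000) * (b.active + 1);
  }

  // Linear scan for the lowest score; ties go to the earliest backend.
  acquire(): Backend {
    let best = this.backends[0];
    let bestScore = this.score(best);
    for (const b of this.backends) {
      const s = this.score(b);
      if (s < bestScore) {
        best = b;
        bestScore = s;
      }
    }
    best.active++;
    return best;
  }

  // Fold the observed RTT into the exponentially-weighted average.
  // This update is symmetric; Finagle's peak-sensitive version also jumps
  // straight to the sample when it exceeds the current average.
  release(b: Backend, rtMs: number): void {
    b.active = Math.max(0, b.active - 1); // guard against a double release
    b.ewmaMs = b.ewmaMs === 0
      ? rtMs
      : this.alpha * rtMs + (1 - this.alpha) * b.ewmaMs;
  }
}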
// half-life in samples ≈ ln(0.5) / ln(1 - alpha)
References & further reading
6 sources
- Docs (twitter.github.io)
Finagle — Clients: Peak EWMA load balancer
The canonical Peak-EWMA description: a peak-sensitive moving average of RTT, weighted by outstanding requests, with a decay time.
- Article (infoq.com)
InfoQ — How Twitter Improves Resource Usage with a Deterministic Load Balancing Algorithm
How Twitter evolved its P2C + EWMA balancing (aperture) in production, and what problems it solved.
- Paper (eecs.harvard.edu)
Mitzenmacher — The Power of Two Choices in Randomized Load Balancing
The P2C result that EWMA scoring is almost always paired with in practice.
- Docs (envoyproxy.io)
Envoy — Load balancers
Envoy's least-request and slow-start behaviors — the same latency-aware family in a different proxy.
- Article (en.wikipedia.org)
Wikipedia — Moving average (exponential)
The EWMA recurrence, α, and how the weights decay geometrically — the math under the hood.
- Talk (infoq.com)
Tyler McMullen — Load Balancing is Impossible (talk)
Why time-decayed latency signals plus randomized choice beat both pure random and naive least-conns.
Knowledge check
Did the prototype land?
question 01 / 03
In `ewma = α·sample + (1 − α)·ewma`, what does a *high* α do?
question 02 / 03
Why does Peak EWMA multiply the smoothed latency by (active + 1)?
question 03 / 03
What's the main advantage of EWMA over a plain running average of response time?