Overview
What this concept solves
Every breaker so far has the same two failure modes during recovery. Fail mode A: the downstream isn't ready yet — the cooldown was too short, the trial calls fail, the breaker re-opens, and you wasted the cooldown for nothing. Fail mode B: the downstream is barely ready — the cooldown was long enough but the trial calls were the lull before another storm; you re-open seconds later. Repeat ad nauseam.
The Adaptive variant fixes both with two mechanisms. Exponential backoff on the cooldown — every re-trip doubles the wait (up to a cap), so a serially-failing dependency gets longer and longer breathing room. And gradual ramp-up during recovery — instead of jumping from 0% to 100% traffic, admit 10% first, then 25%, then 50%, then 100%, and gate each step on a small batch of healthy probes.
It's the breaker AWS, Google, and the big libraries reach for when a fixed cooldown isn't cutting it. More moving parts, more knobs, but the only variant that doesn't keep slamming a recovering downstream the moment it shows a pulse.
Mechanics
How it works
Mechanism 1 — Exponential backoff
- Start with a base cooldown (e.g. 3 seconds) and a re-trip streak counter at 0.
- On every trip, set the cooldown to
min(base × 2^streak, max). First trip: 3s. Second: 6s. Third: 12s. Capped (e.g. 24s) so it doesn't run away. - Reset the streak to 0 only on a full recovery — i.e. when traffic has reached 100% admission and stayed healthy. A partial recovery that re-trips counts as another re-trip.
- In production, add jitter (Adaptive's untested cousin): cooldown = min(base × 2^streak, max) × random(0.5..1.0). Stops every caller from probing the downstream in lockstep after a wide outage.
Mechanism 2 — Traffic ramp recovery
- After the cooldown, instead of HALF-OPEN with N trials, enter a RECOVERING state with an admission level — start at 10%.
- Each incoming call: roll a die; admit it with probability = current admission level. Hold the rest back (count as 'rejected' but no penalty to the downstream).
- Track the admitted calls. After K consecutive healthy admitted calls (e.g. 2), promote to the next level: 10% → 25% → 50% → 100%.
- At 100% with K healthy in a row, CLOSE. Streak counter resets to 0.
- Any failure during RECOVERING — at any level — re-OPENs the breaker, with the next-bigger cooldown.
Why this combination matters
Without backoff, a flaky downstream that briefly recovers will cause you to re-trip every few seconds — same cooldown, same instant slam, same failure. Without ramp recovery, you risk sending full production traffic to a service that just woke up and is still warming caches. The two mechanisms cover the two failure modes; either alone leaves you exposed.
Inspirations
Exponential backoff with jitter comes straight from the AWS Builders' Library. The gradual ramp is the same idea as AWS Application Load Balancer's slow_start mode, Envoy's panic_threshold thresholds, and Linkerd's adaptive concurrency — all of which slowly increase admitted load to a recovering target.
Interactive prototype
Run it. Break it. Tune it.
Sandboxed simulation embedded right in the page. No setup, no install.
About this simulation
Two upgrades over the fixed-cooldown breakers. Cooldowns grow with every re-trip — 3s → 6s → 12s → 24s — so a flaky downstream isn't slammed back to life. And recovery isn't all-or-nothing: traffic is admitted in stages (10% → 25% → 50% → 100%), each stage gated by a few healthy probe calls. One failure during recovery throws you back to OPEN with the next bigger cooldown.
Hands-on
Try these on your own
Open the prototype above, run each experiment, predict the answer, then verify.
Watch the cooldown double
Click failing call until the breaker trips (default 50% over the last 12). It opens with a 3s cooldown. Wait until it recovers to half-open / recovering. Click failing call again to force a re-trip — now the cooldown is 6s. Do it again: 12s. The 'streak' counter in the cooldown card shows what you're paying for the flapping.
Walk the recovery ramp
Trip the breaker once, let the cooldown elapse. The badge shows recovering at 10%. Click healthy call twice — the breaker promotes to 25%. Two more healthy → 50%. Two more → 100%. Two more → fully closed, streak resets to 0. The ramp is your insurance policy against flooding a freshly recovered service.
One slip and back to open
Trip → recover to 25% with a couple of healthy calls. Now click failing call once. The breaker slams back to OPEN with the next-bigger cooldown — it doesn't average, it doesn't forgive. The single-failure rule applies at every ramp level, not just at 10%.
Auto traffic with a flaky downstream
Tick Auto traffic and set Downstream errors to 60% (high but not 100). Watch the breaker repeatedly trip, recover partially, and re-trip — each cycle waiting longer. Eventually the cooldown sits at the max and probes get sparse. Drop errors to 0 and watch the ramp finally complete; the streak resets to 0, cooldown drops back to base.
In practice
When to use it — and what you give up
When to upgrade to Adaptive
- Flaky dependencies that recover, then re-fail — fixed-cooldown breakers waste your time with this pattern. Adaptive cooldowns space probes further and further apart.
- Downstreams with cold starts — services that need warm caches, JIT compilation, or connection pools to be at full speed. The 10/25/50/100% ramp gives them time to fully wake up.
- Multi-tenant or shared infrastructure — if your breaker tripping floods another service when it un-trips, the ramp limits the inrush spike.
- Production at scale — when fixed-cooldown breakers have caused you an incident before, this is what to roll out next.
- Avoid if your service is simple enough that exp-backoff is overkill — three knobs to tune vs. one. Start with Error-Percentage; upgrade if you see the failure modes above.
More knobs means more ways to misconfigure
Base cooldown, max cooldown, ramp levels, probes-per-step — that's four numbers, all interdependent. Pick wrong (e.g. base = max) and you reduce to a fixed-cooldown breaker. Test the behaviour with the prototype and a chaos scenario before shipping.
Pros
- Doesn't slam recovering downstreams — the ramp gives them time to warm up.
- Handles flapping — exp-backoff means re-failures get longer cooldowns, eventually waiting out the outage.
- Self-tuning, in a sense — you set the bounds; the cooldown picks its own length within them based on observed re-trips.
- Mirrors what big systems do — ALB slow-start, Envoy panic mode, Linkerd adaptive concurrency. Knowing this variant is knowing the production pattern.
Cons
- More state and more knobs — base, max, ramp levels, probes per level. Tuning is non-trivial.
- Rejects some traffic during recovery — at 10% admission, 90% of calls are held back. Pair with a fallback or queue if those callers need a response.
- Cooldowns can get long — a service that flaps 4 times can be locked out for 24+ seconds even though it's now healthy. Set a max you can live with.
- Harder to reason about — operators looking at a dashboard need to understand 'recovering at 25%' vs the simpler 'open' or 'closed'.
Reference
Code & further reading
A minimal reference implementation and pointers worth bookmarking.
// Adaptive breaker: exp-backoff cooldown + ramped recovery.
type State = "CLOSED" | "OPEN" | "RECOVERING";
class AdaptiveBreaker {
private state: State = "CLOSED";
private calls: boolean[] = [];
private streak = 0; // re-trip count, drives backoff
private cooldownMs;
private openUntil = 0;
private rampIdx = 0;
private rampOk = 0;
// 10 → 25 → 50 → 100% admission during recovery
private static ADMIT = [10, 25, 50, 100];
constructor(
private size = 12,
private threshold = 0.5,
private baseMs = 3_000,
private maxMs = 24_000,
private probesPerStep = 2,
) { this.cooldownMs = baseMs; }
async call<T>(work: () => Promise<T>): Promise<T> {
if (this.state === "OPEN" && Date.now() >= this.openUntil) this.toRecovering();
if (this.state === "OPEN") throw new Error("circuit open");
if (this.state === "RECOVERING") {
// Admit a probabilistic slice; the rest are short-circuited.
const admit = AdaptiveBreaker.ADMIT[this.rampIdx];
if (Math.random() * 100 >= admit) throw new Error("recovering · held back");
}
try {
const result = await work();
this.onResult(true);
return result;
} catch (err) {
this.onResult(false);
throw err;
}
}
private onResult(ok: boolean) {
if (this.state === "RECOVERING") {
if (!ok) { this.trip(); return; }
this.rampOk++;
if (this.rampOk >= this.probesPerStep) {
if (this.rampIdx < AdaptiveBreaker.ADMIT.length - 1) {
this.rampIdx++; this.rampOk = 0;
} else { this.close(); }
}
return;
}
this.calls.push(ok);
if (this.calls.length > this.size) this.calls.shift();
const fails = this.calls.filter(c => !c).length;
if (this.calls.length >= Math.min(5, this.size) && fails / this.calls.length >= this.threshold) {
this.trip();
}
}
private trip() {
this.streak++;
this.cooldownMs = Math.min(this.baseMs * 2 ** (this.streak - 1), this.maxMs);
this.state = "OPEN";
this.openUntil = Date.now() + this.cooldownMs;
this.rampIdx = 0; this.rampOk = 0;
}
private toRecovering() { this.state = "RECOVERING"; this.rampIdx = 0; this.rampOk = 0; }
private close() {
this.state = "CLOSED"; this.calls = []; this.streak = 0;
this.cooldownMs = this.baseMs; this.rampIdx = 0; this.rampOk = 0;
}
}References & further reading
6 sources- Articleaws.amazon.com
AWS Builders' Library — Timeouts, retries, and backoff with jitter
Marc Brooker's piece on why exponential backoff plus jitter is the only honest answer to flapping services. The exp-backoff part of this variant comes from here.
- Docsdocs.aws.amazon.com
AWS Application Load Balancer — slow_start_duration
The 'gradually ramp traffic' idea, productionised at the load-balancer layer. Same concept as this breaker's recovery ramp, applied to new targets after deploy or recovery.
- Docsenvoyproxy.io
Envoy — panic threshold & outlier ejection
Envoy's adaptive behaviour during partial outages. The 'panic threshold' is a different mechanism but solves the same 'don't trust a single recovering host' problem.
- Docslinkerd.io
Linkerd — adaptive concurrency control
Service-mesh take on the same idea: instead of binary open/closed, dynamically adjust how much load each upstream gets based on its recent latency.
- Docsresilience4j.readme.io
Resilience4j — slowCallRateThreshold + allowedNumberOfCallsInHalfOpenState
Resilience4j doesn't ship full ramp recovery but its half-open knobs give you most of the way there. Worth reading to see how libraries approximate adaptive behaviour.
- Docspollydocs.org
Polly — Advanced circuit-breaker + retry policies
.NET resilience library. The 'combining strategies' guide shows how to stack a breaker with retry-with-exp-backoff to get effectively adaptive behaviour.
Knowledge check
Did the prototype land?
Quick questions, answers revealed on submit. No scoring saved.
question 01 / 03
Why does the cooldown grow on every re-trip?
question 02 / 03
During recovery, the breaker is at the 50% admission level when a single call fails. What happens?
question 03 / 03
Why is the ramp (10 → 25 → 50 → 100%) better than a single threshold of trial calls?
0/3 answered
Was this concept helpful?
Tell us what worked, or what to improve. We read every note.