Circuit Breaker
When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.
What is a Circuit Breaker?
The 60-second primer
A circuit breaker stops you from calling something that is already broken. Borrow the metaphor from your house: when a wire is shorting, the breaker pops and cuts the power. It hurts now — your lamps go off — but it prevents the fire that would happen if current kept flowing. Software does the same thing. If your service keeps calling a downstream that is timing out or returning errors, you pile up threads, connections, and queued requests waiting on something that will not answer. The breaker watches the call results, and once they look bad, it just says no to new calls for a while — fast — so the rest of your system stays healthy.
Every breaker has the same three states. Closed means traffic flows normally and the breaker is just watching. Open means the breaker has tripped and every call is rejected instantly without even being attempted. Half-Open is the careful peek in between: after a cooldown, a small number of trial calls are allowed through to see if the downstream has recovered. If those trials succeed, the breaker closes again; if they fail, it opens for another cooldown. That is the whole pattern. The six variants in this topic differ only in which signal trips the breaker and how cleverly it recovers.
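The three states above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not any particular library's implementation: the class name `CircuitBreaker`, its parameters, and the "N consecutive failures" trip rule are all assumptions chosen to keep the example short. The variants in this topic swap in smarter trip rules.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted."""


class CircuitBreaker:
    # Hypothetical names and defaults, for illustration only.
    def __init__(self, failure_threshold=3, cooldown_s=5.0, trial_calls=2,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures that trip
        self.cooldown_s = cooldown_s                # how long to stay open
        self.trial_calls = trial_calls              # successes needed to close again
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
        self.trial_successes = 0

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("rejected: breaker is open")
            self.state = State.HALF_OPEN  # cooldown elapsed, allow trial calls
        try:
            result = fn()
        except Exception:
            if self.state is State.HALF_OPEN:
                self._trip()              # a failed trial reopens immediately
            else:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self._trip()
            raise
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0             # a success resets the consecutive count
        return result
```

Callers wrap each downstream call as `breaker.call(lambda: fetch())`, where `fetch` stands in for whatever the real call is. A production breaker would also need locking and a cap on concurrent half-open trials, both of which this sketch leaves out.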
If you remember nothing else, remember the goal: fail fast, protect the system, give the downstream a break, and probe gently before letting traffic back in. The rest is engineering choices about what counts as 'looks bad' — error rate? latency? error rate but only if you have enough samples to trust it? — and how cautiously to come back.
Where this shows up
- Microservices: service A calls service B, which calls service C. If C is dying, every blocked call inside A holds a thread, a TCP connection, and a slot in A's queue. Without a breaker, A dies because C dies. With one, A trips early and stays up to serve the requests that don't need C.
- API gateways & service meshes: Envoy, Istio, Linkerd, and AWS App Mesh all ship circuit-breaking at the proxy layer so individual services don't each have to reinvent it.
- Client SDKs: some official SDKs for cloud services build in breaker-style protections (retry budgets, for example) so a noisy outage in one region doesn't snowball through every call your app makes.
- Browser-side requests: even a single-page app can benefit. If the analytics endpoint is down, trip and stop sending events instead of stacking up retries that compete with real requests for connections and memory.
- Anywhere a slow neighbour can poison the pool: database connection pools, HTTP client pools, thread pools. The breaker frees you from the cascading-failure pattern where one slow dependency takes the entire upstream down.
It's a feedback loop, not a switch
Beginners often picture the breaker as 'check a flag, fail if set.' That is the easy part. The interesting work is measuring: what counts as a failure (just exceptions? slow responses? both?), how big a window, how many samples are enough to trust the number, and how to come back without instantly tripping again. Every variant here picks a different answer for those questions.
Side-by-side
How they compare
The same concepts, on the same axes. Use this as a map; the individual pages are the territory.
| Variant | Trip signal | Best at catching | Recovery style | Use when |
|---|---|---|---|---|
| 01 State Machine | (foundation) | The three-state dance itself | Closed → Open → Half-Open | Learning the pattern before any trip rule |
| 02 Count-Based | Failure rate over last N calls | Sustained error bursts | Fixed cooldown, then half-open trials | Steady traffic, simple default |
| 03 Time-Based | Failure rate over last N seconds | Error bursts in uneven traffic; stale failures age out in quiet periods | Fixed cooldown, then half-open trials | Bursty traffic where call counts are uneven |
| 04 Slow Call Rate | % of calls slower than D ms | Latency-only failures (200 OK in 5 s) | Fixed cooldown, then half-open trials | Latency-sensitive paths; downstreams that hang rather than error |
| 05 Error Percentage | Min volume AND error % both met | Real outages, not 1 failure in 2 calls | Fixed cooldown, then a single trial call | Low-volume endpoints; production defaults (Hystrix lineage) |
| 06 Adaptive | Error % (closed) + probe results (recovering) | Repeat outages; gets stricter with each re-trip | Exp-backoff cooldown + 10/25/50/100% ramp | Flaky dependencies where naïve recovery slams them back open |
Decision guide
Which one should you use?
A practical tour of when each algorithm wins.
How to pick
- You're new and just want one → Count-Based with a 50% threshold over the last 10–20 calls. A count-based window with a 50% threshold is also Resilience4j's default (over a larger window of 100 calls), so you can ship it today.
- Your traffic is bursty (cron jobs, periodic syncs, sparse user activity) → Time-Based. After a quiet spell, failures from minutes ago shouldn't still dominate the failure rate just because no newer calls have pushed them out of the window.
- The downstream lies about being healthy by going slow → Slow Call Rate. Errors aren't the only failure mode; a latency spike is enough to exhaust your thread pool.
- Low-volume endpoint where one bad call in three would otherwise trip you → Error Percentage. The volume gate is exactly the noise filter you need.
- The downstream comes back, dies, comes back, dies → Adaptive. Each re-trip doubles the cooldown and the ramp keeps you from blasting it the moment it shows a pulse.
- Use the State Machine first: it's the foundation. Every other variant is a different rule for when to trip; the closed-open-half-open dance is identical.
Pair it with the rest
A breaker is not a complete resilience strategy. It works best with timeouts (so a slow call eventually returns failure), retries with jitter (so transient blips don't even reach the breaker), and fallbacks (so a request can still produce something when the breaker is open). The breaker says 'no', but a good system also says 'here's what we can give you instead.'
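One way to picture the layering described above is fallback on the outside, then the breaker, then retries, with the timeout enforced by the call itself. The sketch below is a hedged illustration: `BreakerOpen`, `breaker_call`, `with_retries`, and `resilient_call` are hypothetical names, and `fn` is assumed to enforce its own timeout (for example through an HTTP client's timeout setting).

```python
import random
import time


class BreakerOpen(Exception):
    """Hypothetical signal that the breaker rejected the call."""


def backoff_with_jitter(attempt, base_s=0.1, cap_s=2.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def with_retries(fn, max_attempts=3, sleep=time.sleep):
    """Retry fn; only the final failure propagates to the caller (the breaker)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_with_jitter(attempt))


def resilient_call(breaker_call, fn, fallback, sleep=time.sleep):
    """breaker_call(f) is assumed to run f or raise BreakerOpen.

    Retries sit inside the breaker, so a blip that succeeds on retry is
    never recorded as a failure. Any remaining error, including a
    rejection from an open breaker, falls through to the fallback.
    """
    try:
        return breaker_call(lambda: with_retries(fn, sleep=sleep))
    except Exception:
        return fallback()
```

With this ordering the breaker only sees the outcome after retries are exhausted. Putting retries outside the breaker instead would make every failed attempt count against the downstream, which trips sooner but is less forgiving of blips.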
Concepts in this track
6 concepts, in order
Each links to a concept page with its own explanation, prototype, and quiz.
The State Machine
Closed lets calls through. Open rejects them all. Half-Open peeks at recovery with a handful of trials. Learn the three-state dance before any of the trip rules.
Count-Based
Trip on the failure rate across the last N calls. The simplest sliding window — small, fixed memory, fast to evaluate.
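A count-based window can be sketched with a fixed-size deque. The class name `CountBasedWindow` and its defaults are assumptions for illustration; waiting for a full window before evaluating is one reasonable choice, not the only one.

```python
from collections import deque


class CountBasedWindow:
    # Hypothetical helper: tracks the last `size` outcomes only.
    def __init__(self, size=10, failure_rate_threshold=0.5):
        # True = failure. With maxlen set, the oldest entry drops automatically.
        self.outcomes = deque(maxlen=size)
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Evaluate only once the window is full, so one early failure
        # cannot trip the breaker by itself.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.failure_rate() >= self.failure_rate_threshold
```

Memory is bounded by `size`, and each evaluation is a sum over at most `size` booleans.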
Time-Based
Trip on the failure rate over the last N seconds. Old calls age out on their own, so a quiet period clears stale failures from the window without any new traffic.
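The aging-out behaviour can be sketched by storing a timestamp with each outcome and evicting anything older than the window. `TimeBasedWindow` and its parameters are hypothetical; production libraries typically aggregate into per-second buckets rather than keeping one entry per call.

```python
import time
from collections import deque


class TimeBasedWindow:
    # Hypothetical helper: keeps (timestamp, failed) pairs, oldest first.
    def __init__(self, window_s=10.0, failure_rate_threshold=0.5,
                 clock=time.monotonic):
        self.window_s = window_s
        self.failure_rate_threshold = failure_rate_threshold
        self.clock = clock
        self.events = deque()

    def _evict(self):
        # Drop every outcome recorded at or before the window's start.
        cutoff = self.clock() - self.window_s
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def record(self, failed):
        self.events.append((self.clock(), bool(failed)))

    def failure_rate(self):
        self._evict()
        if not self.events:
            return 0.0
        return sum(failed for _, failed in self.events) / len(self.events)

    def should_trip(self):
        return self.failure_rate() >= self.failure_rate_threshold
```

Note that with no minimum sample count, a single failed call in an otherwise empty window reads as a 100% failure rate. That is the gap the Error Percentage variant's volume gate closes.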
Slow Call Rate
Trip on latency, not just errors. A downstream that returns 200 OK in 5 seconds is broken too — and this is the variant that catches it.
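A rough sketch of the idea: time every call, mark it slow if it exceeds a duration threshold, and trip on the share of slow calls regardless of whether they returned successfully. `SlowCallWindow`, `timed_call`, and the thresholds are hypothetical names and values.

```python
import time
from collections import deque


class SlowCallWindow:
    # Hypothetical helper: tracks whether each of the last `size` calls was slow.
    def __init__(self, size=10, slow_threshold_s=1.0, slow_rate_threshold=0.5):
        self.slow = deque(maxlen=size)
        self.slow_threshold_s = slow_threshold_s
        self.slow_rate_threshold = slow_rate_threshold

    def record_duration(self, seconds):
        self.slow.append(seconds >= self.slow_threshold_s)

    def slow_rate(self):
        return sum(self.slow) / len(self.slow) if self.slow else 0.0

    def should_trip(self):
        window_full = len(self.slow) == self.slow.maxlen
        return window_full and self.slow_rate() >= self.slow_rate_threshold


def timed_call(fn, window, clock=time.perf_counter):
    """Run fn and record how long it took, whether it succeeded or raised."""
    start = clock()
    try:
        return fn()
    finally:
        window.record_duration(clock() - start)
```

A call that returns 200 OK after five seconds is recorded as slow here even though no exception was raised, which is exactly the case an error-only breaker misses.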
Error Percentage (Hystrix)
Two gates: trip only if traffic clears a minimum volume AND the error rate crosses a threshold. This is the original Netflix Hystrix rule, and it stops tiny, noisy samples from tripping the breaker.
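The two-gate rule reduces to a small predicate. The function name and signature below are assumptions; the defaults of 20 calls and 50% mirror Hystrix's documented defaults for `circuitBreaker.requestVolumeThreshold` and `circuitBreaker.errorThresholdPercentage`, and should be treated as starting points rather than recommendations.

```python
def should_trip(total_calls, failed_calls, min_volume=20, error_threshold_pct=50.0):
    """Return True only when both gates are met for the current window."""
    # Gate 1: not enough traffic to trust the percentage.
    if total_calls < min_volume:
        return False
    # Gate 2: error percentage at or above the threshold.
    error_pct = 100.0 * failed_calls / total_calls
    return error_pct >= error_threshold_pct
```

For example, 1 failure in 2 calls is a 50% error rate but stays closed because the volume gate is not met, while 10 failures in 20 calls trips.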
Adaptive
Exponential-backoff cooldowns that grow on every re-trip, plus a 10 → 25 → 50 → 100% recovery ramp so the downstream isn't slammed the moment it comes back.
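Both halves of the adaptive variant can be sketched as small pure functions. Everything here is illustrative: the names `cooldown_s`, `admit`, and `next_stage`, the 5-second base, and the 300-second cap are assumptions. Only the doubling and the 10/25/50/100% ramp come from the description above.

```python
RAMP = (0.10, 0.25, 0.50, 1.00)  # share of traffic admitted at each recovery stage


def cooldown_s(consecutive_trips, base_s=5.0, cap_s=300.0):
    """Cooldown doubles on each consecutive re-trip, up to a cap (assumed values)."""
    return min(cap_s, base_s * (2 ** max(0, consecutive_trips - 1)))


def admit(stage, draw):
    """Admit a call when draw, a uniform sample in [0, 1), falls under the stage's share."""
    return draw < RAMP[stage]


def next_stage(stage, probe_ok):
    """Advance one step on healthy probes; None means re-trip and start over."""
    if not probe_ok:
        return None
    return min(stage + 1, len(RAMP) - 1)
```

In use, a caller would pass `random.random()` as `draw` for each incoming request and reject it when `admit` returns False. How many healthy probes are required before advancing a stage is a tuning decision this sketch leaves to the caller.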
Related tracks
If this one clicks, try these next.
Consistent Hashing
Map keys to servers so that adding or removing a server moves as few keys as possible. Five methods, from the classic hash ring to the table-based hashing inside modern network load balancers.
Rate Limiting
Control request throughput so a noisy client cannot starve everyone else. Compare the five canonical algorithms side-by-side.
Cache Write Policies
Three ways to handle a write when you have a cache in front of the store. Each policy is a different bet about durability, throughput, and how stale your data is allowed to get.