Intermediate

Circuit Breaker

When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.

distributed · resilience · patterns

What is Circuit Breaker?

The 60-second primer

A circuit breaker stops you from calling something that is already broken. Borrow the metaphor from your house: when a wire is shorting, the breaker pops and cuts the power. It hurts now — your lamps go off — but it prevents the fire that would happen if current kept flowing. Software does the same thing. If your service keeps calling a downstream that is timing out or returning errors, you pile up threads, connections, and queued requests waiting on something that will not answer. The breaker watches the call results, and once they look bad, it just says no to new calls for a while — fast — so the rest of your system stays healthy.

Every breaker has the same three states. Closed means traffic flows normally and the breaker is just watching. Open means the breaker has tripped and every call is rejected instantly without even being attempted. Half-Open is the careful peek in between: after a cooldown, a small number of trial calls are allowed through to see if the downstream has recovered. If those trials succeed, the breaker closes again; if they fail, it opens for another cooldown. That is the whole pattern. The six variants in this topic differ only in which signal trips the breaker and how cleverly it recovers.
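The three states can be sketched in a few dozen lines. This is a minimal, single-threaded illustration rather than any library's implementation. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `cooldown_s`, and `trial_calls` are made up for this example, and the trip rule (a run of consecutive failures) is only a placeholder, since the variants differ precisely in that rule.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=5.0, trial_calls=2,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # placeholder trip rule (assumption)
        self.cooldown_s = cooldown_s                # how long to stay Open
        self.trial_calls = trial_calls              # successes needed to close again
        self.clock = clock                          # injectable so tests can fake time
        self.state = State.CLOSED
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit is open")  # fail fast
            # Cooldown elapsed: start letting trial calls through.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the failure streak

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # any failed trial reopens the breaker
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

A real breaker would also cap how many trial calls run concurrently in Half-Open and guard its counters with a lock; both are left out here to keep the state transitions visible.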

If you remember nothing else, remember the goal: fail fast, protect the system, give the downstream a break, and probe gently before letting traffic back in. The rest is engineering choices about what counts as 'looks bad' — error rate? latency? error rate but only if you have enough samples to trust it? — and how cautiously to come back.

Where this shows up

  • Microservices — service A calls service B which calls service C. If C is dying, every blocked call inside A holds a thread, a TCP connection, and a slot in A's queue. Without a breaker, A dies because C dies. With one, A trips early and stays up to serve the requests that don't need C.
  • API gateways & service meshes — Envoy, Istio, Linkerd, and AWS App Mesh all ship circuit-breaking at the proxy layer so individual services don't each have to reinvent it.
  • Client SDKs — some SDKs for cloud services build in breaker-like protections (the AWS SDKs' standard retry mode, for example, caps retries with a retry quota) so a noisy outage in one region doesn't snowball through every call your app makes.
  • Browser-side requests — even a single-page app can benefit: if the analytics endpoint is down, trip and stop sending events instead of stacking up queued retries that consume memory and bandwidth.
  • Anywhere a slow neighbour can poison the pool — database connection pools, HTTP client pools, thread pools. The breaker frees you from the cascading-failure pattern where one slow dependency takes the entire upstream down.

It's a feedback loop, not a switch

Beginners often picture the breaker as 'check a flag, fail if set.' That is the easy part. The interesting work is measuring: what counts as a failure (just exceptions? slow responses? both?), how big a window, how many samples are enough to trust the number, and how to come back without instantly tripping again. Every variant here picks a different answer for those questions.
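Those measurement questions map onto a handful of parameters. The sketch below is one possible answer, assuming a count-based window: `is_failure`, `CountWindow`, `min_samples`, and `slow_after_s` are names invented for this illustration, and the default values are arbitrary.

```python
from collections import deque


def is_failure(error, elapsed_s, slow_after_s=2.0):
    """One possible definition: an exception OR a too-slow response counts."""
    return error is not None or elapsed_s > slow_after_s


class CountWindow:
    """Failure rate over the last `size` calls, gated by a minimum sample count."""

    def __init__(self, size=10, min_samples=5, failure_rate_threshold=0.5):
        self.outcomes = deque(maxlen=size)  # True = failure; oldest entries fall off
        self.min_samples = min_samples
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Too few samples: the rate is not trustworthy yet, so never trip.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.failure_rate() >= self.failure_rate_threshold
```

With `min_samples=5`, two failures out of two calls do not trip the breaker, which is the "enough samples to trust the number" question answered in code.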

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

01 State Machine
  • Trip signal: (foundation)
  • Best at catching: The three-state dance itself
  • Recovery style: Closed → Open → Half-Open
  • Use when: Learning the pattern before any trip rule

02 Count-Based
  • Trip signal: Fail rate over last N calls
  • Best at catching: Sustained error bursts
  • Recovery style: Fixed cooldown, then half-open trials
  • Use when: Steady traffic, simple default

03 Time-Based
  • Trip signal: Fail rate over last N seconds
  • Best at catching: Self-recovering during quiet periods
  • Recovery style: Fixed cooldown, then half-open trials
  • Use when: Bursty traffic where call counts are uneven

04 Slow Call Rate
  • Trip signal: % of calls slower than D ms
  • Best at catching: Latency-only failures (200 OK in 5s)
  • Recovery style: Fixed cooldown, then half-open trials
  • Use when: Latency-sensitive paths; downstreams that hang, not error

05 Error Percentage
  • Trip signal: Min volume AND error % both met
  • Best at catching: Real outages, not 1 fail of 2 calls
  • Recovery style: Fixed cooldown, then a single trial call
  • Use when: Low-volume endpoints; production defaults (Hystrix lineage)

06 Adaptive
  • Trip signal: Error % (closed) + probe results (recovering)
  • Best at catching: Repeat outages; gets stricter each retry
  • Recovery style: Exp-backoff cooldown + 10/25/50/100% ramp
  • Use when: Flaky dependencies where naïve recovery slams them back open
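To make the Time-Based and Slow Call Rate rows concrete, here is a rough sketch of a window that keeps only the last N seconds of results and reports both a failure rate and a slow-call rate. `TimeWindow`, `window_s`, and `slow_after_s` are hypothetical names, and keeping every sample in a deque is a simplification; a production library would more likely aggregate into time buckets.

```python
import time
from collections import deque


class TimeWindow:
    """Tracks failure rate and slow-call rate over the last `window_s` seconds."""

    def __init__(self, window_s=10.0, slow_after_s=1.0, clock=time.monotonic):
        self.window_s = window_s
        self.slow_after_s = slow_after_s  # the "D" threshold from the table, in seconds
        self.clock = clock                # injectable so tests can fake time
        self.samples = deque()            # (timestamp, failed, slow)

    def record(self, failed, duration_s):
        is_slow = duration_s > self.slow_after_s
        self.samples.append((self.clock(), bool(failed), is_slow))

    def _evict(self):
        # Drop samples older than the window; quiet periods empty it naturally.
        cutoff = self.clock() - self.window_s
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def rates(self):
        """Return (failure_rate, slow_call_rate) for the current window."""
        self._evict()
        n = len(self.samples)
        if n == 0:
            return 0.0, 0.0
        failures = sum(1 for _, failed, _ in self.samples if failed)
        slow = sum(1 for _, _, is_slow in self.samples if is_slow)
        return failures / n, slow / n
```

A call that returns 200 OK after five seconds is recorded as `record(False, 5.0)`: it does not move the failure rate at all, but it does raise the slow-call rate, which is the signal variant 04 trips on.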

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.

How to pick

  • You're new and just want one → Count-Based with a 50% threshold over the last 10–20 calls. It's the default in many libraries and you can ship it today.
  • Your traffic is bursty (cron jobs, periodic syncs, sparse user activity) → Time-Based. Quiet minutes shouldn't keep the breaker open forever just because the last few calls happened to fail.
  • The downstream lies about being healthy by going slow → Slow Call Rate. Errors aren't the only failure mode; a latency spike is enough to exhaust your thread pool.
  • Low-volume endpoint where one bad call in three would otherwise trip you → Error Percentage. The volume gate is exactly the noise filter you need.
  • The downstream comes back, dies, comes back, dies → Adaptive. Each re-trip doubles the cooldown and the ramp keeps you from blasting it the moment it shows a pulse.
  • Use the State Machine first — it's the foundation. Every other variant is a different rule for when to trip; the close-open-half-open dance is identical.
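The Adaptive bullet packs two mechanisms into one sentence, so a small sketch may help. It assumes the cooldown doubles on each consecutive re-trip up to a cap, and that recovery admits traffic at 10%, 25%, 50%, then 100%. `AdaptiveRecovery`, `RAMP_STEPS`, and the method names are invented for this illustration; deciding when a ramp step counts as "healthy" is left to the caller.

```python
import random

RAMP_STEPS = (0.10, 0.25, 0.50, 1.00)  # share of traffic admitted at each step


class AdaptiveRecovery:
    def __init__(self, base_cooldown_s=5.0, max_cooldown_s=300.0, rng=random.random):
        self.base_cooldown_s = base_cooldown_s
        self.max_cooldown_s = max_cooldown_s  # cap is an assumption of this sketch
        self.rng = rng                        # injectable for deterministic tests
        self.trips = 0                        # consecutive trips without full recovery
        self.step = 0                         # index into RAMP_STEPS

    def on_trip(self):
        """Return the cooldown for this trip; it doubles on each re-trip."""
        cooldown = min(self.base_cooldown_s * (2 ** self.trips), self.max_cooldown_s)
        self.trips += 1
        self.step = 0  # a re-trip restarts the ramp from 10%
        return cooldown

    def admit(self):
        """During recovery, let a call through with the current ramp probability."""
        return self.rng() < RAMP_STEPS[self.step]

    def on_step_healthy(self):
        """Advance the ramp; return True once traffic is fully restored."""
        if self.step < len(RAMP_STEPS) - 1:
            self.step += 1
            return False
        self.trips = 0  # full recovery resets the backoff
        return True
```

With the defaults, three trips in a row give cooldowns of 5, 10, and 20 seconds, and the backoff only resets after the 100% step has also looked healthy.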

Pair it with the rest

A breaker is not a complete resilience strategy. It works best with timeouts (so a slow call eventually returns failure), retries with jitter (so transient blips don't even reach the breaker), and fallbacks (so a request can still produce something when the breaker is open). The breaker says 'no', but a good system also says 'here's what we can give you instead.'
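One way these pieces might fit around a single call is sketched below. Everything here is illustrative: `resilient_call`, `allow`, and `record` are hypothetical hooks standing in for whatever breaker you use, the delays follow a "full jitter" exponential backoff, and `fn` is assumed to enforce its own timeout (for example through the HTTP client's timeout setting).

```python
import random
import time


def resilient_call(fn, fallback, allow, record, attempts=3,
                   base_s=0.1, cap_s=2.0, sleep=time.sleep, rng=random.random):
    """Breaker check, then retries with jitter, then fallback."""
    if not allow():
        return fallback()  # breaker is open: answer without touching the downstream
    for attempt in range(attempts):
        try:
            result = fn()  # assumed to raise on error or on its own timeout
        except Exception:
            if attempt == attempts - 1:
                record(True)  # only the final failure is reported to the breaker
                return fallback()
            # Full jitter: wait a random time in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap_s, base_s * (2 ** attempt)))
        else:
            record(False)
            return result
```

Reporting only the final outcome is one design choice among several: it keeps transient blips away from the breaker, at the cost of hiding the retried failures from its statistics.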
