Intermediate

Circuit Breaker

When a downstream service is failing, stop hammering it — fail fast instead. Six variants, from the state machine itself to the trip-condition tweaks that production resilience libraries actually ship.

distributed · resilience · patterns

What is Circuit Breaker?

The 60-second primer

A circuit breaker stops you from calling something that is already broken. Borrow the metaphor from your house: when a wire is shorting, the breaker pops and cuts the power. It hurts now — your lamps go off — but it prevents the fire that would happen if current kept flowing. Software does the same thing. If your service keeps calling a downstream that is timing out or returning errors, you pile up threads, connections, and queued requests waiting on something that will not answer. The breaker watches the call results, and once they look bad, it just says no to new calls for a while — fast — so the rest of your system stays healthy.

Every breaker has the same three states. Closed means traffic flows normally and the breaker is just watching. Open means the breaker has tripped and every call is rejected instantly without even being attempted. Half-Open is the careful peek in between: after a cooldown, a small number of trial calls are allowed through to see if the downstream has recovered. If those trials succeed, the breaker closes again; if they fail, it opens for another cooldown. That is the whole pattern. The six variants in this topic differ only in which signal trips the breaker and how cleverly it recovers.
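The three states described above can be sketched in a few lines. This is a minimal, single-threaded illustration, not any library's implementation: the class name `CircuitBreaker`, its parameters, and the placeholder trip rule (a run of consecutive failures) are all assumptions made for the example, and the clock is injectable so the cooldown can be tested without sleeping.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted."""


class CircuitBreaker:
    # Hypothetical sketch: trip rule is "N consecutive failures" (an assumption).
    def __init__(self, failure_threshold=3, cooldown_s=5.0, trial_calls=2,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.trial_calls = trial_calls  # successes needed in Half-Open to close
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("breaker is open")  # fail fast
            # Cooldown elapsed: let trial calls through.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the failure streak

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # any failed trial reopens the breaker
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

A production breaker would also cap how many trial calls run concurrently in Half-Open and guard the state with a lock; both are left out here to keep the state transitions visible.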

If you remember nothing else, remember the goal: fail fast, protect the system, give the downstream a break, and probe gently before letting traffic back in. The rest is engineering choices about what counts as 'looks bad' — error rate? latency? error rate but only if you have enough samples to trust it? — and how cautiously to come back.

Where this shows up

Microservices — service A calls service B which calls service C. If C is dying, every blocked call inside A holds a thread, a TCP connection, and a slot in A's queue. Without a breaker, A dies because C dies. With one, A trips early and stays up to serve the requests that don't need C.
API gateways & service meshes — Envoy, Istio, Linkerd, and AWS App Mesh all ship circuit-breaking at the proxy layer so individual services don't each have to reinvent it.
Client SDKs — some official SDKs for cloud services embed breaker-style limits on their built-in retries so a noisy outage in one region doesn't snowball through every call your app makes.
Browser-side requests — even a single-page app can benefit: if the analytics endpoint is down, trip and stop sending events instead of stacking up retries in a queue that blocks the main thread.
Anywhere a slow neighbour can poison the pool — database connection pools, HTTP client pools, thread pools. The breaker frees you from the cascading-failure pattern where one slow dependency takes the entire upstream down.

It's a feedback loop, not a switch

Beginners often picture the breaker as 'check a flag, fail if set.' That is the easy part. The interesting work is measuring: what counts as a failure (just exceptions? slow responses? both?), how big a window, how many samples are enough to trust the number, and how to come back without instantly tripping again. Every variant here picks a different answer for those questions.
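To make those measuring questions concrete, here is a rough sketch of one possible set of answers: a fixed-size window of recent outcomes, a minimum sample count before the rate is trusted, and a classifier that treats slow responses as failures. The names `CountWindow` and `is_failure`, and every default value, are illustrative assumptions rather than settings from a particular library.

```python
from collections import deque


def is_failure(exc, duration_s, slow_after_s=2.0):
    """One possible definition: an exception OR a slow response is a failure."""
    return exc is not None or duration_s > slow_after_s


class CountWindow:
    """Sliding window over the last `size` calls (hypothetical helper)."""

    def __init__(self, size=10, min_samples=5, failure_rate_threshold=0.5):
        self.outcomes = deque(maxlen=size)  # True means the call failed
        self.min_samples = min_samples
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed):
        self.outcomes.append(failed)  # oldest outcome drops off automatically

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Too few samples: the rate is not trustworthy yet.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.failure_rate() >= self.failure_rate_threshold
```

The `min_samples` gate is what keeps two failures out of two calls from tripping the breaker; raising it trades slower detection for fewer false trips.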

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

Variant	Trip signal	Best at catching	Recovery style	Use when
01 State Machine	(foundation)	The three-state dance itself	Closed → Open → Half-Open	Learning the pattern before any trip rule
02 Count-Based	Fail rate over last N calls	Sustained error bursts	Fixed cooldown, then half-open trials	Steady traffic, simple default
03 Time-Based	Fail rate over last N seconds	Self-recovering during quiet periods	Fixed cooldown, then half-open trials	Bursty traffic where call counts are uneven
04 Slow Call Rate	% of calls slower than D ms	Latency-only failures (200 OK in 5s)	Fixed cooldown, then half-open trials	Latency-sensitive paths; downstreams that hang, not error
05 Error Percentage	Min volume AND error % both met	Real outages, not 1 fail of 2 calls	Fixed cooldown, then a single trial call	Low-volume endpoints; production defaults (Hystrix lineage)
06 Adaptive	Error % (closed) + probe results (recovering)	Repeat outages; gets stricter each retry	Exp-backoff cooldown + 10/25/50/100% ramp	Flaky dependencies where naïve recovery slams them back open
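
Two of the trip signals in the table, failure rate over the last N seconds and the share of calls slower than D, can be read off the same time-based window. The sketch below is one way to do that; `TimeWindow`, its parameters, and the injectable `clock` are assumptions for illustration only.

```python
import time
from collections import deque


class TimeWindow:
    """Keeps calls from the last `window_s` seconds (hypothetical helper)."""

    def __init__(self, window_s=10.0, slow_after_s=1.0, clock=time.monotonic):
        self.window_s = window_s
        self.slow_after_s = slow_after_s  # the "D" threshold, in seconds
        self.clock = clock
        self.calls = deque()  # entries are (timestamp, failed, duration_s)

    def record(self, failed, duration_s):
        self.calls.append((self.clock(), failed, duration_s))

    def _evict(self):
        # Drop entries that have aged out of the window.
        cutoff = self.clock() - self.window_s
        while self.calls and self.calls[0][0] <= cutoff:
            self.calls.popleft()

    def rates(self):
        """Return (failure_rate, slow_rate, sample_count) for the live window."""
        self._evict()
        n = len(self.calls)
        if n == 0:
            return 0.0, 0.0, 0
        failed = sum(1 for _, f, _ in self.calls if f)
        slow = sum(1 for _, _, d in self.calls if d > self.slow_after_s)
        return failed / n, slow / n, n
```

Because eviction depends only on the clock, a quiet period empties the window by itself, which is the self-recovering behaviour the Time-Based row describes.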

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.

How to pick

You're new and just want one → Count-Based with a 50% threshold over the last 10–20 calls. It's the default in many libraries and you can ship it today.
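Your traffic is bursty (cron jobs, periodic syncs, sparse user activity) → Time-Based. In a count window, old failures linger until newer calls push them out, so a few stale errors can still count against the downstream long after a quiet spell; a time window lets them expire on their own.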
The downstream lies about being healthy by going slow → Slow Call Rate. Errors aren't the only failure mode; a latency spike is enough to exhaust your thread pool.
Low-volume endpoint where one bad call in three would otherwise trip you → Error Percentage. The volume gate is exactly the noise filter you need.
The downstream comes back, dies, comes back, dies → Adaptive. Each re-trip doubles the cooldown and the ramp keeps you from blasting it the moment it shows a pulse.
Whichever you pick, learn the State Machine first: it's the foundation. Every other variant is a different rule for when to trip; the closed, open, half-open dance is identical.
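
The Adaptive recovery described above (a cooldown that doubles on each re-trip, plus a 10/25/50/100% ramp) might look roughly like this. The helper names, the 300-second cap, and the use of random sampling to admit a fraction of traffic are assumptions for the sketch; real libraries may schedule the ramp differently.

```python
import random

RAMP_STEPS = (0.10, 0.25, 0.50, 1.00)  # fraction of traffic admitted per step


def next_cooldown(base_s, consecutive_trips, max_s=300.0):
    """Double the cooldown on every consecutive re-trip, capped at max_s."""
    return min(base_s * (2 ** (consecutive_trips - 1)), max_s)


class RecoveryRamp:
    """Admits a growing share of calls while the downstream recovers."""

    def __init__(self, steps=RAMP_STEPS, rng=random.random):
        self.steps = steps
        self.index = 0
        self.rng = rng  # injectable so the sketch can be tested deterministically

    @property
    def admit_fraction(self):
        return self.steps[self.index]

    def allow(self):
        # rng() is in [0, 1), so a fraction of 1.0 admits every call.
        return self.rng() < self.admit_fraction

    def advance(self):
        """Move to the next step; return True once traffic is fully restored."""
        if self.index < len(self.steps) - 1:
            self.index += 1
        return self.admit_fraction >= 1.0

    def reset(self):
        """Call on a re-trip so the next recovery starts from the first step."""
        self.index = 0
```

In this sketch the caller decides when a step has gone well enough to call `advance()`, for example after a batch of admitted calls stays under the error threshold.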

Pair it with the rest

A breaker is not a complete resilience strategy. It works best with timeouts (so a slow call eventually returns failure), retries with jitter (so transient blips don't even reach the breaker), and fallbacks (so a request can still produce something when the breaker is open). The breaker says 'no', but a good system also says 'here's what we can give you instead.'
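
One way to wire those pieces together is sketched below. Everything here is hypothetical: `call_with_fallback`, the `allow` and `record` hooks standing in for a breaker, and the assumption that `fn` enforces its own timeout and raises when it expires. Retries sit inside the breaker's view, so only the final outcome is reported.

```python
import random
import time


def call_with_fallback(fn, fallback, allow, record=lambda failed: None,
                       attempts=3, base_delay_s=0.1,
                       sleep=time.sleep, rng=random.random):
    """Try fn with jittered retries; use fallback when the breaker says no
    or every attempt fails. `allow` and `record` are assumed breaker hooks."""
    if not allow():
        return fallback()  # breaker is open: skip the call entirely
    for attempt in range(attempts):
        try:
            result = fn()  # assumed to raise on error or timeout
        except Exception:
            if attempt < attempts - 1:
                # Full jitter: random delay in [0, base * 2^attempt).
                sleep(rng() * base_delay_s * 2 ** attempt)
        else:
            record(False)  # report one success to the breaker
            return result
    record(True)  # all attempts failed: report one failure
    return fallback()
```

Reporting a single outcome per logical request means a blip that succeeds on retry never counts against the downstream, while a request that exhausts its retries counts once rather than three times.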

Concepts in this track

6 concepts, in order

Each links to a concept page with its own explanation, prototype, and quiz.

The State Machine

Closed lets calls through. Open rejects them all. Half-Open peeks at recovery with a handful of trials. Learn the three-state dance before any of the trip-rules.

Beginner · 9 min

Count-Based

Trip on the failure rate across the last N calls. The simplest sliding window — small, fixed memory, fast to evaluate.

Beginner · 9 min

Time-Based

Trip on the failure rate over the last N seconds. Old calls age out on their own, so quiet periods reset the failure rate without any new calls.

Intermediate · 10 min

Slow Call Rate

Trip on latency, not just errors. A downstream that returns 200 OK in 5 seconds is broken too — and this is the variant that catches it.

Intermediate · 10 min

Error Percentage (Hystrix)

Two gates: trip only if traffic clears a minimum volume AND the error rate crosses a threshold. This is the original Netflix Hystrix rule, which stops tiny, noisy samples from tripping the breaker.

Intermediate · 11 min

Adaptive

Exponential-backoff cooldowns that grow on every re-trip, plus a 10 → 25 → 50 → 100% recovery ramp so the downstream isn't slammed the moment it comes back.

Advanced · 12 min

Related tracks

If this one clicks, try these next.

Consistent Hashing

Map keys to servers so that adding or removing a server moves as few keys as possible. Five methods, from the classic hash ring to the table-based hashing inside modern network load balancers.

Intermediate · 5 concepts · 55 min

distributed · scaling · patterns

Consensus

How a cluster agrees on a single answer when nodes die, packets drop, and some machines may even lie. Seven algorithms, from the two-phase commit that everyone learns first to the Byzantine-fault-tolerant PBFT.

Advanced · 7 concepts · 90 min

distributed · consistency · patterns

Leader Election

How a cluster picks one node to be in charge — and how it picks again when that node falls over. Five algorithms, from the textbook ID-based shouting matches to the lease-and-watcher schemes real coordination services ship.

Intermediate · 5 concepts · 60 min

distributed · coordination · patterns