Slow Call Rate — Circuit Breaker

Overview

What this concept solves

Sometimes the downstream isn't broken — it's just slow. A database returning correct rows in 5 seconds will exhaust your thread pool just as effectively as one returning 500s instantly. The classic error-only breaker can't see that: every call returns 200 OK and the breaker stays closed while your service drowns.

The Slow Call Rate variant adds a latency dimension. Each call is timed, and any call that takes at least a configured duration D is counted as 'slow'. If the share of slow calls in the window reaches a rate threshold (50% in this prototype), the breaker trips just as it would for errors.

It's not a replacement for the error-based rule — most production setups run both. Errors trip on outright failure; slow-call-rate trips on degraded performance. Together they catch the two ways a dependency can take your service down.

Mechanics

How it works

The trip rule

On every call, measure the duration (start_time → end_time).
Push the duration into a sliding window (count- or time-based — both work, this prototype uses count-based for clarity).
Count how many durations are ≥ D (the slow threshold).
If that count divided by the number of calls currently in the window reaches the slow-rate threshold (e.g. 50%), and the window holds at least a minimum number of calls, trip.
OPEN, cooldown, HALF-OPEN, trials — the rest is the standard state machine, with the same trip rule (still latency-based) applied to the trial calls.
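
The steps above use a count-based window. As a rough sketch of the time-based alternative mentioned in step 2, the snippet below keeps only samples newer than a fixed horizon and computes the slow share over whatever remains. The names (timeWindow, sample, horizon) and all numbers are illustrative assumptions, not taken from the prototype or from any library.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one recorded call: when it finished and how long it took.
type sample struct {
	at  time.Time
	dur time.Duration
}

// timeWindow keeps samples from the last `horizon` of wall-clock time.
type timeWindow struct {
	horizon time.Duration
	samples []sample
}

// add records a call and evicts samples older than the horizon.
func (w *timeWindow) add(now time.Time, dur time.Duration) {
	w.samples = append(w.samples, sample{at: now, dur: dur})
	cutoff := now.Add(-w.horizon)
	i := 0
	for i < len(w.samples) && w.samples[i].at.Before(cutoff) {
		i++
	}
	w.samples = w.samples[i:]
}

// slowRate returns the share of samples with dur >= slow, plus the sample count.
func (w *timeWindow) slowRate(slow time.Duration) (float64, int) {
	n := len(w.samples)
	if n == 0 {
		return 0, 0
	}
	count := 0
	for _, s := range w.samples {
		if s.dur >= slow {
			count++
		}
	}
	return float64(count) / float64(n), n
}

// demo feeds four calls, one every 4 seconds, into a 10-second window.
func demo() (float64, int) {
	w := &timeWindow{horizon: 10 * time.Second}
	t0 := time.Unix(0, 0)
	w.add(t0, 900*time.Millisecond)                     // slow, evicted by the last add
	w.add(t0.Add(4*time.Second), 100*time.Millisecond)  // fast
	w.add(t0.Add(8*time.Second), 700*time.Millisecond)  // slow
	w.add(t0.Add(12*time.Second), 120*time.Millisecond) // fast
	return w.slowRate(500 * time.Millisecond)
}

func main() {
	rate, n := demo()
	fmt.Printf("slow rate %.2f over %d calls\n", rate, n)
}
```

A time-based window forgets old slowness on its own schedule, but the number of samples behind the rate varies with traffic, so a minimum-calls guard matters even more here than in the count-based version.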

Choose D from your SLO, not your gut

The slow-duration D should come from what callers consider acceptable, not from what feels round. If your p99 SLO is 500ms, set D somewhere around there: D ≪ 500ms is too sensitive (it trips on normal noise), while D ≫ 500ms is too lax (it trips only after you've already broken the SLO). One rule of thumb is to start at half your per-call timeout and then adjust against measured latencies.
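
One way to sanity-check a candidate D is to replay it against latencies recorded during healthy traffic and see what share of calls it would have marked slow. The sketch below does that with a made-up baseline; the sample values and the slowShare helper are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// baseline is a made-up set of healthy-traffic latencies, for illustration only.
var baseline = []time.Duration{
	80 * time.Millisecond, 95 * time.Millisecond, 110 * time.Millisecond,
	120 * time.Millisecond, 140 * time.Millisecond, 180 * time.Millisecond,
	210 * time.Millisecond, 260 * time.Millisecond, 340 * time.Millisecond,
	620 * time.Millisecond,
}

// slowShare returns the fraction of samples that a threshold d would mark slow.
func slowShare(samples []time.Duration, d time.Duration) float64 {
	if len(samples) == 0 {
		return 0
	}
	slow := 0
	for _, s := range samples {
		if s >= d {
			slow++
		}
	}
	return float64(slow) / float64(len(samples))
}

func main() {
	candidates := []time.Duration{100 * time.Millisecond, 500 * time.Millisecond, 2 * time.Second}
	for _, d := range candidates {
		fmt.Printf("D=%v marks %.0f%% of healthy calls as slow\n", d, 100*slowShare(baseline, d))
	}
}
```

If a candidate D already marks a large share of healthy calls as slow, the breaker will trip on ordinary noise. If it marks none even when the tail is past the SLO, it is probably too lax.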

Errors AND slow-call-rate, in parallel

Resilience4j, for example, exposes both a failure-rate threshold and a slow-call-rate threshold on the same breaker, and either one can trip it. An error-rate trip catches outright 5xx; a slow-call-rate trip catches the trickier case where the dependency hangs but eventually returns. If you only enable one, prefer slow-call-rate for any dependency that holds a thread or connection, because the slow-call case is the one that actually cascades.
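
As a minimal sketch of running both rules over the same window, the function below trips when either the failure share or the slow share reaches its threshold. The outcome and rules types, the shouldTrip helper, and every threshold value are assumptions made for this example, not the API of Resilience4j or any other library.

```go
package main

import (
	"fmt"
	"time"
)

// outcome is one finished call as the breaker sees it.
type outcome struct {
	dur    time.Duration
	failed bool
}

// rules holds the two thresholds; all values used below are illustrative.
type rules struct {
	slow        time.Duration // calls at or above this count as slow
	slowRate    float64       // trip when the slow share reaches this
	failureRate float64       // trip when the failed share reaches this
	minCalls    int           // calls required before either rule applies
}

// shouldTrip reports whether either rule fires over the window, and which one.
func shouldTrip(window []outcome, r rules) (bool, string) {
	n := len(window)
	if n == 0 || n < r.minCalls {
		return false, ""
	}
	slow, failed := 0, 0
	for _, o := range window {
		if o.failed {
			failed++
		}
		if o.dur >= r.slow {
			slow++
		}
	}
	switch {
	case float64(failed)/float64(n) >= r.failureRate:
		return true, "failure rate"
	case float64(slow)/float64(n) >= r.slowRate:
		return true, "slow-call rate"
	}
	return false, ""
}

var demoRules = rules{slow: 500 * time.Millisecond, slowRate: 0.5, failureRate: 0.5, minCalls: 5}

// slowButOK: every call succeeded, but 3 of 5 were slow.
var slowButOK = []outcome{
	{600 * time.Millisecond, false}, {700 * time.Millisecond, false},
	{100 * time.Millisecond, false}, {800 * time.Millisecond, false},
	{120 * time.Millisecond, false},
}

// fastButFailing: every call was fast, but 3 of 5 failed.
var fastButFailing = []outcome{
	{20 * time.Millisecond, true}, {20 * time.Millisecond, true},
	{20 * time.Millisecond, false}, {20 * time.Millisecond, true},
	{20 * time.Millisecond, false},
}

func main() {
	fmt.Println(shouldTrip(slowButOK, demoRules))
	fmt.Println(shouldTrip(fastButFailing, demoRules))
}
```

An error-only breaker would leave the first window untouched, and a latency-only breaker would leave the second one untouched, which is the argument for enabling both.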

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.


About this simulation

The breaker trips on latency, not errors. Every bar is one call's response time. Calls below the dotted line are fast (green); above it, slow (amber). Once the share of slow calls crosses the rate threshold, the breaker trips — even though every single call 'succeeded'. Drag Downstream latency above the slow line to see it happen on its own.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Watch the bars colour by speed

Click fast call a few times — short green bars. Click slow call a few times — taller amber bars that overshoot the dotted line. The bar height is the latency; the threshold line and the colour both come from your Slow if over setting.

try 02

Trip on slow alone (no errors)

Click slow call five times in a row. The bottom rate bar climbs into red — every call 'succeeded' but the slow-call rate is 100%. The breaker trips. The point: an error-only breaker would still be sitting at green right now.

try 03

Auto traffic with rising latency

Tick Auto traffic. Slowly raise Downstream latency past the Slow if over value. Watch the bars turn amber and the rate climb. Drop latency back down to recover.

try 04

Tune the slow-duration

Set Downstream latency to a fixed value, e.g. 400ms. Now slide Slow if over from 100ms to 1200ms. Same calls, completely different verdict — when D=100, all are slow; when D=1200, none are. The take-away: this knob is your most important one.

In practice

When to use it — and what you give up

When to add Slow Call Rate to the mix

Synchronous calls that hold a thread — a slow downstream pins your thread pool. The slow-call-rate breaker frees those threads by failing fast.
Connection-pool-bound clients — same idea at the connection layer. A slow database doesn't error, but it holds every connection in your pool until you trip.
Strict latency SLOs — if 'too slow' is itself a contract violation, you'd rather degrade gracefully (fallback, cached value) than serve a slow response.
Downstreams that prefer to hang than fail — some services swallow errors and time out at the load balancer instead of returning fast errors. Latency is the only signal you'll see.
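
Failing fast only helps if the caller has something to serve instead. The following sketch shows one possible shape for the fallback mentioned above: when the guarded call reports an open circuit, return the last known good value and flag it as stale. The errOpen sentinel, the priceCache type, and getPrice are hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errOpen stands in for the error a breaker returns while it is open.
var errOpen = errors.New("circuit open")

// priceCache is a hypothetical last-known-good store keyed by SKU.
type priceCache map[string]float64

// getPrice tries the guarded call first. On success it refreshes the cache;
// when the breaker fails fast it serves the cached value and marks it stale.
func getPrice(sku string, guarded func(string) (float64, error), cache priceCache) (float64, bool, error) {
	p, err := guarded(sku)
	if err == nil {
		cache[sku] = p
		return p, false, nil
	}
	if errors.Is(err, errOpen) {
		if cached, ok := cache[sku]; ok {
			return cached, true, nil
		}
	}
	return 0, false, err
}

// demo makes one successful call, then one call against an open breaker.
func demo() (float64, bool, error) {
	cache := priceCache{}
	calls := 0
	guarded := func(string) (float64, error) {
		calls++
		if calls == 1 {
			return 9.99, nil // downstream healthy
		}
		return 0, errOpen // breaker has tripped
	}
	getPrice("sku-1", guarded, cache) // warms the cache
	return getPrice("sku-1", guarded, cache)
}

func main() {
	price, stale, err := demo()
	fmt.Println(price, stale, err)
}
```

Whether a stale value is acceptable depends on the data. For some calls the honest fallback is an explicit "temporarily unavailable" response rather than a cached answer.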

It pairs with timeouts; it doesn't replace them

A breaker is not a timeout. Without a timeout, a slow call still blocks the caller; the breaker only sees its duration after it eventually returns. Always set a per-call timeout first; the breaker then uses the bounded duration distribution to decide when to trip globally.
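
To make the pairing concrete, here is a minimal sketch of a per-call timeout using Go's context package. It assumes the downstream client honours context cancellation; callWithTimeout and hangingDependency are illustrative names, and the 50ms timeout is arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithTimeout runs work under a deadline and reports how long the caller
// actually waited. That bounded duration is what a breaker would record.
func callWithTimeout(parent context.Context, timeout time.Duration,
	work func(context.Context) error) (time.Duration, error) {
	start := time.Now()
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	err := work(ctx)
	return time.Since(start), err
}

// hangingDependency simulates a downstream that never answers on its own
// and only returns once the context is cancelled.
func hangingDependency(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo calls the hanging dependency with a 50ms timeout.
func demo() (time.Duration, error) {
	return callWithTimeout(context.Background(), 50*time.Millisecond, hangingDependency)
}

func main() {
	waited, err := demo()
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
	fmt.Println("caller waited about", waited.Round(10*time.Millisecond))
}
```

Without the deadline this call would never return, and the breaker would never get a duration to record. Note that a timed-out call only registers as slow if the timeout is at least D; with a shorter timeout it shows up as an error instead, which is one more reason to run the error-rate rule alongside.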

Pros

Catches the 'slow but not failing' case that error-only breakers miss entirely.
Composes with the error-rate rule — most libraries let you set both thresholds and either trips the breaker.
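Cheap to add: just record (duration, was-slow) per call. The comparison is O(1), and a running counter of slow calls keeps the rate update O(1) too (the reference code below rescans the window for simplicity).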
Operationally legible — 'we trip if ≥ 50% of calls take longer than 500 ms' is a one-sentence SLO you can show to product.

Cons

Tuning D matters — wrong D means false trips (too low) or missed degradations (too high).
Doesn't capture errors: you still need an error-rate rule running alongside it.
Sensitive to a single slow tail call in low-volume windows: 1 slow call out of 2 is already a 50% slow rate. A minimum-calls guard (minCalls in the reference code) keeps the rule from firing on so few samples.
Has to wait for the call to return to measure it — a permanently hanging call (no timeout) never registers as 'slow'. Hence the warning above.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

slow_call_rate.go

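// Slow-call-rate breaker: trip when too many calls take at least the slow threshold D.
// Same state machine as the error-rate breaker, just a different signal.
// Kept minimal for reading: it is not safe for concurrent use, so real code
// needs synchronization around the breaker state.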
package breaker

import (
	"errors"
	"time"
)

var ErrOpen = errors.New("circuit open")

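// SlowCallBreaker holds the breaker state plus its tuning knobs.
type SlowCallBreaker struct {
	state     string          // "CLOSED" | "OPEN" | "HALF"
	durations []time.Duration // latencies in the count-based sliding window
	trials    []time.Duration // latencies of trial calls while HALF
	openUntil time.Time       // when the cooldown ends and trials may start

	size          int           // maximum number of calls kept in the window
	slow          time.Duration // D: a call at or above this counts as slow
	rateThreshold float64       // trip when the slow share reaches this (0..1)
	cooldown      time.Duration // how long to stay OPEN before trying again
	trialCount    int           // trial calls to collect in HALF before deciding
	minCalls      int           // calls required in the window before the rule applies
}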

func NewSlowCallBreaker() *SlowCallBreaker {
	return &SlowCallBreaker{
		state:         "CLOSED",
		size:          10,
		slow:          500 * time.Millisecond,
		rateThreshold: 0.5,
		cooldown:      5000 * time.Millisecond,
		trialCount:    3,
		minCalls:      5,
	}
}

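// Call runs work through the breaker. It fails fast with ErrOpen while the
// breaker is OPEN; otherwise it times the call and records the duration.
func (cb *SlowCallBreaker) Call(work func() error) error {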
	if cb.state == "OPEN" && !time.Now().Before(cb.openUntil) {
		cb.state = "HALF"
		cb.trials = nil
	}
	if cb.state == "OPEN" {
		return ErrOpen
	}

	start := time.Now()
	err := work()
	cb.record(time.Since(start)) // record even on error
	return err
}

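// record feeds one duration into the trip rule for the current state.
func (cb *SlowCallBreaker) record(d time.Duration) {
	if cb.state == "HALF" {
		// Collect trial calls, then decide once: re-open or close.
		cb.trials = append(cb.trials, d)
		if len(cb.trials) >= cb.trialCount {
			if cb.slowRate(cb.trials) >= cb.rateThreshold {
				cb.trip()
			} else {
				cb.close()
			}
		}
		return
	}
	// CLOSED: slide the window, dropping the oldest sample when it is full.
	cb.durations = append(cb.durations, d)
	if len(cb.durations) > cb.size {
		cb.durations = cb.durations[1:]
	}
	// Apply the rule only once enough calls have been seen.
	if len(cb.durations) >= cb.minCalls && cb.slowRate(cb.durations) >= cb.rateThreshold {
		cb.trip()
	}
}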

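// slowRate returns the share of durations in arr that are at least cb.slow.
func (cb *SlowCallBreaker) slowRate(arr []time.Duration) float64 {
	if len(arr) == 0 {
		return 0 // avoid 0/0 (NaN) on an empty window
	}
	slow := 0
	for _, d := range arr {
		if d >= cb.slow {
			slow++
		}
	}
	return float64(slow) / float64(len(arr))
}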

func (cb *SlowCallBreaker) trip() {
	cb.state = "OPEN"
	cb.openUntil = time.Now().Add(cb.cooldown)
	cb.trials = nil
}

func (cb *SlowCallBreaker) close() {
	cb.state = "CLOSED"
	cb.durations = nil
	cb.trials = nil
}


Knowledge check

Did the prototype land?


question 01 / 03

Why isn't an error-rate breaker enough on its own?

question 02 / 03

What's the most important knob in this variant?

question 03 / 03

Why does this rule pair with — not replace — a per-call timeout?
