Intermediate · 11 min read · live prototype

Error Percentage (Hystrix)

Two gates: trip only if traffic clears a minimum volume AND the error rate crosses a threshold. The original Netflix Hystrix rule that stops tiny, noisy samples from tripping the breaker.

Overview

What this concept solves

Error Percentage is the rule Netflix Hystrix made famous, and the one most modern resilience libraries adopted. It addresses the most common embarrassment of a naive breaker: tripping when 1 of 2 calls failed because that's technically "50%". With a million requests, 50% errors mean catastrophe. With two requests, 50% means one customer's pizza was cold.

The fix is a second gate. The breaker tracks call results over a rolling window (usually the last few seconds, occasionally the last N calls) and adds a minimum request count below which the error rate is ignored. Only when there are enough samples to trust the number does the breaker check whether the error rate has crossed the percentage threshold. Both conditions must be true to trip; either one alone is not enough.

Half-Open also tends to be simpler in this variant: instead of N trials, many implementations let through exactly one trial request. Pass → close. Fail → re-open. That's it.
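The single-trial rule is small enough to write as one pure function. This is a minimal sketch, not any library's API: the `Transition` shape and the `afterTrial` name are made up for illustration, and the window reset on recovery follows the behaviour described later on this page.

```typescript
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

// Hypothetical result shape: the next state plus the two side effects
// the breaker has to perform.
interface Transition {
  state: State;
  resetWindow: boolean;     // forget old results after a successful trial
  restartCooldown: boolean; // start a fresh wait period after a failed trial
}

// Decide what happens once the one half-open trial call has finished.
function afterTrial(trialSucceeded: boolean): Transition {
  return trialSucceeded
    ? { state: "CLOSED", resetWindow: true, restartCooldown: false }
    : { state: "OPEN", resetWindow: false, restartCooldown: true };
}
```

There is no counter and no averaging: one boolean in, one state out.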

Mechanics

How it works

Two conditions, both required

  1. Maintain a rolling window of recent call results (Hystrix used a 10-second window split into ten 1-second buckets).
  2. Gate 1 — Volume: count requests in the window. If fewer than the minimum (e.g. 20 in 10 seconds), do nothing. The error rate is statistically meaningless on a small sample.
  3. Gate 2 — Error %: compute fails ÷ total. If this is below the threshold (e.g. 50%), do nothing.
  4. Both gates met → trip. OPEN, cooldown for the wait period.
  5. Cooldown done → HALF-OPEN. Allow exactly one trial. Pass → CLOSED, window reset. Fail → OPEN, cooldown restarts.
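Step 1 is the only part that needs a data structure. Below is a rough sketch of a bucketed rolling window in the spirit of the 1-second buckets mentioned above. The `RollingBuckets` class and its method names are hypothetical, not Hystrix's implementation, and the sketch assumes timestamps never go backwards.

```typescript
// One bucket per second; the window is the most recent `bucketCount` buckets.
interface Bucket { startSec: number; total: number; fails: number }

class RollingBuckets {
  private buckets: Bucket[] = [];

  constructor(private bucketCount = 10) {}

  // Record one call result observed at `nowMs` (milliseconds).
  record(nowMs: number, ok: boolean): void {
    const sec = Math.floor(nowMs / 1000);
    let last = this.buckets[this.buckets.length - 1];
    if (!last || last.startSec !== sec) {
      last = { startSec: sec, total: 0, fails: 0 };
      this.buckets.push(last);
    }
    last.total += 1;
    if (!ok) last.fails += 1;
    this.evict(sec);
  }

  // Sum every bucket still inside the window; both gates read these two numbers.
  totals(nowMs: number): { total: number; fails: number } {
    this.evict(Math.floor(nowMs / 1000));
    let total = 0;
    let fails = 0;
    for (const b of this.buckets) { total += b.total; fails += b.fails; }
    return { total, fails };
  }

  // Drop buckets that have aged out of the window.
  private evict(nowSec: number): void {
    const oldest = nowSec - this.bucketCount + 1;
    while (this.buckets.length > 0 && this.buckets[0].startSec < oldest) {
      this.buckets.shift();
    }
  }
}
```

Buckets keep memory bounded by the window length rather than by traffic: ten counters cover ten seconds whether the service sees 5 calls or 5 million.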

Why the volume gate is the most important knob

Many breaker incidents at scale trace back to a missing or too-low volume gate. To the rate-threshold rule, a 100% error rate over 1 sample looks identical to a 100% error rate over 10,000 samples. The volume gate is the only thing standing between you and tripping every low-traffic endpoint as soon as anyone touches it.
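To make that concrete, here is a small sketch comparing a rate-only rule with the two-gate rule on the same window statistics. The function names and the `WindowStats` shape are illustrative assumptions; the default values of 20 and 0.5 mirror the example numbers used on this page.

```typescript
interface WindowStats { total: number; fails: number }

// Rate-only rule: has no idea how many samples the rate is based on.
function naiveShouldTrip(s: WindowStats, errorThreshold = 0.5): boolean {
  return s.total > 0 && s.fails / s.total >= errorThreshold;
}

// Two-gate rule: check volume first, and only then look at the error rate.
function gatedShouldTrip(
  s: WindowStats,
  minVolume = 20,
  errorThreshold = 0.5,
): boolean {
  if (s.total < minVolume) return false;        // gate 1: volume
  return s.fails / s.total >= errorThreshold;   // gate 2: error %
}

const oneSample: WindowStats = { total: 1, fails: 1 };
const manySamples: WindowStats = { total: 10_000, fails: 10_000 };
```

With these inputs, `naiveShouldTrip` returns `true` for both windows, while `gatedShouldTrip` returns `false` for `oneSample` and `true` for `manySamples`.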

Default values worth knowing

Hystrix's historical defaults: 20 requests / 10 second window, 50% error rate threshold. Resilience4j defaults to minimumNumberOfCalls=100 over a 100-call sliding window with a 50% failure-rate threshold. Both are conservative — pick yours based on your actual baseline traffic, not the defaults.
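If you want those numbers in one place, a sketch like the following may help. The `BreakerDefaults` shape and the `withOverrides` helper are hypothetical and are not either library's real configuration API; only the values come from the paragraph above. The override of 5 is an arbitrary illustration, not a recommendation.

```typescript
// Hypothetical shape for illustration; not a real Hystrix or Resilience4j type.
interface BreakerDefaults {
  windowKind: "time" | "count";
  windowSize: number;            // seconds for "time", calls for "count"
  minimumCalls: number;          // the volume gate
  errorThresholdPercent: number; // the error-rate gate
}

// Values as quoted in the text above.
const HYSTRIX_DEFAULTS: BreakerDefaults = {
  windowKind: "time", windowSize: 10, minimumCalls: 20, errorThresholdPercent: 50,
};

const RESILIENCE4J_DEFAULTS: BreakerDefaults = {
  windowKind: "count", windowSize: 100, minimumCalls: 100, errorThresholdPercent: 50,
};

// Start from a preset and override only what your baseline traffic demands.
function withOverrides(
  base: BreakerDefaults,
  overrides: Partial<BreakerDefaults>,
): BreakerDefaults {
  return { ...base, ...overrides };
}

// Example: a low-traffic endpoint that would rarely reach 20 calls in 10 seconds.
const adminEndpoint = withOverrides(HYSTRIX_DEFAULTS, { minimumCalls: 5 });
```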

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

About this simulation

Both gates must be met for the breaker to trip. First, the request volume must clear a minimum, because too few samples are too noisy to act on. Only then does the error percentage matter. Try cranking errors to 100% while keeping traffic at 1 req/sec: the breaker stays closed. Then raise traffic, watch both bars cross, and see the verdict flip to "would trip".

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

The volume gate in action

Set Downstream errors to 100% and Traffic rate to 1/s. Tick Auto traffic and wait. The error bar is full red, but the verdict stays at "stays closed (too few requests)": the volume gate is blocking the trip. Now raise traffic to 8/s, watch both gates light up green and red, and see the verdict flip to "tripped open".

try 02

The half-open coin flip

Once tripped, watch for the wait to elapse — the badge goes amber (half-open). Click healthy call — one trial request, the breaker closes immediately. Trip it again; this time click failing call in half-open — it re-opens immediately. Compare with Count-Based, which averages N trials.

try 03

Tune the gate for your traffic

Drop Min requests to 3 with errors at 100% and traffic at 1/s. Now the breaker trips in three seconds. Set min back to 30 — even at 100% errors and 8/s traffic, you wait several seconds. Lower min = faster trips + more false alarms. Higher min = safer + slower to react.

try 04

Errors under threshold

Keep traffic high (gate met). Slide Downstream errors to 30%. The volume gate stays met, but the error gate doesn't, so the verdict reads "stays closed (errors under threshold)". Both must be true to trip. Either alone is not enough.

In practice

When to use it — and what you give up

When this is the right default

  • Almost any production HTTP service — if you don't have a strong reason to pick something else, start here. The volume gate alone makes this the safest default.
  • Low-volume endpoints — admin pages, less-used APIs. Without the volume gate, these trip from a stiff breeze.
  • Multi-tenant services where a single tenant's traffic might be tiny but the aggregate is large: size the minimum against aggregate traffic, not any one tenant's.
  • When you're inheriting code — most existing breakers in mature codebases are this variant in disguise. Recognise the pattern when you see minimumNumberOfCalls or requestVolumeThreshold.

The volume gate isn't free

While the volume is below the gate, the breaker is blind. A genuinely broken downstream can fail every call and the breaker won't trip until enough requests accumulate inside the window; if traffic never reaches the minimum within one window, it never trips at all. For services where even small outages must trip immediately, lower the gate and live with the false trips.
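A back-of-the-envelope estimate of that blind period, as a sketch. It assumes steady traffic, a time-based window, and a downstream that fails every call; `blindSeconds` is an illustrative name, not a library function.

```typescript
// Approximate seconds of total failure before the volume gate can first be met.
// Returns Infinity when the window can never hold `minVolume` calls.
function blindSeconds(
  minVolume: number,
  ratePerSec: number,
  windowSec: number,
): number {
  // At a steady rate the window holds at most ratePerSec * windowSec calls.
  if (ratePerSec * windowSec < minVolume) return Infinity;
  return minVolume / ratePerSec;
}

console.log(blindSeconds(20, 1, 10)); // Infinity: 10 calls per window never reach 20
console.log(blindSeconds(20, 8, 10)); // 2.5
console.log(blindSeconds(3, 1, 10));  // 3
```

The first case is the one to watch for: on a quiet endpoint, a gate sized for busy traffic does not slow the breaker down, it disables it.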

Pros

  • Filters out tiny-sample noise — the single biggest source of false trips, gone.
  • Battle-tested defaults — Hystrix's 20-req / 10s / 50% has shipped in production for over a decade.
  • Composes with both Count- and Time-Based windows — most libraries let you pick either and add the volume gate on top.
  • The simplest HALF-OPEN logic — one trial call, two outcomes. Easier to reason about than N-trial averaging.

Cons

  • Slow to react on low-volume endpoints — the volume gate is also a delay before tripping. A genuinely broken service can serve 19 errors before anyone hears the breaker click.
  • One trial is a lot of responsibility — a single unlucky failure during half-open re-opens you. Some libraries make this configurable (Resilience4j) for exactly this reason.
  • More knobs — window, volume, error %, wait — four numbers to tune instead of two. Wrong combinations are easy.
  • Doesn't catch latency degradations — error % is blind to slow successes. Pair with a slow-call-rate rule for downstreams that hang.
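For the last point, a slow-call-rate rule can reuse the same two-gate shape, counting slow calls instead of failed ones. The sketch below is an assumption-heavy illustration: the `CallRecord` shape, the function name, and the 1-second and 50% values are all hypothetical and should be tuned to your own latency baseline.

```typescript
interface CallRecord { ok: boolean; durationMs: number }

// Trip candidate when enough calls were seen AND too many of them were slow,
// regardless of whether they eventually succeeded.
function slowCallRateExceeded(
  calls: CallRecord[],
  minVolume = 20,            // same volume gate as the error rule
  slowThresholdMs = 1_000,   // hypothetical: what counts as "slow"
  slowRateThreshold = 0.5,   // hypothetical: share of slow calls that trips
): boolean {
  if (calls.length < minVolume) return false;
  const slow = calls.filter(c => c.durationMs >= slowThresholdMs).length;
  return slow / calls.length >= slowRateThreshold;
}
```

A breaker would then trip when either the error-percentage rule or this rule fires, so a downstream that hangs but eventually returns 200 no longer slips through.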

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

error-percentage.ts
// Hystrix-style breaker: trip only when (a) request volume crosses a minimum
// AND (b) error rate crosses the threshold. The volume gate is the whole point.
class ErrorPercentageBreaker {
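  private state: "CLOSED" | "OPEN" | "HALF" = "CLOSED";
  private calls: Array<{ t: number; ok: boolean }> = [];
  private openUntil = 0;
  private trialInFlight = false;      // true while the half-open trial is running

  constructor(
    private windowMs = 10_000,
    private minVolume = 20,           // the gate
    private errorThreshold = 0.5,
    private cooldownMs = 5_000,
  ) {}

  async call<T>(work: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN" && Date.now() >= this.openUntil) this.state = "HALF";
    if (this.state === "OPEN") throw new Error("circuit open");

    // HALF-OPEN admits exactly one trial; reject everything else until it settles.
    const isTrial = this.state === "HALF";
    if (isTrial) {
      if (this.trialInFlight) throw new Error("circuit open");
      this.trialInFlight = true;
    }

    try {
      const result = await work();
      this.onResult(true, isTrial);
      return result;
    } catch (err) {
      this.onResult(false, isTrial);
      throw err;
    }
  }

  private onResult(ok: boolean, isTrial: boolean) {
    const now = Date.now();

    if (isTrial) {
      // Single trial call decides everything.
      this.trialInFlight = false;
      if (ok) { this.state = "CLOSED"; this.calls = []; }
      else    { this.state = "OPEN"; this.openUntil = now + this.cooldownMs; }
      return;
    }

    // A call admitted while CLOSED may settle after the breaker has already
    // tripped; ignore it so it cannot extend the cooldown or decide the trial.
    if (this.state !== "CLOSED") return;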

    this.calls.push({ t: now, ok });
    this.prune(now);

    const total = this.calls.length;
    if (total < this.minVolume) return;           // GATE 1: volume

    const fails = this.calls.filter(c => !c.ok).length;
    if (fails / total < this.errorThreshold) return;  // GATE 2: error %

    this.state = "OPEN";
    this.openUntil = now + this.cooldownMs;
  }

  private prune(now: number) {
    const lo = now - this.windowMs;
    while (this.calls.length && this.calls[0].t < lo) this.calls.shift();
  }
}

Knowledge check

Did the prototype land?

question 01 / 03

Why does this variant add a 'minimum number of calls' check on top of the error percentage?

question 02 / 03

Errors are at 100% but traffic is 1 request per second and the minimum is 20 in a 10s window. What happens?

question 03 / 03

How does half-open typically work in this variant?
