Intermediate

Rate Limiting

Control request throughput so a noisy client cannot starve everyone else. Compare the five canonical algorithms side-by-side.

distributedapithroughput

What is Rate Limiting?

The 60-second primer

Rate limiting is the practice of capping how many requests a client can make to a system in a given time window. Every API at scale uses it. It's the difference between a service that politely tells abusive clients to slow down and one that falls over when traffic spikes.

The contract is simple: every incoming request is checked against a quota. If the client is under the limit, the request goes through. If not, it's rejected — typically with an HTTP 429 Too Many Requests response and a header telling the client when to try again.

The interesting part isn't the contract — it's how you decide whether the limit was hit. There are five canonical algorithms, and each one trades off a different property: bursts vs. smoothness, memory vs. precision, simplicity vs. boundary correctness. Pick the wrong one and you either invite abuse or alienate your real users.

Why every API at scale needs it

  • Protect against abuse — brute-force login attempts, credential stuffing, scrapers hammering endpoints.
  • Fairness across clients — one noisy customer shouldn't starve the rest. Rate limits enforce a per-client share.
  • Cost control — when each request costs you money (a downstream API call, an LLM token, a database query), an unbounded client is unbounded spend.
  • Protect downstream systems — your database has a connection limit. Your payment processor has a TPS cap. Rate limiting at the edge stops chaos from reaching them.
  • Predictable capacity planning — if every client is bounded, your total load is bounded.

Where it lives

Rate limiters almost always sit at the edge — API gateway, reverse proxy, or service mesh — keyed on something like API key, user ID, or IP address. Putting it deeper in the stack means the work is already happening; putting it at the edge means the bad request never costs you anything.

Side-by-side

How they compare

The same concepts, on the same axes. Use this as a map; the individual pages are the territory.

01Token Bucket
Bursts
Allowed (up to capacity)
Precision
Exact
Memory
O(1) — 2 counters
Best for
Public APIs with friendly bursts
02Leaky Bucket
Bursts
Smoothed away
Precision
Exact
Memory
O(B) — bounded queue
Best for
Protecting slow downstreams
03Fixed Window Counter
Bursts
Up to 2× at boundaries
Precision
Coarse
Memory
O(1) — 1 counter
Best for
Hourly/daily quotas, simplicity
04Sliding Window Counter
Bursts
Smoothed
Precision
Approximate (~1% error)
Memory
O(1) — 2 counters
Best for
Edge networks at scale
05Sliding Window Log
Bursts
Smoothed
Precision
Exact (timestamp-level)
Memory
O(N) — every timestamp
Best for
Auth, low-rate security limits

Decision guide

Which one should you use?

A practical tour of when each algorithm wins.

Decision guide

  • You want bursts but a sustained capToken Bucket. The default choice for almost any public API.
  • You must protect a slow downstream from any spikeLeaky Bucket. Output is constant by construction.
  • You need a simple per-hour or per-day quotaFixed Window Counter. One Redis INCR per request.
  • You operate at edge scale and need smoothness without growing memorySliding Window Counter.
  • Precision matters more than throughput — login attempts, password resets, financial caps → Sliding Window Log.

When in doubt, start with Token Bucket

It's the most-deployed algorithm on the public internet for a reason. Easy to implement, easy to communicate ("X req/s with bursts up to Y"), and friendly to real user traffic.

Related tracks

If this one clicks, try these next.