LLM rate limits explained

How RPM, TPM, and tier-based limits actually work across OpenAI, Anthropic, Groq, and others — and how to read the headers.

Published 1/19/2026

Every LLM provider rate-limits you, but the mechanism, the headers, and the granularity differ. Reading the actual headers on your responses will tell you more than any docs page.

The two units

Almost every provider rate-limits on two axes:

  • RPM — Requests Per Minute. How many HTTP calls you can make.
  • TPM — Tokens Per Minute. How many tokens (input + output combined) you can move.

You hit whichever ceiling comes first. A flood of small chats exhausts the RPM cap; a single long retrieval prompt can blow through the TPM cap in one call.
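
A quick way to reason about which axis binds first is to compare utilization of each. A minimal sketch, assuming tier limits of 5,000 RPM and 200,000 TPM (the same figures used in the header examples below):

```python
# Assumed per-minute limits for this sketch (check your own headers).
RPM_LIMIT = 5_000     # requests per minute
TPM_LIMIT = 200_000   # tokens per minute (input + output)

def binding_limit(requests_per_min: float, avg_tokens_per_request: float) -> str:
    """Return which limit a steady workload exhausts first."""
    rpm_util = requests_per_min / RPM_LIMIT
    tpm_util = (requests_per_min * avg_tokens_per_request) / TPM_LIMIT
    return "RPM" if rpm_util >= tpm_util else "TPM"

# Many tiny chats: request count is the bottleneck.
print(binding_limit(4_000, 30))      # -> RPM
# A few huge retrieval prompts: token volume is the bottleneck.
print(binding_limit(20, 50_000))     # -> TPM
```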

The headers

OpenAI and OpenAI-compatible providers (Groq, Together, Fireworks, OpenRouter) return:

x-ratelimit-limit-requests: 5000
x-ratelimit-remaining-requests: 4998
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 200000
x-ratelimit-remaining-tokens: 198432
x-ratelimit-reset-tokens: 6m0s

Anthropic uses a different naming convention:

anthropic-ratelimit-tokens-limit: 200000
anthropic-ratelimit-tokens-remaining: 198432
anthropic-ratelimit-tokens-reset: 2026-01-15T12:34:56Z
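
Note the reset formats differ too: OpenAI-style headers use Go-style durations ("12s", "6m0s"), while Anthropic returns an RFC 3339 timestamp. A minimal sketch of reading headroom from either convention (header names as shown above; the helper names are my own):

```python
import re

def parse_reset(value: str) -> float:
    """Parse OpenAI-style reset durations like '12s', '6m0s', '250ms' into seconds."""
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    total = 0.0
    for amount, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value):
        total += float(amount) * units[unit]
    return total

def remaining_tokens(headers: dict) -> int:
    """Read remaining token budget from either naming convention."""
    for key in ("x-ratelimit-remaining-tokens", "anthropic-ratelimit-tokens-remaining"):
        if key in headers:
            return int(headers[key])
    raise KeyError("no token rate-limit header found")

print(parse_reset("6m0s"))                                            # 360.0
print(remaining_tokens({"x-ratelimit-remaining-tokens": "198432"}))   # 198432
```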


Tiers

OpenAI bumps your tier — and your rate limits — based on cumulative spend. The first $5 unlocks tier 1; subsequent thresholds unlock 2, 3, 4, 5. Same key, more headroom over time. Anthropic uses workspace-level limits you can request increases on.

What 429 actually means

A 429 Too Many Requests can mean three different things:

  1. You burst past RPM. Wait the reset window.
  2. You burst past TPM on a single huge call. Trim the prompt.
  3. The provider is shedding load globally and 429ing everyone. Check the status page.

Don't blindly retry on 429. Back off exponentially with jitter, and respect the retry-after header if present.
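
A minimal sketch of that retry policy, assuming the request callable returns a (status, headers, body) tuple; the shape of `call` and the function names are assumptions, not any provider's SDK:

```python
import random
import time

def retry_with_backoff(call, max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Retry `call` on 429 with exponential backoff and full jitter,
    honoring the server's retry-after hint when present."""
    for attempt in range(max_retries):
        status, headers, body = call()
        if status != 429:
            return body
        retry_after = headers.get("retry-after")
        if retry_after is not None:
            delay = float(retry_after)                              # trust the server's hint
        else:
            delay = random.uniform(0, min(cap, base * 2 ** attempt))  # full jitter
        time.sleep(delay)
    raise RuntimeError("still rate limited after retries")
```

Full jitter (a random delay between 0 and the exponential ceiling) spreads retries out so a fleet of clients doesn't re-stampede the provider in lockstep.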

Stopping the bleed

If you're hitting limits in production, the order of escalation is:

  1. Batch and cache requests.
  2. Move to a higher tier.
  3. Request a limit increase.
  4. Move that workload to a different provider.
  5. Actually fix your code if you're calling 5× too often.
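
Step one is usually the cheapest win. A minimal sketch of an in-process response cache, so identical prompts only spend RPM/TPM once (`complete` here stands in for whatever provider call you make; it is hypothetical):

```python
import hashlib

_cache: dict[str, str] = {}

def cached_complete(prompt: str, complete) -> str:
    """Return a cached completion for a prompt, calling the provider only on a miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = complete(prompt)   # only the first occurrence hits the API
    return _cache[key]
```

In production you would bound the cache size and add a TTL, but even this shape eliminates the duplicate-prompt traffic that commonly drives 429s.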

What to do next

Read about rotating keys — high-throughput workloads sometimes split across multiple keys to multiply the effective limit, which only works if you can rotate cleanly.
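
The rotation itself can be as simple as round-robin. A hypothetical sketch, which is only sound if the provider enforces limits per key rather than per organization:

```python
import itertools

class KeyRotator:
    """Cycle through API keys so each one stays under its own per-key limit."""
    def __init__(self, keys):
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)

rotator = KeyRotator(["sk-key-a", "sk-key-b", "sk-key-c"])
print(rotator.next_key())  # sk-key-a
print(rotator.next_key())  # sk-key-b
```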