ReferenceUpdated 2025-12-30
LLM API rate limits
Definition and best practices for handling LLM API rate limits, retries, and backoff.
Rate limits cap request and token throughput to keep the service reliable.
Handle 429 responses with backoff and spread retries across workers.
Definition
LLM API rate limits
A rate limit is a provider-enforced cap on requests or tokens within a time window.
Why it matters
- Protects overall reliability and availability.
- Ensures fair use across teams and tenants.
- Prevents sudden traffic spikes from overloading the API.
How it works
- 01Track requests and token usage over time.
- 02Honor Retry-After headers when you receive 429s.
- 03Use queues and concurrency limits to smooth bursts.
- 04Monitor headers and logs for remaining capacity when available.
Example response
HTTP/1.1 429 Too Many Requests Retry-After: 2
FAQ
Frequently asked questions.
Are rate limits per API key or per account?
Limits are typically enforced per account or API key. Check your dashboard for exact limits.
Should I retry immediately after a 429?
No. Wait for Retry-After or use exponential backoff with jitter before retrying.
How can I avoid hitting limits?
Batch low-priority work, reduce parallelism, and cache repeated responses.