Understanding rate limits
Third-party APIs enforce rate limits to protect their infrastructure. Common limit types:
- Requests per second (burst limit) — typically 1–10 requests per second
- Requests per minute — common for REST APIs
- Daily quotas — often used for expensive operations like LLM API calls
- Concurrent connection limits
Exceeding a rate limit returns an HTTP 429 (Too Many Requests) response. Hitting the limit repeatedly may result in temporary or permanent API key suspension.
The token bucket implementation
For controlling your own outbound request rate, implement a token bucket algorithm. Tokens represent the right to make one API request. Tokens are added to the bucket at the API's allowed rate. A request consumes one token. If the bucket is empty, the request waits.
This smooths request traffic and ensures you never exceed the API's limits, even during burst conditions.
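The token bucket described above can be sketched in a few lines. This is a minimal, thread-safe single-process version; the class name and interface are illustrative, not from a particular library.

```python
import threading
import time

class TokenBucket:
    """Token bucket limiter: refills at `rate` tokens/sec, up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill based on elapsed time, capped at bucket capacity.
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.last_refill) * self.rate,
                )
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Time until the next whole token becomes available.
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)
```

Call `bucket.acquire()` immediately before each outbound request; bursts drain the bucket up to `capacity`, after which calls are paced at `rate` per second.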
Handling 429 responses
When you receive a 429:
1. Check for a Retry-After header — this tells you exactly how long to wait.
2. If no Retry-After is present, wait with exponential backoff (start at 1 second, double each retry).
3. Add jitter (a small random delay) to prevent retry storms when multiple workers hit the limit simultaneously.
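The steps above can be sketched as a retry wrapper. `send` is a stand-in for your HTTP call and is assumed to return a status code, a header dict, and a body; adapt it to whatever client you use.

```python
import random
import time

def request_with_retry(send, max_retries: int = 5):
    """Call `send()` -> (status, headers, body); retry on HTTP 429."""
    delay = 1.0  # initial backoff in seconds
    for _ in range(max_retries):
        status, headers, body = send()
        if status != 429:
            return status, body
        # Prefer the server's explicit Retry-After value when present.
        retry_after = headers.get("Retry-After")
        wait = float(retry_after) if retry_after is not None else delay
        # Jitter spreads out retries from workers that were limited together.
        wait += random.uniform(0, wait * 0.1)
        time.sleep(wait)
        delay *= 2  # exponential backoff for the next attempt
    raise RuntimeError("rate limited: retries exhausted")
```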
Queuing for high-volume workloads
For workloads that need to make thousands of API calls (bulk data sync, report generation, notification delivery), use a job queue with concurrency limits rather than making all calls in parallel. A queue gives you:
- Control over the request rate
- Automatic retry on failure
- Visibility into backlog and processing rate
- Graceful handling of API downtime
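A minimal in-process version of such a queue can be built with the standard library. This sketch uses a fixed worker pool for the concurrency limit and requeues failed jobs for automatic retry; a production deployment would typically use a persistent queue (e.g. Redis- or database-backed) so the backlog survives restarts. The function name and signature are illustrative.

```python
import queue
import threading

def run_with_concurrency(jobs, handler, workers: int = 4, max_attempts: int = 3):
    """Process `jobs` through `handler` with at most `workers` concurrent calls."""
    q: "queue.Queue" = queue.Queue()
    for job in jobs:
        q.put((job, 1))  # (payload, attempt number)
    results, failures = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job, attempt = q.get_nowait()
            except queue.Empty:
                return  # backlog drained
            try:
                result = handler(job)
                with lock:
                    results.append(result)
            except Exception:
                # Automatic retry: requeue until attempts are exhausted.
                if attempt < max_attempts:
                    q.put((job, attempt + 1))
                else:
                    with lock:
                        failures.append(job)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failures
```

The queue length at any moment is your backlog, and completed-jobs-per-second is your processing rate, which gives you the visibility the list above mentions.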
Caching to reduce API calls
Cache API responses where the data does not change frequently. A company's address from a data enrichment API, a user's account status from a CRM, or a product's price from a catalogue API — these do not change on every request. Even a 5-minute TTL cache can dramatically reduce API call volume for high-traffic integrations.
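A TTL cache of this kind needs only a dictionary and a clock. This is a single-process sketch (class and method names are illustrative); with multiple app servers you would typically back it with a shared store such as Redis instead.

```python
import time

class TTLCache:
    """In-memory cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value, or call `fetch()` and cache its result."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, still fresh
        value = fetch()  # e.g. the actual API call
        self._store[key] = (now + self.ttl, value)
        return value
```

With a 5-minute TTL (`TTLCache(ttl_seconds=300)`), repeated lookups of the same key within the window cost one API call instead of one per request.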