Rate limits and retries
Rate limits protect reliability and vary by plan, model, and load.
Handle 429 responses with backoff and honor any Retry-After header.
Use request queues and concurrency limits to smooth traffic spikes.
Quick start
Example request
The snippet below sends a chat completion request and retries 429 responses with capped exponential backoff, honoring Retry-After when present.
```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function chatWithRetry(body, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const res = await fetch("https://api.abliteration.ai/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": "Bearer " + process.env.ABLIT_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });
    // Only 429 is retried; any other status is returned to the caller.
    if (res.status !== 429) return res.json();
    // Honor Retry-After when present and positive; otherwise back off
    // exponentially, capped at 30 seconds. (A missing header parses to 0,
    // so the > 0 check is required to avoid retrying with no delay.)
    const retryAfter = Number(res.headers.get("Retry-After"));
    const backoffSeconds = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter
      : Math.min(2 ** attempt, 30);
    await sleep(backoffSeconds * 1000);
  }
  throw new Error("Rate limit exceeded");
}

// Top-level await requires an ES module context.
const result = await chatWithRetry({
  model: "abliterated-model",
  messages: [{ role: "user", content: "Give me three bullet points." }],
});
```
Service notes
- Pricing: usage-based (~$5 per 1M tokens), billed on total tokens (input + output). See the API pricing page for current plans.
- Data retention: No prompt/output retention by default. Operational telemetry (token counts, timestamps, error codes) is retained for billing and reliability.
- Compatibility: OpenAI-style /v1/chat/completions request and response format with a base URL switch.
- Latency: Depends on model size, prompt length, and load. Streaming reduces time-to-first-token.
- Throughput: Team plans include priority throughput. Actual throughput varies with demand.
- Rate limits: Limits vary by plan and load. Handle 429s with backoff and respect any Retry-After header.
How rate limits apply
Limits are usually enforced as per-minute budgets for requests and tokens. Exact limits can vary by plan or model.
- Short requests still count toward request limits.
- Long prompts and long outputs consume more token budget.
- Parallel requests share the same limit window.
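The budget model above can be tracked client-side. Here is a minimal sketch of a sliding-window tracker for per-minute request and token budgets; the default limit values are illustrative placeholders, not published quotas.

```js
// Sliding-window budget tracker. The default limits are illustrative
// placeholders; check your plan for real quotas.
class MinuteBudget {
  constructor({ maxRequests = 60, maxTokens = 100000 } = {}) {
    this.maxRequests = maxRequests;
    this.maxTokens = maxTokens;
    this.events = []; // { at, tokens } entries within the last minute
  }

  // Drop entries older than 60 seconds.
  prune(now = Date.now()) {
    const cutoff = now - 60_000;
    this.events = this.events.filter((e) => e.at > cutoff);
  }

  // Returns true (and records the spend) if a request costing `tokens`
  // fits both the request budget and the token budget for this window.
  tryConsume(tokens, now = Date.now()) {
    this.prune(now);
    if (this.events.length >= this.maxRequests) return false;
    const usedTokens = this.events.reduce((sum, e) => sum + e.tokens, 0);
    if (usedTokens + tokens > this.maxTokens) return false;
    this.events.push({ at: now, tokens });
    return true;
  }
}
```

A caller would estimate the token cost of a request (prompt plus expected output), call `tryConsume`, and queue or delay the request when it returns false.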
Headers to monitor
Check response headers for guidance on pacing. Some headers may be provider-specific.
- Retry-After for recommended wait time after a 429.
- x-ratelimit-* headers, if provided, for remaining capacity.
- Request or trace ids for debugging with support.
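As a sketch of how these headers can be read in one place, the helper below collects pacing hints from a fetch `Response`'s headers. The `x-ratelimit-*` and `x-request-id` names here are assumptions for illustration; actual header names are provider-specific.

```js
// Extract pacing hints from response headers. The x-ratelimit-* and
// x-request-id names are illustrative; real names vary by provider.
function readPacingHints(headers) {
  const num = (name) => {
    const v = headers.get(name);
    const n = Number(v);
    return v !== null && Number.isFinite(n) ? n : null;
  };
  return {
    retryAfterSeconds: num("retry-after"),
    remainingRequests: num("x-ratelimit-remaining-requests"),
    remainingTokens: num("x-ratelimit-remaining-tokens"),
    requestId: headers.get("x-request-id"), // include in support tickets
  };
}
```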
Backoff and retry strategy
Use exponential backoff with jitter and cap maximum delays for a smoother recovery.
- Respect Retry-After whenever it is present.
- Spread retries across workers to avoid thundering herds.
- Fail fast for non-429 errors and log them separately.
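One common way to implement the jitter above is "full jitter": draw the delay uniformly from zero up to the capped exponential ceiling, so concurrent clients naturally spread out. The base and cap below are illustrative defaults.

```js
// Full-jitter exponential backoff: the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)]. When the server sends a
// Retry-After header, use that value instead of this calculation.
function backoffDelayMs(attempt, { baseMs = 500, capMs = 30_000 } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```

Because each worker draws its own random delay, retries that start at the same moment do not all fire again at the same moment.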
Concurrency control
Queues and concurrency limits keep your traffic within budget and improve success rates.
- Limit concurrent requests per user or tenant.
- Batch low-priority work and run it off-peak.
- Use streaming for large responses to reduce user wait time.
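A concurrency cap can be as small as a promise-based semaphore. The sketch below limits in-flight tasks to a fixed maximum and queues the rest; the class name and API are illustrative, not part of this service.

```js
// A small promise-based semaphore that caps in-flight tasks.
class Limiter {
  constructor(maxConcurrent) {
    this.max = maxConcurrent;
    this.active = 0;
    this.queue = []; // resolvers for callers waiting on a free slot
  }

  // Runs `task` (an async function) once a slot is free, then releases
  // the slot and wakes the next waiter.
  async run(task) {
    if (this.active >= this.max) {
      await new Promise((resolve) => this.queue.push(resolve));
    }
    this.active += 1;
    try {
      return await task();
    } finally {
      this.active -= 1;
      const next = this.queue.shift();
      if (next) next();
    }
  }
}
```

Usage would look like `const limiter = new Limiter(4);` followed by `limiter.run(() => fetch(...))` for each request, keeping at most four requests in flight.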
Common errors & fixes
- 401 Unauthorized: Check that your API key is set and sent as a Bearer token.
- 404 Not Found: Make sure the base URL ends with /v1 and you call /chat/completions.
- 400 Bad Request: Verify the model id and that messages are an array of { role, content } objects.
- 429 Rate limit: Back off and retry with jitter. Respect Retry-After when present.
- 503 Service unavailable: Retry with exponential backoff and reduce concurrency temporarily.
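The fixes above split into two groups: statuses worth retrying and statuses that need a code or configuration change. A small classifier makes that policy explicit; the action names below are illustrative.

```js
// Map a response status to a recommended action, mirroring the list
// above. The action names are illustrative, not part of the API.
function classifyStatus(status) {
  if (status === 429) return "retry-with-backoff"; // pace with Retry-After
  if (status === 401) return "check-credentials";  // missing/bad API key
  if (status === 400 || status === 404) return "fix-request"; // don't retry
  if (status >= 500) return "retry-with-backoff";  // transient server error
  return "ok";
}
```

Keeping retryable and non-retryable failures separate avoids burning retry budget on requests that can never succeed unchanged.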