Rate limits and retries

Learn how rate limits work and how to implement backoff, retries, and concurrency control.

Updated 2025-12-30

Rate limits protect reliability and vary by plan, model, and load.

Handle 429 responses with backoff and honor any Retry-After header.

Use request queues and concurrency limits to smooth traffic spikes.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function chatWithRetry(body, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const res = await fetch("https://api.abliteration.ai/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": "Bearer " + process.env.ABLIT_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });

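    if (res.status !== 429) return res.json();
    if (attempt === maxRetries) break; // no retries left, so skip the final sleep

    // headers.get() returns null when the header is absent, and Number(null) is 0,
    // so check for null first; otherwise a missing header means no backoff at all.
    const header = res.headers.get("Retry-After");
    const retryAfter = header === null ? NaN : Number(header);
    // Honor Retry-After when it is a number of seconds; otherwise fall back to
    // capped exponential backoff with jitter (50-100% of the nominal delay).
    const backoffSeconds = Number.isFinite(retryAfter)
      ? retryAfter
      : Math.min(2 ** attempt, 30) * (0.5 + Math.random() / 2);
    await sleep(backoffSeconds * 1000);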
  }

  throw new Error("Rate limit exceeded");
}

const result = await chatWithRetry({
  model: "abliterated-model",
  messages: [{ role: "user", content: "Give me three bullet points." }],
});

How rate limits apply

Limits are usually enforced as per-minute budgets for requests and tokens. Exact limits can vary by plan or model.
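One way to stay under a per-minute budget is to track recent usage on the client before sending. The sketch below is illustrative only: the class name MinuteBudget, the limits passed to it, and the idea of estimating tokens up front are assumptions, not part of the API, and real limits depend on your plan and model.

```javascript
// Hypothetical client-side tracker for per-minute request and token budgets.
// The limits passed in are placeholders, not real values for any plan or model.
class MinuteBudget {
  constructor({ requestsPerMinute, tokensPerMinute, now = Date.now }) {
    this.requestsPerMinute = requestsPerMinute;
    this.tokensPerMinute = tokensPerMinute;
    this.now = now; // injectable clock, useful for testing
    this.events = []; // one { time, tokens } entry per request in the last minute
  }

  // Drop entries that are 60 seconds old or older.
  prune() {
    const cutoff = this.now() - 60_000;
    while (this.events.length > 0 && this.events[0].time <= cutoff) {
      this.events.shift();
    }
  }

  // Records the request and returns true if it fits in both budgets;
  // returns false (and records nothing) otherwise.
  tryAcquire(estimatedTokens) {
    this.prune();
    const usedTokens = this.events.reduce((sum, e) => sum + e.tokens, 0);
    if (this.events.length >= this.requestsPerMinute) return false;
    if (usedTokens + estimatedTokens > this.tokensPerMinute) return false;
    this.events.push({ time: this.now(), tokens: estimatedTokens });
    return true;
  }
}
```

When tryAcquire returns false, the caller can wait briefly and try again instead of sending a request that is likely to come back as a 429.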

Headers to monitor

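Check response headers for pacing guidance. On a 429 response, Retry-After (when present) tells you how long to wait before retrying. Other rate-limit headers may be provider-specific, so do not depend on them being present.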
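As a rough sketch of reading Retry-After, the helper below handles both forms the HTTP specification allows: a number of seconds or an HTTP-date. The function name retryAfterMs is hypothetical, and which form this API actually sends is not stated here, so the code accepts either.

```javascript
// Returns the wait in milliseconds suggested by Retry-After, or null when the
// header is missing or unparseable. `headers` is anything with a get(name)
// method, such as the Headers object on a fetch Response.
function retryAfterMs(headers, now = Date.now()) {
  const raw = headers.get("Retry-After");
  if (raw == null) return null;
  const value = raw.trim();
  if (value === "") return null;

  // Form 1: a non-negative integer number of seconds.
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: an HTTP-date; wait until that time, never a negative duration.
  const date = Date.parse(value);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}
```

A null result means the header gave no usable guidance, so the caller should fall back to its own backoff schedule.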

Backoff and retry strategy

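Use exponential backoff with jitter, and cap the maximum delay. Jitter randomizes each wait so that clients throttled at the same moment do not all retry together, and the cap keeps later retries from waiting unreasonably long.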
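A minimal sketch of one common variant, often called "full jitter", picks a random delay between zero and the capped exponential ceiling. The function name backoffDelayMs and its defaults (1 second base, 30 second cap) are assumptions chosen for illustration, not values prescribed by the API.

```javascript
// Capped exponential backoff with full jitter.
// attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt,
  { baseMs = 1000, capMs = 30_000, random = Math.random } = {}
) {
  // The ceiling doubles each attempt until it reaches the cap.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: a uniformly random delay in [0, ceiling).
  return Math.floor(random() * ceiling);
}
```

Full jitter spreads retries widely at the cost of occasionally retrying very soon. If you want a guaranteed minimum wait, scale the ceiling by a factor between 0.5 and 1 instead.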

Concurrency control

Queues and concurrency limits keep your traffic within budget and improve success rates.
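A concurrency limit can be sketched as a small queue that runs at most a fixed number of tasks at once. The helper name createLimiter is hypothetical, and the right limit depends on your plan and workload.

```javascript
// Returns a function that schedules async tasks, running at most
// maxConcurrent of them at the same time. Extra tasks wait in a FIFO queue.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];

  // Start the next queued task if there is a free slot.
  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active += 1;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task) // also catches synchronous throws from task
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next();
      });
  };

  // Enqueue a task; the returned promise settles with the task's outcome.
  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}
```

For example, after const limit = createLimiter(4), wrapping each call as limit(() => chatWithRetry(body)) keeps no more than four requests in flight while the rest wait their turn.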