Rate limits and retries
Rate limits protect reliability and vary by plan, model, and load.
Handle 429 responses with backoff and honor any Retry-After header.
Use request queues and concurrency limits to smooth traffic spikes.
Quick start
Example request
The snippet below sends a chat completion request and retries 429 responses with capped exponential backoff, honoring Retry-After when present.
```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function chatWithRetry(body, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const res = await fetch("https://api.abliteration.ai/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": "Bearer " + process.env.ABLIT_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });
    // Only 429 is retried; any other status is returned to the caller.
    if (res.status !== 429) return res.json();
    // Honor Retry-After when present and positive; otherwise back off
    // exponentially, capped at 30 seconds. (A missing header parses to 0,
    // so the > 0 check is required to avoid retrying with no delay.)
    const retryAfter = Number(res.headers.get("Retry-After"));
    const backoffSeconds = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter
      : Math.min(2 ** attempt, 30);
    await sleep(backoffSeconds * 1000);
  }
  throw new Error("Rate limit exceeded");
}

// Top-level await requires an ES module context.
const result = await chatWithRetry({
  model: "abliterated-model",
  messages: [{ role: "user", content: "Give me three bullet points." }],
});
```
Service notes
- Pricing: usage-based (~$5 per 1M tokens), billed on total tokens (input + output). See the API pricing page for current plans.
- Data retention: No prompt/output retention by default. Operational telemetry (token counts, timestamps, error codes) is retained for billing and reliability.
- Compatibility: OpenAI-style /v1/chat/completions request and response format with a base URL switch.
- Latency: Depends on model size, prompt length, and load. Streaming reduces time-to-first-token.
- Throughput: Team plans include priority throughput. Actual throughput varies with demand.
- Rate limits: Limits vary by plan and load. Handle 429s with backoff and respect any Retry-After header.
How rate limits apply
Limits are usually enforced as per-minute budgets for requests and tokens. Exact limits can vary by plan or model.
- Short requests still count toward request limits.
- Long prompts and long outputs consume more token budget.
- Parallel requests share the same limit window.
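The budget model above can be tracked client-side. Here is a minimal sketch of a sliding-window tracker for per-minute request and token budgets; the default limit values are illustrative placeholders, not published quotas.

```js
// Sliding-window budget tracker. The default limits are illustrative
// placeholders; check your plan for real quotas.
class MinuteBudget {
  constructor({ maxRequests = 60, maxTokens = 100000 } = {}) {
    this.maxRequests = maxRequests;
    this.maxTokens = maxTokens;
    this.events = []; // { at, tokens } entries within the last minute
  }

  // Drop entries older than 60 seconds.
  prune(now = Date.now()) {
    const cutoff = now - 60_000;
    this.events = this.events.filter((e) => e.at > cutoff);
  }

  // Returns true (and records the spend) if a request costing `tokens`
  // fits both the request budget and the token budget for this window.
  tryConsume(tokens, now = Date.now()) {
    this.prune(now);
    if (this.events.length >= this.maxRequests) return false;
    const usedTokens = this.events.reduce((sum, e) => sum + e.tokens, 0);
    if (usedTokens + tokens > this.maxTokens) return false;
    this.events.push({ at: now, tokens });
    return true;
  }
}
```

A caller would estimate the token cost of a request (prompt plus expected output), call `tryConsume`, and queue or delay the request when it returns false.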
Headers to monitor
Check response headers for guidance on pacing. Some headers may be provider-specific.
- Retry-After for recommended wait time after a 429.
- x-ratelimit-* headers, if provided, for remaining capacity.
- Request or trace ids for debugging with support.
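As a sketch of how these headers can be read in one place, the helper below collects pacing hints from a fetch `Response`'s headers. The `x-ratelimit-*` and `x-request-id` names here are assumptions for illustration; actual header names are provider-specific.

```js
// Extract pacing hints from response headers. The x-ratelimit-* and
// x-request-id names are illustrative; real names vary by provider.
function readPacingHints(headers) {
  const num = (name) => {
    const v = headers.get(name);
    const n = Number(v);
    return v !== null && Number.isFinite(n) ? n : null;
  };
  return {
    retryAfterSeconds: num("retry-after"),
    remainingRequests: num("x-ratelimit-remaining-requests"),
    remainingTokens: num("x-ratelimit-remaining-tokens"),
    requestId: headers.get("x-request-id"), // include in support tickets
  };
}
```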
Backoff and retry strategy
Use exponential backoff with jitter and cap maximum delays for a smoother recovery.
- Respect Retry-After whenever it is present.
- Spread retries across workers to avoid thundering herds.
- Fail fast for non-429 errors and log them separately.
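One common way to implement the jitter above is "full jitter": draw the delay uniformly from zero up to the capped exponential ceiling, so concurrent clients naturally spread out. The base and cap below are illustrative defaults.

```js
// Full-jitter exponential backoff: the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)]. When the server sends a
// Retry-After header, use that value instead of this calculation.
function backoffDelayMs(attempt, { baseMs = 500, capMs = 30_000 } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```

Because each worker draws its own random delay, retries that start at the same moment do not all fire again at the same moment.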
Concurrency control
Queues and concurrency limits keep your traffic within budget and improve success rates.
- Limit concurrent requests per user or tenant.
- Batch low-priority work and run it off-peak.
- Use streaming for large responses to reduce user wait time.
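A concurrency cap can be as small as a promise-based semaphore. The sketch below limits in-flight tasks to a fixed maximum and queues the rest; the class name and API are illustrative, not part of this service.

```js
// A small promise-based semaphore that caps in-flight tasks.
class Limiter {
  constructor(maxConcurrent) {
    this.max = maxConcurrent;
    this.active = 0;
    this.queue = []; // resolvers for callers waiting on a free slot
  }

  // Runs `task` (an async function) once a slot is free, then releases
  // the slot and wakes the next waiter.
  async run(task) {
    if (this.active >= this.max) {
      await new Promise((resolve) => this.queue.push(resolve));
    }
    this.active += 1;
    try {
      return await task();
    } finally {
      this.active -= 1;
      const next = this.queue.shift();
      if (next) next();
    }
  }
}
```

Usage would look like `const limiter = new Limiter(4);` followed by `limiter.run(() => fetch(...))` for each request, keeping at most four requests in flight.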
Common errors & fixes
- 401 Unauthorized: Check that your API key is set and sent as a Bearer token.
- 404 Not Found: Make sure the base URL ends with /v1 and you call /chat/completions.
- 400 Bad Request: Verify the model id and that messages are an array of { role, content } objects.
- 429 Rate limit: Back off and retry with jitter. Respect Retry-After when present.
- 503 Service unavailable: Retry with exponential backoff and reduce concurrency temporarily.
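The fixes above split into two groups: statuses worth retrying and statuses that need a code or configuration change. A small classifier makes that policy explicit; the action names below are illustrative.

```js
// Map a response status to a recommended action, mirroring the list
// above. The action names are illustrative, not part of the API.
function classifyStatus(status) {
  if (status === 429) return "retry-with-backoff"; // pace with Retry-After
  if (status === 401) return "check-credentials";  // missing/bad API key
  if (status === 400 || status === 404) return "fix-request"; // don't retry
  if (status >= 500) return "retry-with-backoff";  // transient server error
  return "ok";
}
```

Keeping retryable and non-retryable failures separate avoids burning retry budget on requests that can never succeed unchanged.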