Streaming chat completions
Streaming reduces time-to-first-token and delivers partial output as it is generated.
Use the OpenAI SDK with stream: true and iterate over chunks to render tokens immediately.
Streaming is ideal for chat UIs, typing indicators, and long-form generation where early feedback matters.
Quick start
Example request
import OpenAI from "openai";

// Point the SDK at the Abliteration endpoint.
const client = new OpenAI({
  apiKey: process.env.ABLIT_KEY,
  baseURL: "https://api.abliteration.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "abliterated-model",
  messages: [{ role: "user", content: "Write a short haiku about the ocean." }],
  stream: true,
});

// Each chunk carries a delta; write tokens to stdout as they arrive.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Service notes
- Pricing: usage-based (~$5 per 1M tokens), billed on total tokens (input + output). See the API pricing page for current plans.
- Data retention: No prompt/output retention by default. Operational telemetry (token counts, timestamps, error codes) is retained for billing and reliability.
- Compatibility: OpenAI-style /v1/chat/completions request and response format; only the base URL changes.
- Latency: Depends on model size, prompt length, and load. Streaming reduces time-to-first-token.
- Throughput: Team plans include priority throughput. Actual throughput varies with demand.
- Rate limits: Limits vary by plan and load. Handle 429s with backoff and respect any Retry-After header.
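
A minimal backoff sketch for 429s. It assumes the thrown error exposes the HTTP status and response headers the way the openai SDK's APIError does; field names may differ across SDK versions:

async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Only retry rate-limit errors, and only up to maxRetries times.
      if (err?.status !== 429 || attempt >= maxRetries) throw err;
      // Prefer the server's Retry-After (seconds); otherwise back off exponentially.
      const retryAfter = Number(err.headers?.["retry-after"]);
      const delayMs = Number.isFinite(retryAfter)
        ? retryAfter * 1000
        : Math.min(1000 * 2 ** attempt, 30_000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

Wrap any call, for example withBackoff(() => client.chat.completions.create({ ... })).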
When to stream
Stream when you want faster perceived latency or to show partial output.
- Chat UIs that show tokens as they arrive.
- Long responses where users benefit from early content.
- Workflows that may cancel early once enough output is seen.
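
An early-exit sketch using the quick-start client. The long prompt and the 500-character cutoff are illustrative; stream.controller is the AbortController the openai Node SDK attaches to its streams, which may differ across SDK versions:

const stream = await client.chat.completions.create({
  model: "abliterated-model",
  messages: [{ role: "user", content: "List 50 facts about the ocean." }],
  stream: true,
});

let text = "";
for await (const chunk of stream) {
  text += chunk.choices[0]?.delta?.content || "";
  // Stop once enough output has been seen; 500 characters is an
  // arbitrary example threshold.
  if (text.length > 500) {
    stream.controller.abort(); // explicitly cancel the underlying request
    break;
  }
}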
How streaming works
The response is sent as a series of chunks. Each chunk contains a delta that you append to the final message.
- Set stream: true in the request body.
- Consume the async iterator (SDK) or the raw HTTP stream (fetch); a fetch sketch follows this list.
- Accumulate delta content until the stream ends.
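
For comparison, a raw fetch sketch with no SDK. It assumes the endpoint emits OpenAI-style server-sent events (data: {...} lines terminated by data: [DONE]), consistent with the compatibility note above:

const res = await fetch("https://api.abliteration.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.ABLIT_KEY}`,
  },
  body: JSON.stringify({
    model: "abliterated-model",
    messages: [{ role: "user", content: "Write a short haiku about the ocean." }],
    stream: true,
  }),
});
if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let message = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  // SSE events arrive line by line; keep any partial line in the buffer.
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? "";
  for (const line of lines) {
    const data = line.replace(/^data: /, "").trim();
    if (!data || data === "[DONE]") continue;
    const delta = JSON.parse(data).choices[0]?.delta?.content;
    if (delta) message += delta;
  }
}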
Python streaming example
The Python SDK yields chunks you can iterate over. Append delta content as it arrives.
from openai import OpenAI

# Point the SDK at the Abliteration endpoint.
client = OpenAI(
    base_url="https://api.abliteration.ai/v1",
    api_key="YOUR_ABLIT_KEY",
)

stream = client.chat.completions.create(
    model="abliterated-model",
    messages=[{"role": "user", "content": "Write a short haiku about the ocean."}],
    stream=True,
)

# Each chunk carries a delta; print tokens as they arrive.
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="")

UI and reliability tips
Streaming is best-effort over long-lived HTTP connections, so plan for reconnects and graceful fallbacks.
- Render a typing indicator before the first token arrives.
- Flush UI updates in small batches for smoother rendering.
- Cancel the request when the user navigates away.
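
A cancellation sketch, assuming a browser context and that the SDK accepts a standard AbortSignal via its per-request options (the openai Node SDK's second argument to create does; check your SDK version):

const controller = new AbortController();
// Abort the in-flight stream when the user leaves the page.
window.addEventListener("beforeunload", () => controller.abort());

const stream = await client.chat.completions.create(
  {
    model: "abliterated-model",
    messages: [{ role: "user", content: "Write a short haiku about the ocean." }],
    stream: true,
  },
  { signal: controller.signal },
);

let text = "";
try {
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content || "";
    // Render `text` into the UI here.
  }
} catch (err) {
  // An aborted stream surfaces as an error; ignore it only if we cancelled.
  if (!controller.signal.aborted) throw err;
}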
Common errors & fixes
- 401 Unauthorized: Check that your API key is set and sent as a Bearer token.
- 404 Not Found: Make sure the base URL ends with /v1 and you call /chat/completions.
- 400 Bad Request: Verify the model id and that messages are an array of { role, content } objects.
- 429 Rate limit: Back off and retry. Use the Retry-After header for pacing.
- No streaming output: Ensure stream: true and iterate over the async iterator from the SDK.
- Connection closed early: Check proxy timeouts and keep-alive settings. Streaming requires a long-lived HTTP connection.
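
If an intermediary keeps closing long-lived connections, one graceful fallback is to retry once without streaming. A sketch, where completeWithFallback is a hypothetical helper and client is the quick-start client:

async function completeWithFallback(
  messages: OpenAI.ChatCompletionMessageParam[],
): Promise<string> {
  try {
    const stream = await client.chat.completions.create({
      model: "abliterated-model",
      messages,
      stream: true,
    });
    let text = "";
    for await (const chunk of stream) {
      text += chunk.choices[0]?.delta?.content || "";
    }
    return text;
  } catch {
    // Fall back to a single non-streaming request (any partial text is
    // discarded) and return the complete reply.
    const res = await client.chat.completions.create({
      model: "abliterated-model",
      messages,
      stream: false,
    });
    return res.choices[0]?.message?.content ?? "";
  }
}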