Streaming chat completions
Streaming reduces time-to-first-token and delivers partial output as it is generated.
Use the OpenAI SDK with stream: true and iterate over chunks to render tokens immediately.
Streaming is ideal for chat UIs, typing indicators, and long-form generation where early feedback matters.
Quick start
Example request
import OpenAI from "openai";

// Point the SDK at the Abliteration endpoint.
const client = new OpenAI({
  apiKey: process.env.ABLIT_KEY,
  baseURL: "https://api.abliteration.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "abliterated-model",
  messages: [{ role: "user", content: "Write a short haiku about the ocean." }],
  stream: true,
});

// Each chunk carries a delta; write tokens to stdout as they arrive.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Service notes
- Pricing: usage-based (~$5 per 1M tokens), billed on total tokens (input + output). See the API pricing page for current plans.
- Data retention: No prompt/output retention by default. Operational telemetry (token counts, timestamps, error codes) is retained for billing and reliability.
- Compatibility: OpenAI-style /v1/chat/completions request and response format; only the base URL changes.
- Latency: Depends on model size, prompt length, and load. Streaming reduces time-to-first-token.
- Throughput: Team plans include priority throughput. Actual throughput varies with demand.
- Rate limits: Limits vary by plan and load. Handle 429s with backoff and respect any Retry-After header.
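
A minimal backoff sketch for 429s. It assumes the thrown error exposes the HTTP status and response headers the way the openai SDK's APIError does; field names may differ across SDK versions:

async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Only retry rate-limit errors, and only up to maxRetries times.
      if (err?.status !== 429 || attempt >= maxRetries) throw err;
      // Prefer the server's Retry-After (seconds); otherwise back off exponentially.
      const retryAfter = Number(err.headers?.["retry-after"]);
      const delayMs = Number.isFinite(retryAfter)
        ? retryAfter * 1000
        : Math.min(1000 * 2 ** attempt, 30_000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

Wrap any call, for example withBackoff(() => client.chat.completions.create({ ... })).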
When to stream
Stream when you want faster perceived latency or to show partial output.
- Chat UIs that show tokens as they arrive.
- Long responses where users benefit from early content.
- Workflows that may cancel early once enough output is seen.
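
An early-exit sketch using the quick-start client. The long prompt and the 500-character cutoff are illustrative; stream.controller is the AbortController the openai Node SDK attaches to its streams, which may differ across SDK versions:

const stream = await client.chat.completions.create({
  model: "abliterated-model",
  messages: [{ role: "user", content: "List 50 facts about the ocean." }],
  stream: true,
});

let text = "";
for await (const chunk of stream) {
  text += chunk.choices[0]?.delta?.content || "";
  // Stop once enough output has been seen; 500 characters is an
  // arbitrary example threshold.
  if (text.length > 500) {
    stream.controller.abort(); // explicitly cancel the underlying request
    break;
  }
}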
How streaming works
The response is sent as a series of chunks. Each chunk contains a delta that you append to the final message.
- Set stream: true in the request body.
- Consume the async iterator (SDK) or the raw HTTP stream (fetch); a fetch sketch follows this list.
- Accumulate delta content until the stream ends.
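
For comparison, a raw fetch sketch with no SDK. It assumes the endpoint emits OpenAI-style server-sent events (data: {...} lines terminated by data: [DONE]), consistent with the compatibility note above:

const res = await fetch("https://api.abliteration.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.ABLIT_KEY}`,
  },
  body: JSON.stringify({
    model: "abliterated-model",
    messages: [{ role: "user", content: "Write a short haiku about the ocean." }],
    stream: true,
  }),
});
if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let message = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  // SSE events arrive line by line; keep any partial line in the buffer.
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? "";
  for (const line of lines) {
    const data = line.replace(/^data: /, "").trim();
    if (!data || data === "[DONE]") continue;
    const delta = JSON.parse(data).choices[0]?.delta?.content;
    if (delta) message += delta;
  }
}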
Python streaming example
The Python SDK yields chunks you can iterate over. Append delta content as it arrives.
from openai import OpenAI

# Point the SDK at the Abliteration endpoint.
client = OpenAI(
    base_url="https://api.abliteration.ai/v1",
    api_key="YOUR_ABLIT_KEY",
)

stream = client.chat.completions.create(
    model="abliterated-model",
    messages=[{"role": "user", "content": "Write a short haiku about the ocean."}],
    stream=True,
)

# Each chunk carries a delta; print tokens as they arrive.
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="")

UI and reliability tips
Streaming is best-effort over long-lived HTTP connections, so plan for reconnects and graceful fallbacks.
- Render a typing indicator before the first token arrives.
- Flush UI updates in small batches for smoother rendering.
- Cancel the request when the user navigates away.
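
A cancellation sketch, assuming a browser context and that the SDK accepts a standard AbortSignal via its per-request options (the openai Node SDK's second argument to create does; check your SDK version):

const controller = new AbortController();
// Abort the in-flight stream when the user leaves the page.
window.addEventListener("beforeunload", () => controller.abort());

const stream = await client.chat.completions.create(
  {
    model: "abliterated-model",
    messages: [{ role: "user", content: "Write a short haiku about the ocean." }],
    stream: true,
  },
  { signal: controller.signal },
);

let text = "";
try {
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content || "";
    // Render `text` into the UI here.
  }
} catch (err) {
  // An aborted stream surfaces as an error; ignore it only if we cancelled.
  if (!controller.signal.aborted) throw err;
}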
Common errors & fixes
- 401 Unauthorized: Check that your API key is set and sent as a Bearer token.
- 404 Not Found: Make sure the base URL ends with /v1 and you call /chat/completions.
- 400 Bad Request: Verify the model id and that messages are an array of { role, content } objects.
- 429 Rate limit: Back off and retry. Use the Retry-After header for pacing.
- No streaming output: Ensure stream: true and iterate over the async iterator from the SDK.
- Connection closed early: Check proxy timeouts and keep-alive settings. Streaming requires a long-lived HTTP connection.
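
If an intermediary keeps closing long-lived connections, one graceful fallback is to retry once without streaming. A sketch, where completeWithFallback is a hypothetical helper and client is the quick-start client:

async function completeWithFallback(
  messages: OpenAI.ChatCompletionMessageParam[],
): Promise<string> {
  try {
    const stream = await client.chat.completions.create({
      model: "abliterated-model",
      messages,
      stream: true,
    });
    let text = "";
    for await (const chunk of stream) {
      text += chunk.choices[0]?.delta?.content || "";
    }
    return text;
  } catch {
    // Fall back to a single non-streaming request (any partial text is
    // discarded) and return the complete reply.
    const res = await client.chat.completions.create({
      model: "abliterated-model",
      messages,
      stream: false,
    });
    return res.choices[0]?.message?.content ?? "";
  }
}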