Vision and multimodal inputs
Send images and text to vision-capable models using the OpenAI-compatible message content array.
Vision-capable models accept images alongside text in the same request.
Use the OpenAI-compatible content array with type: "text" and type: "image_url" parts.
Choose a vision-capable model id from your models list before sending images.
import OpenAI from "openai";

// Point the OpenAI SDK at the abliteration.ai endpoint.
const client = new OpenAI({
  apiKey: process.env.ABLIT_KEY,
  baseURL: "https://api.abliteration.ai/v1",
});

const response = await client.chat.completions.create({
  model: "vision-model-id", // replace with a vision-capable model id from your models list
  messages: [
    {
      role: "user",
      // Mixed content: one text part followed by one image part.
      content: [
        { type: "text", text: "Describe the image in one sentence." },
        { type: "image_url", image_url: { url: "https://example.com/image.jpg" } },
      ],
    },
  ],
});

console.log(response.choices[0]?.message?.content);

Message content format
For vision inputs, set message.content to an array of parts that mixes text and image URLs.
{
  "role": "user",
  "content": [
    { "type": "text", "text": "What is in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
  ]
}

Image formats and limits
The backend accepts the four most common web image formats. Stay within the size cap and use either a public HTTPS URL or an inline base64 data URL.
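For the inline option, a data URL can be built from raw image bytes. The following is a minimal sketch for Node.js, not the service's own tooling: `toDataUrl` and `imagePartFromBytes` are hypothetical helper names, and the MIME type passed in (for example `image/png`) is an assumption that should match one of the formats the backend accepts.

```typescript
// Hypothetical helper: wrap raw image bytes in a base64 data URL.
// The caller supplies the MIME type; it is not detected from the bytes.
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Hypothetical helper: build an image_url content part from local bytes,
// in the same shape as the parts used with public HTTPS URLs.
function imagePartFromBytes(bytes: Uint8Array, mimeType: string) {
  return {
    type: "image_url" as const,
    image_url: { url: toDataUrl(bytes, mimeType) },
  };
}
```

In practice the bytes would typically come from `readFileSync` in `node:fs`, and the returned part goes into the message content array next to the text part. Base64 encoding inflates the payload by about a third, so an image close to the size cap may be better served from an HTTPS URL.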
Image token counting
Vision models count image tokens using Qwen2.5-VL's smart_resize: the image is divided into 28×28-pixel patches, so the token count scales with (height × width) / 784. Token cost grows with image dimensions, not file size: a 4 MB high-resolution photo can use far more tokens than a 200 KB downscaled JPEG.
Recommendation: downscale to 768 px on the longest side for general descriptions and to about 1280 px for OCR or fine-detail tasks. A 768×768 image uses roughly 750 tokens; a 1280×1280 image uses roughly 2,100 tokens.
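The arithmetic above can be sketched as a small estimator. This is an approximation that applies only the (height × width) / 784 rule from this page; the real smart_resize also rounds each side to a multiple of 28 and may clamp the total pixel count, so actual counts can differ slightly. `estimateImageTokens` and `fitLongestSide` are hypothetical helper names.

```typescript
// One image token covers a 28 x 28 pixel patch (784 pixels).
const PIXELS_PER_TOKEN = 28 * 28;

// Approximate token count from pixel dimensions alone.
// Ignores smart_resize's rounding to multiples of 28 and any pixel clamps.
function estimateImageTokens(width: number, height: number): number {
  return Math.round((width * height) / PIXELS_PER_TOKEN);
}

// Target dimensions after downscaling so the longest side is at most maxSide.
// Preserves aspect ratio and never upscales.
function fitLongestSide(
  width: number,
  height: number,
  maxSide: number,
): { width: number; height: number } {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

// A 4000 x 3000 photo downscaled for a general description:
const target = fitLongestSide(4000, 3000, 768); // { width: 768, height: 576 }
const tokens = estimateImageTokens(target.width, target.height); // 564
```

Running the estimate before upload makes the trade-off visible: the same photo at its original 4000×3000 size would come to roughly 15,000 tokens under this formula.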
Multiple images per request
You can include more than one image_url part in a single message. Each additional image adds latency and tokens — keep the count low (≤4) for predictable response times.
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Compare these screenshots." },
    { "type": "image_url", "image_url": { "url": "https://example.com/before.png" } },
    { "type": "image_url", "image_url": { "url": "https://example.com/after.png" } }
  ]
}

Streaming vision responses
Vision outputs can be streamed the same way as text outputs. See the streaming guide for UI patterns. Time to first token is higher for vision than for text-only because the backend tokenizes image patches before generating.
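A minimal sketch of consuming a streamed vision response follows. It assumes chunks in the OpenAI-compatible streaming shape, where each chunk carries an incremental `choices[0].delta.content`; `StreamChunk` and `collectStream` are hypothetical names declared here so the example does not depend on any SDK.

```typescript
// Minimal shape of a streamed chat completion chunk (only the fields used here).
type StreamChunk = {
  choices: Array<{ delta?: { content?: string | null } }>;
};

// Consume a chunk stream, forwarding each text delta to onDelta
// (for example, to append to a UI) and returning the full text at the end.
async function collectStream(
  chunks: AsyncIterable<StreamChunk>,
  onDelta: (text: string) => void = () => {},
): Promise<string> {
  let full = "";
  for await (const chunk of chunks) {
    // Some chunks carry no text (role-only or final chunks); skip those.
    const text = chunk.choices[0]?.delta?.content ?? "";
    if (text) {
      onDelta(text);
      full += text;
    }
  }
  return full;
}
```

With the OpenAI Node SDK, calling `client.chat.completions.create` with `stream: true` returns an async iterable of chunks that can be passed to `collectStream` directly. Because the first delta arrives later for image requests, a UI may want to show a loading state until `onDelta` fires for the first time.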
Image moderation
Every image is moderated server-side via OpenAI's omni-moderation API before reaching the model. Rejected images return HTTP 400 with error.code = "moderation_blocked" and the offending category in the error message.
The accompanying text prompt is moderated separately. If either fails, the request is rejected with a single 400 response.
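A client can tell a moderation rejection apart from other failures by checking the status and error code. The sketch below assumes the thrown error exposes `status` and `code` fields, as the OpenAI Node SDK's `APIError` does; `isModerationBlocked` is a hypothetical helper name.

```typescript
// Returns true when an error looks like a moderation rejection:
// HTTP 400 with error code "moderation_blocked".
// Assumption: the error object exposes `status` and `code` fields.
function isModerationBlocked(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { status?: unknown; code?: unknown };
  return e.status === 400 && e.code === "moderation_blocked";
}
```

A typical use is inside the `catch` block around the completion call: when `isModerationBlocked(err)` is true, surface the error message (which names the offending category) to the user instead of retrying, since resending the same image or prompt would presumably be rejected again.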
Anonymous and free-tier access
Image attachments are allowed for anonymous (X-Free-Tier: true) callers. The same per-IP free-tier quotas apply — see the pricing page. Video attachments are not allowed for anonymous callers — see the video docs.