Multimodal LLM API (text + image + video)
OpenAI-compatible multimodal API accepting text, images, and short video on /v1/chat/completions.
A multimodal LLM API accepts more than one type of input — typically text, images, and short video — in a single chat-completions request and returns natural-language or structured output grounded in all of the inputs together.
abliteration.ai exposes a multimodal API that mirrors OpenAI Chat Completions, so existing SDKs work without changes beyond switching the base URL.
- Describe screenshots, diagrams, or scanned documents alongside text instructions.
- Combine a short video clip with a text prompt for grounded scene description.
- Extract structured data (tables, captions, counts) from images or video frames.
- Reduce round-trips by sending all the context the model needs in one request.
1. Send a single /v1/chat/completions request whose messages array contains user-role messages with multipart content.
2. Each message.content is an array that mixes parts: { "type": "text", "text" }, { "type": "image_url", "image_url": { "url" } }, { "type": "video_url", "video_url": { "url" } }.
3. Use HTTPS URLs or data: URLs. The backend fetches HTTPS URLs server-side and SSRF-guards both image and video URLs.
4. Authenticate with a JWT or API key — anon free-tier callers can attach images but not video.
5. Stream responses with stream: true; delta chunks contain text-only output regardless of input type.
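The same flow through the OpenAI Node SDK, as a minimal TypeScript sketch. The model name and media URLs are placeholders, and because the SDK's content-part types don't include video_url, the content array is cast to pass the video part through:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.ABLIT_KEY,
  baseURL: "https://api.abliteration.ai/v1",
});

const completion = await client.chat.completions.create({
  model: "abliterated-model",
  messages: [
    {
      role: "user",
      // The SDK types text and image_url parts but not video_url,
      // so the mixed array is cast; the wire format matches the curl below.
      content: [
        { type: "text", text: "Compare the still and the clip — what changed?" },
        { type: "image_url", image_url: { url: "https://example.com/before.jpg" } },
        { type: "video_url", video_url: { url: "https://example.com/after.mp4" } },
      ] as any,
    },
  ],
});

console.log(completion.choices[0].message.content);

The equivalent raw request with curl: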
curl https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "abliterated-model",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Compare the still and the clip — what changed?" },
          { "type": "image_url", "image_url": { "url": "https://example.com/before.jpg" } },
          { "type": "video_url", "video_url": { "url": "https://example.com/after.mp4" } }
        ]
      }
    ]
  }'

Frequently asked questions
What input types does the API accept?
text, image_url (image attachments), and video_url (video attachments — chat-completions only). Mix them in any order inside a single message.content array.
Which endpoints support which inputs?
/v1/chat/completions and /policy/chat/completions accept text + image + video. /v1/messages, /v1/responses, and their /policy/... siblings accept text + image only — video is rejected.
What are the size limits?
15 MB per image (PNG/JPEG/WEBP/GIF), 25 MB per video (MP4/WEBM/MOV). Total request body cap is 35 MB after base64 encoding.
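A client-side pre-check can catch oversized attachments before a wasted round-trip. A sketch under the assumption that the per-attachment caps apply to raw bytes while the 35 MB cap applies to the base64-encoded body, as stated above; checkAttachments is a hypothetical helper, not part of the API:

// Documented caps: 15 MB per image, 25 MB per video,
// 35 MB total request body after base64 encoding.
const MAX_IMAGE_BYTES = 15 * 1024 * 1024;
const MAX_VIDEO_BYTES = 25 * 1024 * 1024;
const MAX_BODY_BYTES = 35 * 1024 * 1024;

// base64 encodes every 3 raw bytes as 4 output characters.
function base64Size(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}

// Hypothetical helper: takes raw byte counts per attachment.
// The encoded total ignores the small JSON envelope overhead.
function checkAttachments(imageBytes: number[], videoBytes: number[]): void {
  for (const b of imageBytes) {
    if (b > MAX_IMAGE_BYTES) throw new Error(`image exceeds 15 MB cap (${b} bytes)`);
  }
  for (const b of videoBytes) {
    if (b > MAX_VIDEO_BYTES) throw new Error(`video exceeds 25 MB cap (${b} bytes)`);
  }
  const encoded = [...imageBytes, ...videoBytes]
    .reduce((sum, b) => sum + base64Size(b), 0);
  if (encoded > MAX_BODY_BYTES) {
    throw new Error(`encoded attachments exceed the 35 MB body cap (${encoded} bytes)`);
  }
}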
Are inputs moderated?
Text and images are sent through OpenAI's omni-moderation API server-side. Per-frame video moderation is planned but not yet implemented. Rejections return HTTP 400 with a moderation reason code.
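For handling a rejection, a hedged sketch with the Node SDK: the 400 status is documented above, but the exact field carrying the reason code in the error body is an assumption, so the sketch just logs the message.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.ABLIT_KEY,
  baseURL: "https://api.abliteration.ai/v1",
});

try {
  await client.chat.completions.create({
    model: "abliterated-model",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Describe this image." },
          { type: "image_url", image_url: { url: "https://example.com/photo.jpg" } },
        ],
      },
    ],
  });
} catch (err) {
  // Moderation rejections return HTTP 400; the SDK surfaces them as APIError.
  if (err instanceof OpenAI.APIError && err.status === 400) {
    // The reason code lives in the error body; its exact field name is
    // an assumption here, so log the whole message.
    console.error("Rejected by moderation:", err.message);
  } else {
    throw err;
  }
}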
Can I send multiple images and a video in the same request?
Yes. There's no fixed cap on parts per message — you're limited by the 35 MB body cap and by per-attachment size limits. Latency goes up with each attachment.
Is the API OpenAI-compatible?
Yes. Use the OpenAI Python or Node SDK with baseURL pointing at https://api.abliteration.ai/v1 — existing chat-completions code works unmodified for text + image. Video reuses the same part shape via video_url; the official SDK types don't cover it, so pass that part through untyped (e.g., with a cast).
Does the anon free tier work?
Yes for text and images, no for video. Anon callers send X-Free-Tier: true and get one free call; video always requires authentication, while text + image requests stay in the free tier.
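A minimal anonymous call as a TypeScript fetch sketch (Node 18+), assuming only what's stated above: no Authorization header, the X-Free-Tier: true header, and text + image parts. The image URL is a placeholder.

const res = await fetch("https://api.abliteration.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-Free-Tier": "true", // anon free tier: one free call, no video parts
  },
  body: JSON.stringify({
    model: "abliterated-model",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What does this chart show?" },
          { type: "image_url", image_url: { url: "https://example.com/chart.png" } },
        ],
      },
    ],
  }),
});

const data = await res.json();
console.log(data.choices[0].message.content);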
What about streaming?
Set stream: true. SSE delta chunks come back the same way regardless of input type — output is text. Time to first token is higher for image/video inputs because the backend samples and tokenizes media before generating.
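A streaming sketch with the Node SDK; the clip URL is a placeholder, and the video part is cast as in the earlier example because the SDK types don't cover video_url.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.ABLIT_KEY,
  baseURL: "https://api.abliteration.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "abliterated-model",
  stream: true,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Narrate what happens in this clip." },
        { type: "video_url", video_url: { url: "https://example.com/clip.mp4" } },
      ] as any,
    },
  ],
});

// Delta chunks are text-only regardless of input type.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}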