Reference · Updated 2026-05-01

Multimodal LLM API (text + image + video)

OpenAI-compatible multimodal API accepting text, images, and short video on /v1/chat/completions.

A multimodal LLM API accepts more than one type of input — typically text, images, and short video — in the same request.

abliteration.ai exposes a multimodal API that mirrors OpenAI Chat Completions, so existing SDKs work without changes beyond switching the base URL.

Definition

Multimodal LLM API (text + image + video)

A multimodal LLM API accepts mixed input types — text, images, and short videos — in a single chat-completions request and returns natural-language or structured outputs grounded in all the inputs together.

Why it matters
  • Describe screenshots, diagrams, or scanned documents alongside text instructions.
  • Combine a short video clip with a text prompt for grounded scene description.
  • Extract structured data (tables, captions, counts) from images or video frames.
  • Reduce round-trips by sending all the context the model needs in one request.
How it works
  1. Send a single /v1/chat/completions request whose messages array contains user-role messages with multipart content.
  2. Each message.content is an array that mixes parts: { type: 'text', text }, { type: 'image_url', image_url: { url } }, { type: 'video_url', video_url: { url } }.
  3. Use HTTPS URLs or data: URLs (see the encoding sketch after this list). The backend fetches HTTPS URLs server-side and SSRF-guards both image and video URLs.
  4. Authenticate with a JWT or API key; anonymous free-tier callers can attach images but not video.
  5. Stream responses with stream: true; delta chunks contain text-only output regardless of input type.
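The data: URLs in step 3 can be built client-side. A minimal Python sketch, assuming a local file; the helper name and path are illustrative, and the data:<mime>;base64,<payload> shape is the standard data-URL format.

import base64
import mimetypes

def to_data_url(path: str) -> str:
    # Guess the MIME type from the extension (e.g. image/jpeg, video/mp4).
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"cannot infer MIME type for {path}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"

# Hypothetical local file; the resulting part drops into message.content as-is.
part = {"type": "image_url", "image_url": {"url": to_data_url("before.jpg")}}
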
Mixed text + image + video request
curl https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "abliterated-model",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Compare the still and the clip — what changed?" },
          { "type": "image_url", "image_url": { "url": "https://example.com/before.jpg" } },
          { "type": "video_url", "video_url": { "url": "https://example.com/after.mp4" } }
        ]
      }
    ]
  }'
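If the request succeeds, the response uses the standard chat-completions shape, so the model's answer is the text at choices[0].message.content.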
FAQ


What input types does the API accept?

Three part types: text, image_url (image attachments), and video_url (video attachments; chat-completions only). Mix them in any order inside a single message.content array.

Which endpoints support which inputs?

/v1/chat/completions and /policy/chat/completions accept text + image + video. /v1/messages, /v1/responses, and their /policy/... siblings accept text + image only — video is rejected.

What are the size limits?

15 MB per image (PNG/JPEG/WEBP/GIF), 25 MB per video (MP4/WEBM/MOV). Total request body cap is 35 MB after base64 encoding.
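Since oversized attachments are rejected server-side, a client-side pre-check can save a round-trip. A minimal sketch of the limits above; the helper is hypothetical, and note that base64 inflates each attachment to 4 × ceil(bytes / 3), so the raw-byte budget under the 35 MB cap is lower.

import os

MAX_IMAGE = 15 * 1024 * 1024  # per-image limit
MAX_VIDEO = 25 * 1024 * 1024  # per-video limit
MAX_BODY = 35 * 1024 * 1024   # total body cap, measured after base64 encoding

def check_attachments(image_paths, video_paths):
    encoded_total = 0
    for path, cap, label in (
        [(p, MAX_IMAGE, "image") for p in image_paths]
        + [(p, MAX_VIDEO, "video") for p in video_paths]
    ):
        raw = os.path.getsize(path)
        if raw > cap:
            raise ValueError(f"{path}: {raw} bytes exceeds the {label} limit")
        encoded_total += 4 * ((raw + 2) // 3)  # base64-encoded size
    # Approximate: the JSON envelope adds a little on top of the attachments.
    if encoded_total > MAX_BODY:
        raise ValueError(f"attachments encode to {encoded_total} bytes, over the 35 MB body cap")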

Are inputs moderated?

Text and images are sent through OpenAI's omni-moderation API server-side. Per-frame video moderation is planned but not yet implemented. Rejections return HTTP 400 with a moderation reason code.
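With the OpenAI Python SDK, a moderation rejection surfaces as a BadRequestError (HTTP 400). A sketch assuming the key and model from the curl example above; the exact field carrying the reason code is not specified here, so inspect e.body.

import os
from openai import OpenAI, BadRequestError

client = OpenAI(base_url="https://api.abliteration.ai/v1", api_key=os.environ["ABLIT_KEY"])

try:
    resp = client.chat.completions.create(
        model="abliterated-model",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ]}],
    )
except BadRequestError as e:
    # HTTP 400; the body carries the moderation reason code.
    print(e.status_code, e.body)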

Can I send multiple images and a video in the same request?

Yes. There's no fixed cap on parts per message — you're limited by the 35 MB body cap and by per-attachment size limits. Latency goes up with each attachment.

Is the API OpenAI-compatible?

Yes. Use the OpenAI Python or Node SDK with the base URL pointed at https://api.abliteration.ai/v1; existing chat-completions code works unmodified for text + image. Video uses a video_url content part that mirrors the image_url shape (see the sketch below).
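A sketch with the OpenAI Python SDK, mirroring the curl example above; only the base URL changes. The SDK's static types don't declare video_url, but the dict is serialized to the API unchanged.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.abliteration.ai/v1",
    api_key=os.environ["ABLIT_KEY"],
)

resp = client.chat.completions.create(
    model="abliterated-model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the still and the clip. What changed?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.jpg"}},
            # Not in the SDK's type stubs, but passed through as-is.
            {"type": "video_url", "video_url": {"url": "https://example.com/after.mp4"}},
        ],
    }],
)
print(resp.choices[0].message.content)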

Does anon free-tier work?

Yes for text and image; no for video. Anonymous callers send X-Free-Tier: true and get one free call. Video always requires authentication, while text + image requests stay in the free tier.
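A raw-HTTP sketch of an anonymous free-tier call, assuming the requests library; there is no Authorization header, and the X-Free-Tier flag comes from the answer above. Adding a video_url part here would be rejected.

import requests

resp = requests.post(
    "https://api.abliteration.ai/v1/chat/completions",
    headers={"X-Free-Tier": "true"},  # anonymous: no Authorization header
    json={
        "model": "abliterated-model",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "What does this screenshot show?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
        ]}],
    },
)
print(resp.status_code, resp.json())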

What about streaming?

Set stream: true. SSE delta chunks come back the same way regardless of input type — output is text. Time to first token is higher for image/video inputs because the backend samples and tokenizes media before generating.
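A streaming sketch with the same SDK setup as above; chunks that carry no text (for example a trailing usage chunk) are skipped.

import os
from openai import OpenAI

client = OpenAI(base_url="https://api.abliteration.ai/v1", api_key=os.environ["ABLIT_KEY"])

stream = client.chat.completions.create(
    model="abliterated-model",
    stream=True,
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Narrate what happens in this clip."},
        {"type": "video_url", "video_url": {"url": "https://example.com/after.mp4"}},
    ]}],
)
for chunk in stream:
    # Deltas are text-only regardless of input modalities.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)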