Reference · Updated 2026-05-01

Vision-capable LLM API

Definition and examples for sending images to vision-capable models via OpenAI-compatible chat completions.

Vision-capable LLM APIs accept images and text in the same request.

The OpenAI-compatible content array lets you mix text and image_url parts.

abliteration.ai exposes vision on /v1/chat/completions, /v1/messages, /v1/responses, and their /policy/... siblings.

Definition

Vision-capable LLM API

A vision-capable LLM API lets you include images as inputs and receive natural language or structured outputs from the model. On abliteration.ai, vision is available across all chat-style endpoints, with images moderated server-side before reaching the model.

Why it matters
  • Extract descriptions, captions, or structured data from images.
  • Summarize screenshots, diagrams, and scanned documents.
  • Combine visual context with text instructions for better reasoning.
  • Build OCR or product-description pipelines without a separate vision API.
How it works
  1. Choose a vision-capable model id.
  2. Send message.content as an array of text and image_url parts.
  3. Use HTTPS URLs or base64 data URLs for images.
  4. Keep raw image size under 15 MB and the longest side under ~1280 px for predictable latency.
  5. Stream outputs by setting stream: true when you need faster time-to-first-token.
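
The steps above can be sketched in Python as a small payload builder. This is a minimal sketch, not an official client: the helper names to_data_url and build_vision_payload and the constant MAX_IMAGE_BYTES are hypothetical, and reading "15 MB" as 15 × 1024 × 1024 bytes is an assumption.

```python
import base64
from typing import Sequence

# Assumption: "15 MB" is interpreted as 15 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 15 * 1024 * 1024


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 15 MB raw size limit")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_payload(model: str, prompt: str, image_urls: Sequence[str],
                         stream: bool = False) -> dict:
    """Build a request body with one text part followed by image_url parts."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    if stream:
        payload["stream"] = True  # only set the flag when streaming is wanted
    return payload


payload = build_vision_payload(
    "vision-model-id",  # placeholder; substitute a real vision-capable model id
    "Describe the image.",
    ["https://example.com/cat.jpg", to_data_url(b"\x89PNG\r\n\x1a\n")],
)
```

The resulting dict can be serialized with json.dumps and posted to /v1/chat/completions with any HTTP client. The eight bytes passed to to_data_url are only the PNG signature, used here to keep the example short, not a complete image.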
Example request
curl https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision-model-id",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe the image." },
          { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
        ]
      }
    ]
  }'
FAQ

Can I send multiple images in one request?

Yes. Include multiple image_url parts, but keep the count small (≤4) to avoid high latency. Each image adds tokens and processing time.

Do I need to send base64 data?

No. Public HTTPS URLs work; base64 data URLs are an alternative for when you cannot host the image. The backend fetches HTTPS URLs server-side and applies SSRF protection against private IP addresses.
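
Because the server will not fetch from private addresses, a cheap client-side check can help decide between passing a URL and falling back to a base64 data URL. The sketch below is an assumption about what such a pre-flight check might look like; the function name looks_publicly_fetchable is hypothetical, and the server's actual SSRF rules are not documented here.

```python
import ipaddress
from urllib.parse import urlparse


def looks_publicly_fetchable(url: str) -> bool:
    """Return True for https URLs whose host is not a private IP literal.

    Hostnames are not resolved, so this is only a cheap pre-flight check;
    the server remains the authority on what it will fetch.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        return True  # a DNS name rather than an IP literal
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```

When this returns False for an image you control, encoding the bytes as a data URL avoids the server-side fetch entirely.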

Does streaming work with vision outputs?

Yes. Use stream: true and handle delta chunks just like text-only responses. Time to first token is higher for vision than for text-only.
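
As an illustration of handling delta chunks, the sketch below parses server-sent event lines in the shape used by OpenAI-style streaming ("data: {json}" lines ending with "data: [DONE]"). That abliteration.ai emits exactly this shape is an assumption based on its OpenAI compatibility; the function name iter_text_deltas and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample lines, not captured from a real response.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"A cat "}}]}',
    'data: {"choices":[{"delta":{"content":"on a sofa."}}]}',
    "data: [DONE]",
]
text = "".join(iter_text_deltas(sample))
```

In practice the lines would come from a streaming HTTP response rather than a list; the parsing loop stays the same.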

What image formats are supported?

PNG, JPEG, WEBP, and GIF. Convert anything else to PNG or JPEG before sending. Max raw size is 15 MB per image.
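
A local pre-check can catch unsupported files before upload. The sketch below identifies the four listed formats from their standard file signatures; the function names and the byte interpretation of "15 MB" (15 × 1024 × 1024) are assumptions, and the server may validate differently.

```python
from typing import Optional

# Assumption: "15 MB" is interpreted as 15 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 15 * 1024 * 1024


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return the MIME type for PNG, JPEG, GIF, or WEBP bytes, else None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    # WEBP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def is_sendable(data: bytes) -> bool:
    """True when the bytes are a supported format within the size limit."""
    return len(data) <= MAX_IMAGE_BYTES and sniff_image_format(data) is not None
```

Checking the signature rather than the file extension avoids sending a mislabeled file, such as a BMP renamed to .png.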

How many tokens does an image use?

An image is resized by Qwen2.5-VL's smart_resize and tokenized in 28×28 pixel patches. A 768×768 image uses roughly 750 tokens; a 1280×1280 image uses roughly 2,000 tokens. Token cost scales with dimensions, not file size, so downscale to the smallest size that preserves the detail you need.
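
The arithmetic can be approximated as follows. This is a simplified estimate, not the service's billing formula: it rounds each side to the nearest multiple of 28 and counts patches, ignoring smart_resize's minimum and maximum pixel bounds and any special tokens. The function names are hypothetical.

```python
from typing import Tuple

PATCH = 28  # patch edge length in pixels


def fit_longest_side(width: int, height: int, limit: int = 1280) -> Tuple[int, int]:
    """Scale dimensions down so the longest side is at most `limit`."""
    longest = max(width, height)
    if longest <= limit:
        return width, height
    scale = limit / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token count: one token per 28x28 patch after rounding."""
    cols = max(1, round(width / PATCH))
    rows = max(1, round(height / PATCH))
    return cols * rows
```

Under these assumptions, estimate_image_tokens(768, 768) gives 729 and estimate_image_tokens(1280, 1280) gives 2116, in line with the rough figures quoted above.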

Are images moderated?

Yes. Every image is sent through OpenAI's omni-moderation API server-side. Rejected images return HTTP 400 with error.code = 'moderation_blocked'. The accompanying text prompt is moderated separately.
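
A client can distinguish a moderation rejection from other 400 errors by checking error.code. The sketch below assumes the error is returned as a JSON body with a top-level "error" object; the function name and the sample body are hypothetical.

```python
import json


def is_moderation_block(status: int, body: str) -> bool:
    """True when a response is HTTP 400 with error.code == 'moderation_blocked'."""
    if status != 400:
        return False
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return False  # not JSON, so not a structured moderation error
    error = parsed.get("error") if isinstance(parsed, dict) else None
    return isinstance(error, dict) and error.get("code") == "moderation_blocked"


# Hand-written sample body; fields other than error.code are illustrative.
blocked = is_moderation_block(
    400, '{"error": {"code": "moderation_blocked", "message": "image rejected"}}'
)
```

Treating this case separately lets an application show a clear message to the user instead of retrying a request that will be rejected again.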

What about CSAM detection?

CSAM detection is required for B2B general availability (GA) and is not yet integrated. It is tracked as TODO(CSAM-B2B-GA) in the moderation pipeline, and integration with hash-matching is planned before public launch.

Are pasted/dropped images supported in the playground?

Yes. Drop or paste an image, or click the paperclip, in the playground or landing demo. Files are validated locally before encoding.