Vision-capable LLM API

Vision-capable LLM APIs accept images and text in the same request.

The OpenAI-compatible content array lets you mix text and image_url parts.

Definition of a vision-capable LLM API

A vision-capable LLM API lets you include images as inputs and receive natural language or structured outputs from the model.

Why a vision-capable LLM API matters

Extract descriptions, captions, or structured data from images.
Summarize screenshots, diagrams, and scanned documents.
Combine visual context with text instructions for better reasoning.

How it works

Choose a vision-capable model id.
Send each message's content as an array of text and image_url parts.
Use HTTPS URLs or base64 data URLs for images.
Stream outputs when you need faster time-to-first-token.

Example request

curl https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision-model-id",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe the image." },
          { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
        ]
      }
    ]
  }'

Frequently Asked Questions

Can I send multiple images in one request?

Yes. Include multiple image_url parts, but keep the count small to avoid high latency.
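
As a rough sketch of a multi-image request, the helper below prints a payload with one text part followed by two image_url parts. The function name build_two_image_payload, the prompt, and the second image URL are hypothetical. The values are interpolated without JSON escaping, so this assumes they contain no quotes or backslashes.

```bash
# Hypothetical helper: print a chat payload with one text part and two image parts.
# Arguments: model id, prompt text, first image URL, second image URL.
# Note: arguments are NOT JSON-escaped (assumption: no quotes or backslashes).
build_two_image_payload() {
  local model="$1" prompt="$2" url_a="$3" url_b="$4"
  cat <<EOF
{
  "model": "$model",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "$prompt" },
        { "type": "image_url", "image_url": { "url": "$url_a" } },
        { "type": "image_url", "image_url": { "url": "$url_b" } }
      ]
    }
  ]
}
EOF
}

# Usage (sends a real request, so it is left commented out):
# build_two_image_payload "vision-model-id" "Compare the two images." \
#   "https://example.com/cat.jpg" "https://example.com/dog.jpg" |
#   curl https://api.abliteration.ai/v1/chat/completions \
#     -H "Authorization: Bearer $ABLIT_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```

Piping the payload to curl with -d @- keeps the JSON out of the command line, which helps once the content array grows.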

Do I need to send base64 data?

No. Public HTTPS URLs work. Base64 data URLs are an alternative for images you cannot host.
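
When hosting is not an option, a data URL can be built from a local file. The sketch below assumes bash and a base64 command that reads standard input. image_to_data_url is a hypothetical helper name, and the MIME type is supplied by the caller rather than detected.

```bash
# Hypothetical helper: encode a local image as a base64 data URL.
# Arguments: path to the image file, MIME type (for example image/jpeg).
image_to_data_url() {
  local path="$1" mime="$2"
  # Some base64 implementations wrap their output, so strip newlines.
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$path" | tr -d '\n')"
}

# Usage (the file path is a placeholder):
# url=$(image_to_data_url ./cat.jpg image/jpeg)
# Then place "$url" in the "url" field of an image_url part.
```

Base64 encoding makes the request body roughly a third larger than the original file, so a hosted URL is usually the lighter choice when one is available.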

Does streaming work with vision outputs?

Yes. Set stream: true in the request body and handle delta chunks just as you would for text-only responses.
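
A minimal sketch of consuming the stream from the shell follows. It assumes the endpoint uses the usual OpenAI-style server-sent-event framing (lines prefixed with "data: " and a final "data: [DONE]" sentinel), which this page does not confirm. sse_chunks is a hypothetical helper, and the sketch assumes bash.

```bash
# Hypothetical helper: read a server-sent-event stream on stdin and print each
# JSON chunk on its own line, stopping at the [DONE] sentinel.
sse_chunks() {
  local line
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    line="${line%$'\r'}"   # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Usage (sends a real request, so it is left commented out).
# -N turns off curl's output buffering so chunks appear as they arrive.
# curl -N https://api.abliteration.ai/v1/chat/completions \
#   -H "Authorization: Bearer $ABLIT_KEY" \
#   -H "Content-Type: application/json" \
#   -d '{
#     "model": "vision-model-id",
#     "stream": true,
#     "messages": [
#       {
#         "role": "user",
#         "content": [
#           { "type": "text", "text": "Describe the image." },
#           { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
#         ]
#       }
#     ]
#   }' | sse_chunks
```

Each printed line is one JSON chunk. Extracting the delta text from it is left to a JSON tool of your choice.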