Use case · Updated 2026-05-01

Screenshot analysis API

Send screenshots to a multimodal LLM and get descriptions, error explanations, or structured UI extraction back.

Screenshot analysis turns pixel-level UI captures into text the rest of your app can act on.

Common patterns: customer support triage, automated bug reports, accessibility narration, end-to-end test failure summaries, and SaaS onboarding assistants.

Definition

Screenshot analysis API

A screenshot analysis API accepts a screenshot image plus a text prompt and returns a description, structured extraction, or error explanation grounded in what's visible on screen.

Why it matters
  • Customers send screenshots faster than they describe problems — extract the actual issue from the picture.
  • End-to-end test runs produce thousands of failure screenshots — summarize them at scale instead of reviewing one by one.
  • Accessibility tools can narrate dynamic UI states without alt-text instrumentation.
  • Onboarding flows can answer 'what does this screen mean?' without writing per-screen documentation.
How it works
  1. Capture the screenshot client-side (HTML5 canvas, OS APIs, headless browser) and base64-encode it or upload it to a public URL (see the capture sketch after this list).
  2. POST to /v1/chat/completions with content blocks: a text prompt describing the task, then an image_url block with the screenshot.
  3. For repeated tasks, write a focused prompt: 'List the visible error message and the button the user most likely should click next.' beats 'Describe this image.'
  4. Stream responses (stream: true) when the user is waiting for the answer; a streaming sketch follows the triage example below.
  5. For structured output, ask for JSON in the prompt and validate the response client-side.
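
A minimal capture-and-encode sketch for step 1, assuming a Chromium binary on PATH and GNU coreutils base64; the URL and filenames are placeholders, not part of the API.

# Render the page headlessly and write screenshot.png to the working directory.
chromium --headless --screenshot=screenshot.png \
  --window-size=1280,720 https://app.example.com/dashboard

# Base64-encode for the data URL field (-w0 disables line wrapping; GNU coreutils).
B64=$(base64 -w0 screenshot.png)
echo "data:image/png;base64,${B64}" > image_url.txt
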
Triage a support screenshot
curl https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "abliterated-model",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Look at this support screenshot. Reply with JSON: {error_message, screen_name, suggested_next_action}."
          },
          { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
        ]
      }
    ]
  }'
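
A streaming-and-validation sketch covering steps 4 and 5. curl -N disables output buffering so chunks print as they arrive; jq (if installed) then checks that the extracted JSON carries the fields the prompt asked for. The model name and field names follow the triage example above; $REPLY stands in for the message content you assemble from the stream.

curl -N https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "abliterated-model",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Look at this support screenshot. Reply with JSON: {error_message, screen_name, suggested_next_action}." },
          { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
        ]
      }
    ]
  }'

# Step 5: validate the assembled reply before acting on it.
# jq -e exits nonzero when a required key is missing.
echo "$REPLY" | jq -e 'has("error_message") and has("screen_name") and has("suggested_next_action")'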
FAQ

Frequently asked questions.

What's the right resolution for screenshots?

Match the source; don't upscale. The model uses Qwen2.5-VL's smart_resize tokenizer, so dimensions drive token cost (tokens ≈ (H × W) / 784). Downscale to 768px on the longest side for general descriptions, 1280px for fine UI text.
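
A downscale sketch using ImageMagick (an assumption; any resizer works). The '>' geometry qualifier only shrinks images larger than the bound, so smaller screenshots pass through unchanged:

# Cap the longest side at 1280px for fine UI text (use '768x768>' for general descriptions).
magick screenshot.png -resize '1280x1280>' screenshot-resized.png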

Can it read on-screen text reliably?

Yes for clear, large UI text. For dense text (logs, code, terminal output) it helps to crop tightly around the relevant region and include a prompt like 'Quote the exact error text verbatim.' For long documents, use the document-image-extraction pattern instead.
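
When the relevant region's position is known (a fixed error banner, say), cropping before upload raises text fidelity and cuts token cost. An ImageMagick sketch; the geometry (width x height + x-offset + y-offset) is illustrative:

# Extract an 800x200 region offset 40px from the left and 600px from the top.
magick screenshot.png -crop 800x200+40+600 +repage error-region.png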

How do I avoid hallucinated descriptions?

Be explicit about uncertainty: prompt with 'If you cannot tell from the image, say so — do not guess.' Ask for direct quotes when text matters. For numeric extraction, ask the model to label its confidence.
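
One way to combine all three tactics in the text block of the request; the field names are illustrative, not a fixed schema:

{
  "type": "text",
  "text": "Reply with JSON: {error_message_verbatim, screen_name, confidence}. Quote error_message_verbatim exactly as it appears on screen. Set confidence to high, medium, or low. If you cannot read the text from the image, set error_message_verbatim to null; do not guess."
}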

Is it OK to send screenshots that contain user data?

Yes — abliteration.ai is zero-data-retention by default. Prompts and images are not stored beyond the request lifecycle. For per-tenant guarantees see /zero-data-retention-ai-api.

How fast is the response?

First token typically arrives in 1–3 seconds for a 1280×720 screenshot. Use stream: true to start showing output as it generates. Latency scales with image dimensions, not file size.

Can I send multiple screenshots at once?

Yes — add multiple image_url blocks. 'Compare these two screens and tell me what changed' is a common pattern. Keep it to ≤4 to maintain response quality and predictable latency.
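
A two-screenshot comparison sketch, structured like the triage example above with one image_url block per image; the base64 payloads are placeholders:

curl https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "abliterated-model",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Compare these two screens and tell me what changed." },
          { "type": "image_url", "image_url": { "url": "data:image/png;base64,BEFORE..." } },
          { "type": "image_url", "image_url": { "url": "data:image/png;base64,AFTER..." } }
        ]
      }
    ]
  }'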

Are screenshots moderated?

Yes — same OpenAI omni-moderation as any image attachment. UI screenshots almost never trigger rejection unless they contain user-generated content that crosses moderation thresholds.