Screenshot analysis API
Send screenshots to a multimodal LLM and get descriptions, error explanations, or structured UI extraction back.
Screenshot analysis turns pixel-level UI captures into text the rest of your app can act on.
Common patterns: customer support triage, automated bug reports, accessibility narration, end-to-end test failure summaries, and SaaS onboarding assistants.
A screenshot analysis API accepts a screenshot image plus a text prompt and returns a description, structured extraction, or error explanation grounded in what's visible on screen.
- Customers send screenshots faster than they describe problems — extract the actual issue from the picture.
- End-to-end test runs produce thousands of failure screenshots — summarize them at scale instead of reviewing one by one.
- Accessibility tools can narrate dynamic UI states without alt-text instrumentation.
- Onboarding flows can answer 'what does this screen mean?' without writing per-screen documentation.
1. Capture the screenshot client-side (HTML5 canvas, OS APIs, headless browser) and base64-encode it or upload it to a public URL.
2. POST to /v1/chat/completions with content blocks: a text prompt describing the task, then an image_url block with the screenshot.
3. For repeated tasks, write a focused prompt: 'List the visible error message and the button the user should most likely click next.' beats 'Describe this image.'
4. Stream responses (stream: true) when the user is waiting for the answer.
5. For structured output, ask for JSON in the prompt and validate the response client-side.
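The steps above can be sketched as a small Python helper that base64-encodes a screenshot and builds the request body (the helper names are illustrative; the endpoint, model name, and block shapes follow the curl example in this section):

```python
import base64
import json

API_URL = "https://api.abliteration.ai/v1/chat/completions"

def build_payload(screenshot_bytes: bytes, prompt: str, stream: bool = False) -> dict:
    """Chat-completions body: a text block first, then an image_url block with a data URL."""
    b64 = base64.b64encode(screenshot_bytes).decode("ascii")
    return {
        "model": "abliterated-model",
        "stream": stream,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }
        ],
    }

payload = build_payload(b"\x89PNG...", "List the visible error message.")
body = json.dumps(payload)  # POST this with any HTTP client, Authorization: Bearer $ABLIT_KEY
```

The same body, written out as raw JSON, is what the curl example below sends.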
curl https://api.abliteration.ai/v1/chat/completions \
-H "Authorization: Bearer $ABLIT_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "abliterated-model",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Look at this support screenshot. Reply with JSON: {error_message, screen_name, suggested_next_action}."
},
{ "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
]
}
]
}'

Frequently asked questions
What's the right resolution for screenshots?
Match the source resolution; don't upscale. The model uses Qwen2.5-VL's smart_resize tokenizer, so image dimensions drive token cost (tokens ≈ (H × W) / 784). Downscale to 768px on the longest side for general descriptions, or 1280px when fine UI text matters.
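A rough pre-flight check based on the formula above (the real smart_resize also snaps dimensions to multiples of 28, so treat the token count as an estimate):

```python
import math

def downscale(w: int, h: int, max_side: int = 768) -> tuple[int, int]:
    """Shrink so the longest side is at most max_side, preserving aspect ratio."""
    longest = max(w, h)
    if longest <= max_side:
        return w, h  # never upscale
    scale = max_side / longest
    return round(w * scale), round(h * scale)

def estimate_tokens(w: int, h: int) -> int:
    """Approximate image token cost: (H x W) / 784."""
    return math.ceil((w * h) / 784)

print(downscale(1920, 1080))       # (768, 432)
print(estimate_tokens(1280, 720))  # 1176
```

Running the estimate before upload makes cost differences concrete: a full 1920×1080 capture costs roughly 2.5× the tokens of its 768px downscale.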
Can it read on-screen text reliably?
Yes for clear, large UI text. For dense text (logs, code, terminal output) it helps to crop tightly around the relevant region and include a prompt like 'Quote the exact error text verbatim.' For long documents, use the document-image-extraction pattern instead.
How do I avoid hallucinated descriptions?
Be explicit about uncertainty: prompt with 'If you cannot tell from the image, say so — do not guess.' Ask for direct quotes when text matters. For numeric extraction, ask the model to label its confidence.
Is it OK to send screenshots that contain user data?
Yes — abliteration.ai is zero-data-retention by default. Prompts and images are not stored beyond the request lifecycle. For per-tenant guarantees see /zero-data-retention-ai-api.
How fast is the response?
First token typically arrives in 1–3 seconds for a 1280×720 screenshot. Use stream: true to start showing output as it generates. Latency scales with image dimensions, not file size.
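With stream: true the API returns OpenAI-style server-sent events. A sketch of pulling the text fragments out of the raw lines, shown on canned data rather than a live connection (the chunk shape follows the standard chat-completions delta format):

```python
import json

def iter_deltas(sse_lines):
    """Yield content fragments from 'data: {...}' SSE lines, stopping at [DONE]."""
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        chunk = line[len("data: "):]
        if chunk == "[DONE]":
            return
        delta = json.loads(chunk)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "The error is "}}]}',
    'data: {"choices": [{"delta": {"content": "401 Unauthorized."}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # The error is 401 Unauthorized.
```

Render each fragment as it arrives so the user sees output within the first-token latency window instead of waiting for the full response.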
Can I send multiple screenshots at once?
Yes — add multiple image_url blocks. 'Compare these two screens and tell me what changed' is a common pattern. Keep it to ≤4 to maintain response quality and predictable latency.
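The compare pattern is just additional image_url blocks in one content array (the URLs here are placeholders):

```python
def compare_content(before_url: str, after_url: str) -> list[dict]:
    """Content blocks for a before/after comparison: one prompt, then both screenshots in order."""
    return [
        {"type": "text",
         "text": "Compare these two screens and tell me what changed."},
        {"type": "image_url", "image_url": {"url": before_url}},
        {"type": "image_url", "image_url": {"url": after_url}},
    ]

content = compare_content("data:image/png;base64,...", "data:image/png;base64,...")
```

Block order matters for prompts like 'the first screenshot' vs 'the second', so keep the images in the order the prompt refers to them.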
Are screenshots moderated?
Yes — same OpenAI omni-moderation as any image attachment. UI screenshots almost never trigger rejection unless they contain user-generated content that crosses moderation thresholds.