Use case · Updated 2026-05-01

Document image extraction API

OCR-style structured extraction from photos and scans of receipts, invoices, IDs, forms, and tables — via OpenAI-compatible chat completions.

Document image extraction pulls structured data — fields, tables, totals, signatures — out of scanned or photographed documents.

It's the OCR-plus-reasoning pattern: instead of returning raw text strings, the model returns a typed JSON object that downstream code can validate and route.

Definition

Document image extraction API

A document image extraction API accepts a photo or scan of a document plus an extraction schema, and returns structured data (typically JSON) with the requested fields populated from what's visible in the image.

Why it matters
  • Replace manual data entry for receipts, invoices, expense reports, and intake forms.
  • Pull line items + totals out of variable-format invoices without writing per-vendor parsers.
  • Extract IDs and addresses from KYC documents into a typed schema your verification flow can act on.
  • Convert paper questionnaires and surveys into structured rows ready for analytics.
How it works
  1. Photograph or scan the document at a resolution where the smallest required text is legible.
  2. POST to /v1/chat/completions with a system prompt that defines the schema (or a JSON schema if you use structured outputs) plus the image as an image_url block.
  3. Ask for JSON output explicitly: 'Reply with JSON matching this schema. If a field is not visible, set it to null.'
  4. Validate the parsed JSON client-side — never trust an LLM's output to be schema-conformant without a check.
  5. For multi-page documents, send each page as a separate image_url in the same request.
Extract a receipt into structured JSON
curl https://api.abliteration.ai/v1/chat/completions \
  -H "Authorization: Bearer $ABLIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "abliterated-model",
    "messages": [
      {
        "role": "system",
        "content": "Extract receipt data. Reply only with JSON: {merchant, date, items:[{name,qty,price}], subtotal, tax, total, currency}. Use null for fields not visible."
      },
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Extract this receipt." },
          { "type": "image_url", "image_url": { "url": "https://example.com/receipt.jpg" } }
        ]
      }
    ]
  }'
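Step 4 above says to validate the parsed JSON client-side. A minimal stdlib-only sketch of that check, assuming the receipt schema from the system prompt above (field names are taken from that prompt; the sample reply string is illustrative):

```python
import json

# Top-level fields the receipt prompt asks for; items is a list of row objects.
REQUIRED_FIELDS = {"merchant", "date", "items", "subtotal", "tax", "total", "currency"}
ITEM_FIELDS = {"name", "qty", "price"}

def parse_receipt(raw: str) -> dict:
    """Parse the model's reply and fail loudly on schema drift."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for i, item in enumerate(data["items"] or []):
        if not ITEM_FIELDS <= item.keys():
            raise ValueError(f"item {i} missing {sorted(ITEM_FIELDS - item.keys())}")
    return data

# Example reply a model might return for the curl request above.
reply = ('{"merchant": "Acme", "date": "2026-05-01", '
         '"items": [{"name": "Widget", "qty": 2, "price": 4.5}], '
         '"subtotal": 9.0, "tax": 0.72, "total": 9.72, "currency": "USD"}')
receipt = parse_receipt(reply)
```

In production you would typically swap this hand-rolled check for Pydantic or a JSON Schema validator, but the shape is the same: parse, verify field names and types, and reject anything that drifts.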
FAQ

Frequently asked questions.

Is this real OCR?

It's an OCR-replacement workflow, not a Tesseract-style character recognizer. The vision model reads the document and returns the fields you asked for. For raw character-by-character text you can ask for 'verbatim transcription' — but most extraction tasks benefit from skipping the intermediate OCR step.

What document types work well?

Receipts, invoices, purchase orders, IDs, business cards, intake forms, prescription labels, packing slips, bank statements, and tables. Highly stylized handwriting is the hardest case — typed/printed text and clear handwriting both work well.

How do I get reliable JSON output?

Define the schema explicitly in the system prompt with field names, types, and 'null if not visible' defaults. Validate client-side with Zod / Pydantic / your validator of choice. For mission-critical flows, use a two-pass approach: extract, then ask the model to verify its own extraction against the image.
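A sketch of the second pass of the two-pass approach, as a request builder. The payload shape mirrors the chat-completions request shown earlier; the verification prompt wording and the `{ok, corrections}` reply schema are illustrative assumptions, not a fixed API:

```python
import json

def build_verify_request(extraction: dict, image_url: str,
                         model: str = "abliterated-model") -> dict:
    """Second pass: ask the model to check its own extraction against the image."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": ("Verify the extraction below against the image. "
                         "Reply only with JSON: {ok: bool, corrections: {field: value}}.")},
            {"role": "user",
             "content": [
                 {"type": "text",
                  "text": "Extraction to verify:\n" + json.dumps(extraction)},
                 {"type": "image_url", "image_url": {"url": image_url}},
             ]},
        ],
    }

req = build_verify_request({"total": 9.72}, "https://example.com/receipt.jpg")
```

POST the result to /v1/chat/completions exactly like the first pass, then apply any `corrections` client-side before routing the record onward.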

Can it handle multi-page documents?

Yes — send each page as a separate image_url block in the same message. An instruction like 'Combine fields across pages into a single JSON object' works well. For large documents (>10 pages), split into multiple requests and merge client-side.
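Assembling that multi-page message is mechanical: one text block with the combine instruction, then one image_url block per page. A minimal sketch (page URLs and the schema prompt are placeholders):

```python
def build_multipage_request(page_urls: list[str], schema_prompt: str,
                            model: str = "abliterated-model") -> dict:
    """One request: a single user turn carrying every page as an image_url block."""
    content = [{"type": "text",
                "text": "Combine fields across pages into a single JSON object."}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in page_urls]
    return {"model": model,
            "messages": [{"role": "system", "content": schema_prompt},
                         {"role": "user", "content": content}]}

req = build_multipage_request(
    ["https://example.com/p1.jpg", "https://example.com/p2.jpg"],
    "Extract invoice data. Reply only with JSON: {vendor, line_items, total}.")
```

For documents past the ~10-page mark, call this once per chunk of pages and merge the resulting JSON objects client-side.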

What resolution should I send?

Aim for at least 300 DPI source. After capture, downscale so the smallest required text is around 30 pixels tall. For receipts that's typically 1280–2048 px on the longest side. Higher resolution increases token cost (roughly (H × W) / 784 tokens) without much accuracy gain past 2048px.
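The sizing rule above can be turned into a small helper: scale so the smallest required text lands near 30 px, cap the longest side at 2048, and estimate the token cost. The function names are illustrative; the constants come from the guidance in this answer:

```python
def target_size(width: int, height: int, smallest_text_px: float,
                min_text_px: int = 30, cap: int = 2048) -> tuple[int, int]:
    """Dimensions to resize to: smallest text ~min_text_px tall, longest side <= cap."""
    scale = min_text_px / smallest_text_px
    w, h = round(width * scale), round(height * scale)
    longest = max(w, h)
    if longest > cap:  # past ~2048 px cost keeps growing but accuracy does not
        w, h = round(w * cap / longest), round(h * cap / longest)
    return w, h

def token_estimate(width: int, height: int) -> int:
    """Rough vision token cost from the (H x W) / 784 rule of thumb."""
    return (width * height) // 784

# A 3000x4000 scan whose smallest text is 45 px tall gets downscaled twice:
# once for the 30 px text target, once for the 2048 px longest-side cap.
dims = target_size(3000, 4000, smallest_text_px=45)
```

Apply the resulting dimensions with your image library of choice (e.g. Pillow's `Image.resize`) before base64-encoding or uploading the image.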

How does this handle PII?

abliteration.ai is zero-data-retention by default — extracted documents are not stored beyond the request. For redaction-during-extraction patterns see /policy-template/pii-redaction-safe-rewrite.

What about table extraction?

Works well. Define the schema as 'rows: [{col1, col2, ...}]' in the system prompt. For wide or complex tables, ask for one section at a time and merge client-side. Watch for column-header drift on multi-page tables.

Are document images moderated?

Yes — same OpenAI omni-moderation as any image. Standard business documents almost never trigger rejection. For ID verification flows that include face photos, use the /policy/* siblings if you need workspace-specific moderation thresholds.