# Document Extraction and Parsing for Agents

> Practitioner reference for the document-ingestion pipeline agents use: parse/OCR, layout/structure extraction, schema-constrained field extraction — with a verified tooling landscape (OSS and cloud).

Category: Reference · Updated: 2026-06-16 · Tags: documents, ocr, parsing, pdf, extraction, rag, agents, vlm
Canonical: https://changegamer.ai/resources/document-extraction-for-agents

Agents that ingest PDFs, scans, and tabular documents must solve three distinct problems before any LLM reasoning happens: (1) get readable text and structure out of the file, (2) preserve layout cues such as tables, headings, and reading order so chunking is coherent, and (3) optionally extract specific fields into a schema. The first two are parsing problems and the third is an extraction problem; the two stages fail in different ways and are best handled separately.

## The two-stage distinction

**Stage A — Document-to-structure parsing:** Convert raw bytes (PDF, DOCX, image) into clean Markdown or structured JSON that an LLM can read cheaply. The output is a faithful representation of the document, not a business object.

**Stage B — Schema-constrained field extraction:** Take the structured output from Stage A (or feed the raw document to a VLM directly) and extract specific fields — invoice total, signatory name, table rows — into a validated schema. This is where structured-output guarantees from the LLM layer apply (see /resources/reliable-tool-calling).

Conflating A and B is a common architecture mistake. It shows up in two forms: attempting field extraction before the document is cleanly parsed, or running a heavyweight extraction pipeline on documents that only need Markdown for RAG.
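
One way to keep the two stages apart in code is to give each its own function signature and make Stage B optional. The sketch below is illustrative only: `ParsedDocument`, `ingest`, and `plain_text_parser` are hypothetical names, and the stand-in parser simply decodes UTF-8 where a real pipeline would call one of the tools listed later.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ParsedDocument:
    """Stage A output: a faithful rendering of the file, not a business object."""
    source: str
    markdown: str


# Hypothetical shapes: any parser or extractor matching these would fit.
Parser = Callable[[bytes, str], ParsedDocument]
Extractor = Callable[[ParsedDocument], dict]


def ingest(raw: bytes, source: str, parse: Parser,
           extract: Optional[Extractor] = None):
    """Always run Stage A; run Stage B only when the caller supplies an extractor."""
    parsed = parse(raw, source)
    fields = extract(parsed) if extract is not None else None
    return parsed, fields


def plain_text_parser(raw: bytes, source: str) -> ParsedDocument:
    """Stand-in Stage A parser (assumption): treats the bytes as UTF-8 Markdown."""
    return ParsedDocument(source=source, markdown=raw.decode("utf-8"))
```

A RAG-only caller passes no extractor and gets Markdown back; a field-extraction caller adds one without touching the parsing step.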

## Three parsing approaches

**1. Traditional OCR + layout analysis** — a pipeline of specialized models: text detection, character recognition, layout classification, table reconstruction, reading-order recovery. Battle-tested on high-volume document processing. Hard cases: multi-column PDFs, rotated scans, overlapping text/image regions, complex table spans. The cloud services below handle this tier.

**2. Vision-language models (VLMs) reading documents directly** — pass a page image to a multimodal LLM and ask it to return Markdown or JSON. Higher per-page cost than traditional OCR but handles layout ambiguity that rule-based pipelines miss. Best for low-to-medium volume, complex layouts, or when the extraction schema is known upfront.

**3. Document-parsing libraries and services targeting LLM output** — tools purpose-built to produce LLM-ready Markdown/JSON: they combine OCR, layout models, and optional VLM passes internally, exposing a clean API. The tools in the next section, both open-source libraries and the managed LlamaParse service, fall here.

## Open-source and library tooling

**Docling** (IBM, MIT license, github.com/docling-project/docling) — converts PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, and more into Markdown or JSON. Includes TableFormer for table structure recovery, reading-order detection, and formula handling. Integrates with LangChain and LlamaIndex. Hosted under the LF AI & Data Foundation. Verified open-source.

**Marker** (Datalab / Vik Paruchuri, GPL-3.0 code / AI Pubs Open Rail-M weights, github.com/datalab-to/marker) — converts PDF, DOCX, PPTX, XLSX, EPUB, images to Markdown or JSON with optional JSON Schema extraction. Uses Surya internally for OCR and layout. Cloud API available at datalab.to. ~36K GitHub stars. Weights are free for research and startups under $2M revenue; commercial licensing required beyond that.

**Surya OCR** (Datalab, Apache 2.0 code / AI Pubs Open Rail-M weights, github.com/datalab-to/surya) — 650M-parameter model for OCR, layout analysis, reading-order detection, and table recognition in 90+ languages. Used as the backbone inside Marker. Weights are free for research and startups under $2M revenue, the same terms as Marker; commercial licensing required beyond that.

**Unstructured** (Unstructured-IO, Apache 2.0, github.com/Unstructured-IO/unstructured) — document ETL library for LLMs; partitions 40+ document types into typed elements (Title, NarrativeText, Table, Image), with 30+ connectors for data sources. Widely used for RAG ingestion pipelines. Also available as a hosted platform for production-grade workflows.

**MinerU** (OpenDataLab, custom Apache-2.0-based license since v3.1, github.com/opendatalab/MinerU) — high-accuracy PDF-to-Markdown/JSON engine supporting 109 languages via a VLM+OCR dual engine. Converts PDF, DOCX, PPTX, XLSX, images, and web pages; preserves tables as HTML, formulas as LaTeX. ~68K GitHub stars.

**LlamaParse** (LlamaIndex, cloud SaaS, llamaindex.ai/llamaparse) — managed document parsing service; four processing tiers (Fast, Cost Effective, Agentic, Agentic Plus); outputs LLM-ready Markdown and structured JSON; free tier includes ~1,000 pages/month. Backed by GPT-4.1 and Gemini 2.5 Pro for complex layouts (as of May 2025).

## Cloud / managed services

**AWS Textract** (Amazon, SaaS, docs.aws.amazon.com/textract/) — layout-aware OCR returning blocks, key-value pairs, tables, and query responses as structured JSON. Synchronous API for images; asynchronous batch for multi-page PDFs. Native fit for AWS-first teams; tight S3 / Lambda integration.

**Azure AI Document Intelligence** (Microsoft, SaaS, learn.microsoft.com/azure/ai-services/document-intelligence/) — prebuilt models for invoices, receipts, tax forms, IDs, and general layout; custom models for domain-specific extraction. REST and SDK (Python, C#, Java, JS). GA version 4.0 (API 2024-11-30). Strong choice for Microsoft-stack environments.

**Google Document AI** (Google Cloud, SaaS, cloud.google.com/document-ai) — layout-aware OCR with pretrained processors for forms, invoices, and identity documents; custom extractor backed by Gemini 2.5 Pro (Preview, June 2025). Returns structured JSON including key-value pairs, tables, and bounding-box coordinates.

**Mistral OCR** (Mistral AI, SaaS API, docs.mistral.ai/models/ocr-3-25-12) — cloud OCR API accessed via the `/v1/ocr` endpoint or SDK. Latest version: OCR 3 (v25.12, December 2025). Returns Markdown with HTML table reconstruction; handles interleaved images, math, and complex layouts. Pricing: $2 per 1,000 pages ($1 with Batch API discount). Verified SaaS; model weights are not open.

**Reducto** (Reducto AI, SaaS, reducto.ai) — agentic document platform providing layout-aware OCR, parse/split/extract endpoints, and schema-grounded field extraction for production agent pipelines. 1B+ pages processed; targets enterprise accuracy requirements. Paid; funding verified ($108M raised in total, including a $75M Series B).
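
To illustrate what consuming this kind of structured JSON looks like, the sketch below walks a Textract-style response and keeps only `LINE` blocks above a confidence floor. It relies on the `Blocks`, `BlockType`, `Text`, and `Confidence` keys of the Textract response format; the function name `lines_from_blocks` and the sample payload are hypothetical, and the sample is hand-written rather than real service output.

```python
def lines_from_blocks(response: dict, min_confidence: float = 0.0) -> list[tuple[str, float]]:
    """Return (text, confidence) for each LINE block at or above min_confidence.

    Confidence is on the 0-100 scale used in the response.
    """
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, TABLE, and other block types
        confidence = block.get("Confidence", 0.0)
        if confidence >= min_confidence:
            lines.append((block.get("Text", ""), confidence))
    return lines


# Hand-written sample for illustration only (assumption, not service output).
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 42", "Confidence": 99.2},
        {"BlockType": "LINE", "Text": "T0tal 1O.00", "Confidence": 61.5},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.4},
    ]
}
```

Calling `lines_from_blocks(SAMPLE_RESPONSE, min_confidence=90.0)` keeps the clean line and drops the garbled one, which can then be sent for review instead of being silently discarded.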

## Structured field extraction (Stage B)

Once a document is parsed to Markdown or JSON, field extraction is a structured-output problem: define a JSON Schema for the target fields, run an LLM with constrained decoding or tool-calling strict mode, and validate the result. See /resources/reliable-tool-calling for the full pattern.
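
The validation step can be sketched with the standard library alone. This is a deliberately minimal stand-in, assuming a flat schema of field names mapped to Python types; `INVOICE_SCHEMA` and `validate_fields` are hypothetical names, the LLM call itself is not shown, and a production pipeline would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical flat schema: field name -> expected Python type.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "total": float,
    "currency": str,
}


def validate_fields(raw_json: str, schema: dict) -> tuple[dict, list[str]]:
    """Parse model output and check it against the schema.

    Returns (clean_fields, errors); an empty error list means the output is valid.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return {}, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return {}, ["top-level value must be an object"]

    clean, errors = {}, []
    for name, expected in schema.items():
        if name not in data:
            errors.append(f"missing field: {name}")
            continue
        value = data[name]
        # JSON has one number type, so accept ints where floats are expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        # Booleans are rejected outright; this schema has no boolean fields.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
            continue
        clean[name] = value
    for name in data:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return clean, errors
```

Returning errors as data rather than raising makes it straightforward to feed them back to the model for a retry or to log which fields fail most often.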

VLM-based extraction (pass the page image directly to a multimodal model with a schema prompt) skips Stage A but is costlier per page and harder to debug when fields are missed.

**Hard cases to test:** multi-column layouts, merged table cells, rotated or skewed scans, handwritten annotations, and forms with no consistent field label placement.

For high-stakes fields (financial, medical, legal), cross-validate extracted values against OCR confidence scores and route low-confidence results to a human review queue.
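
A minimal sketch of that routing follows. It assumes each extracted field carries an OCR confidence normalized to the range 0.0 to 1.0 (rescale first if your OCR reports percentages); `ExtractedField`, `route_fields`, and the 0.9 default threshold are illustrative choices, not values from any particular service.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    ocr_confidence: float  # assumed normalized to 0.0-1.0


def route_fields(fields: list[ExtractedField], threshold: float = 0.9):
    """Split fields into (accepted, needs_review) by OCR confidence."""
    accepted, needs_review = [], []
    for field in fields:
        if field.ocr_confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review
```

In practice the threshold is usually tuned per field type against a labeled sample, since an acceptable error rate for a memo line is not acceptable for a payment amount.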

## The agent integration angle

Parsed document output (Markdown or JSON) feeds directly into the RAG layer. Clean structure at parse time is one of the biggest levers on retrieval quality, because chunk boundaries can only follow headings and tables that the parser preserved. See /resources/rag-retrieval-for-agents for chunking strategies and embedding choices, and /resources/embeddings-vector-search for index selection.
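
As an illustration of why preserved structure matters, the sketch below splits parsed Markdown at ATX headings and attaches the heading path to each chunk. It is a simplified example, not the chunking strategy of any specific library: `chunk_by_heading` is a hypothetical name, and the function does not handle `#` lines inside fenced code blocks or enforce a maximum chunk size.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading section, with its heading path."""
    chunks = []
    path = []  # stack of (level, title) for the current position in the outline
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            # Pop siblings and deeper headings before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading path alongside the text lets retrieval results carry their context, for example "Doc > Pricing", even when the chunk body never repeats the section title.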

Document content is an untrusted surface: a malicious PDF can embed prompt-injection payloads in OCR-readable text. Strip or escape instruction-like patterns from parsed output before inserting it into agent context. See /resources/agentic-security-checklist, section 4 (untrusted content handling).
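
The strip-or-escape step might look like the sketch below, which replaces lines matching a few instruction-like patterns and wraps the rest in explicit delimiters. The pattern list, the `quarantine` function, and the `<untrusted_document>` tag are all illustrative assumptions; a short regex list is easy to evade, so treat this as one layer alongside the other controls in the checklist rather than a complete defense.

```python
import re

# Illustrative patterns only (assumption): real payloads vary widely.
SUSPICIOUS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"^\s*(system|assistant)\s*:", re.I),
]

OPEN_TAG = "<untrusted_document>"
CLOSE_TAG = "</untrusted_document>"


def quarantine(parsed_text: str) -> tuple[str, list[str]]:
    """Return (wrapped_text, flagged_lines) for parsed document output."""
    kept, flagged = [], []
    for line in parsed_text.splitlines():
        if any(pattern.search(line) for pattern in SUSPICIOUS):
            flagged.append(line)
            kept.append("[removed: instruction-like text]")
        else:
            # Drop delimiter look-alikes so the document cannot close the wrapper.
            kept.append(line.replace(CLOSE_TAG, "").replace(OPEN_TAG, ""))
    wrapped = OPEN_TAG + "\n" + "\n".join(kept) + "\n" + CLOSE_TAG
    return wrapped, flagged
```

Returning the flagged lines separately lets the pipeline log them or raise an alert, which is often more useful than silently dropping the text.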

## Verified sources

- Docling (IBM, MIT): https://github.com/docling-project/docling
- Marker (Datalab, GPL-3.0): https://github.com/datalab-to/marker
- Surya OCR (Datalab, Apache 2.0): https://github.com/datalab-to/surya
- Unstructured (Apache 2.0): https://github.com/Unstructured-IO/unstructured
- MinerU (OpenDataLab): https://github.com/opendatalab/MinerU
- LlamaParse docs: https://developers.llamaindex.ai/python/cloud/llamaparse/tiers/
- AWS Textract docs: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
- Azure AI Document Intelligence overview: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview
- Google Document AI overview: https://cloud.google.com/document-ai/docs/overview
- Mistral OCR 3 docs: https://docs.mistral.ai/models/ocr-3-25-12
- Reducto agentic document platform: https://reducto.ai/
