Document Extraction and Parsing for Agents

Reference · updated 2026-06-16 · Markdown variant

Practitioner reference for the document-ingestion pipeline that agents use: parse/OCR, layout/structure extraction, and schema-constrained field extraction, with a verified tooling landscape (OSS and cloud).

Agents that ingest PDFs, scans, and tabular documents must solve three distinct problems before any LLM reasoning happens: (1) get readable text and structure out of the file, (2) preserve layout cues — tables, headings, reading order — so chunking is coherent, and (3) optionally extract specific fields into a schema. These are separate stages with separate failure modes.

The two-stage distinction

Stage A — Document-to-structure parsing: Convert raw bytes (PDF, DOCX, image) into clean Markdown or structured JSON that an LLM can read cheaply. The output is a faithful representation of the document, not a business object.

Stage B — Schema-constrained field extraction: Take the structured output from Stage A (or feed the raw document to a VLM directly) and extract specific fields — invoice total, signatory name, table rows — into a validated schema. This is where structured-output guarantees from the LLM layer apply (see /resources/reliable-tool-calling).

Conflating A and B is a common architecture mistake. It shows up in two forms: attempting field extraction before the document is cleanly parsed, or running a heavyweight extraction pipeline on documents that only need Markdown for RAG.
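A minimal sketch of keeping the two stages separate. All names here (`ParsedDocument`, `parse_to_markdown`, `extract_fields`, `ingest`) are hypothetical and belong to none of the tools listed below. The parser only decodes UTF-8 text and the extractor is a regex stand-in for a schema-constrained LLM call; the point is the boundary between the stages, not the implementations.

```python
import re
from dataclasses import dataclass


@dataclass
class ParsedDocument:
    """Stage A output: a faithful representation, not a business object."""
    source_name: str
    markdown: str


def parse_to_markdown(source_name: str, raw: bytes) -> ParsedDocument:
    # Stand-in for a real parser (OCR, layout model, or parsing service).
    return ParsedDocument(source_name, raw.decode("utf-8", errors="replace"))


def extract_fields(doc: ParsedDocument) -> dict:
    # Stand-in for a schema-constrained LLM call; a regex keeps it runnable.
    match = re.search(r"Total:\s*\$?([\d,]+\.\d{2})", doc.markdown)
    total = float(match.group(1).replace(",", "")) if match else None
    return {"invoice_total": total}


def ingest(source_name: str, raw: bytes, need_fields: bool) -> dict:
    doc = parse_to_markdown(source_name, raw)   # Stage A always runs
    result = {"markdown": doc.markdown}
    if need_fields:                             # Stage B runs only on demand
        result["fields"] = extract_fields(doc)
    return result
```

A RAG-only caller passes `need_fields=False` and never pays for Stage B; an invoice workflow passes `True` and gets both outputs from the same parse.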

Three parsing approaches

1. Traditional OCR + layout analysis — a pipeline of specialized models: text detection, character recognition, layout classification, table reconstruction, reading-order recovery. Battle-tested on high-volume document processing. Hard cases: multi-column PDFs, rotated scans, overlapping text/image regions, complex table spans. The cloud services below handle this tier.

2. Vision-language models (VLMs) reading documents directly — pass a page image to a multimodal LLM and ask it to return Markdown or JSON. Higher per-page cost than traditional OCR but handles layout ambiguity that rule-based pipelines miss. Best for low-to-medium volume, complex layouts, or when the extraction schema is known upfront.

3. Document-parsing libraries and services targeting LLM output — tools purpose-built to produce LLM-ready Markdown/JSON: they combine OCR, layout models, and optional VLM passes internally, exposing a clean API. The open-source tools below fall here.

Open-source and library tooling

Docling (IBM, MIT license, github.com/docling-project/docling) — converts PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, and more into Markdown or JSON. Includes TableFormer for table structure recovery, reading-order detection, and formula handling. Integrates with LangChain and LlamaIndex. Hosted under the LF AI & Data Foundation. Verified open-source.

Marker (Datalab / Vik Paruchuri, GPL-3.0 code / AI Pubs Open Rail-M weights, github.com/datalab-to/marker) — converts PDF, DOCX, PPTX, XLSX, EPUB, images to Markdown or JSON with optional JSON Schema extraction. Uses Surya internally for OCR and layout. Cloud API available at datalab.to. ~36K GitHub stars. Weights are free for research and startups under $2M revenue; commercial licensing required beyond that.

Surya OCR (Datalab, Apache 2.0 code / AI Pubs Open Rail-M weights, github.com/datalab-to/surya) — 650M-parameter model for OCR, layout analysis, reading-order detection, and table recognition in 90+ languages. Used as the backbone inside Marker. Weights are free for research and startups under $2M revenue, the same threshold as Marker; commercial licensing is required beyond that.

Unstructured (Unstructured-IO, Apache 2.0, github.com/Unstructured-IO/unstructured) — document ETL library for LLMs; partitions 40+ document types into typed elements (Title, NarrativeText, Table, Image), with 30+ connectors for data sources. Widely used for RAG ingestion pipelines. Also available as a hosted platform for production-grade workflows.

MinerU (OpenDataLab, custom Apache-2.0-based license since v3.1, github.com/opendatalab/MinerU) — high-accuracy PDF-to-Markdown/JSON engine supporting 109 languages via a VLM+OCR dual engine. Converts PDF, DOCX, PPTX, XLSX, images, and web pages; preserves tables as HTML, formulas as LaTeX. ~68K GitHub stars.

LlamaParse (LlamaIndex, cloud SaaS, llamaindex.ai/llamaparse) — managed document parsing service; four processing tiers (Fast, Cost Effective, Agentic, Agentic Plus); outputs LLM-ready Markdown and structured JSON; free tier includes ~1,000 pages/month. Backed by GPT-4.1 and Gemini 2.5 Pro for complex layouts (as of May 2025).

Cloud / managed services

AWS Textract (Amazon, SaaS, docs.aws.amazon.com/textract/) — layout-aware OCR returning blocks, key-value pairs, tables, and query responses as structured JSON. Synchronous API for images; asynchronous batch for multi-page PDFs. Native fit for AWS-first teams; tight S3 / Lambda integration.
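A sketch of reading key-value pairs out of a Textract-style response, assuming the block layout documented for form analysis: `KEY_VALUE_SET` blocks whose `VALUE` relationships point at value blocks and whose `CHILD` relationships point at `WORD` blocks. The function works on a plain dict and makes no AWS call; the helper names and the sample are hand-written for illustration and are not real service output.

```python
def _text_of(block: dict, by_id: dict) -> str:
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response: dict) -> dict:
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                values.extend(_text_of(by_id[i], by_id) for i in rel["Ids"])
        pairs[_text_of(block, by_id)] = " ".join(values)
    return pairs


# Hand-written sample in the assumed block shape.
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Total:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "$1,234.50"},
]}
```

Flattening the block graph into a dict like this is usually the first step before Stage B, since the raw response is far more verbose than an LLM needs.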

Azure AI Document Intelligence (Microsoft, SaaS, learn.microsoft.com/azure/ai-services/document-intelligence/) — prebuilt models for invoices, receipts, tax forms, IDs, and general layout; custom models for domain-specific extraction. REST and SDK (Python, C#, Java, JS). GA version 4.0 (API 2024-11-30). Strong choice for Microsoft-stack environments.

Google Document AI (Google Cloud, SaaS, cloud.google.com/document-ai) — layout-aware OCR with pretrained processors for forms, invoices, and identity documents; custom extractor backed by Gemini 2.5 Pro (Preview, June 2025). Returns structured JSON including key-value pairs, tables, and bounding-box coordinates.

Mistral OCR (Mistral AI, SaaS API, docs.mistral.ai/models/ocr-3-25-12) — cloud OCR API accessed via the /v1/ocr endpoint or SDK. Latest version: OCR 3 (v25.12, December 2025). Returns Markdown with HTML table reconstruction; handles interleaved images, math, and complex layouts. Pricing: $2 per 1,000 pages ($1 with Batch API discount). Verified SaaS; model weights are not open.
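The pricing arithmetic as a small sketch, using only the two rates quoted above. The function name is hypothetical, and the rates should be rechecked against the current price list before budgeting.

```python
def ocr_cost_usd(pages: int, batch: bool = False) -> float:
    """Estimate OCR spend from the per-1,000-page rates quoted in the text."""
    rate_per_1000 = 1.0 if batch else 2.0   # USD per 1,000 pages
    return pages * rate_per_1000 / 1000


monthly_pages = 250_000
standard = ocr_cost_usd(monthly_pages)             # 500.0
batched = ocr_cost_usd(monthly_pages, batch=True)  # 250.0
```

At 250,000 pages a month, the batch rate halves the bill from $500 to $250.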

Reducto (Reducto AI, SaaS, reducto.ai) — agentic document platform providing layout-aware OCR, parse/split/extract endpoints, and schema-grounded field extraction for production agent pipelines. 1B+ pages processed; targets enterprise accuracy requirements. Paid; funding verified ($108M raised in total, including a $75M Series B).

Structured field extraction (Stage B)

Once a document is parsed to Markdown or JSON, field extraction is a structured-output problem: define a JSON Schema for the target fields, run an LLM with constrained decoding or tool-calling strict mode, and validate the result. See /resources/reliable-tool-calling for the full pattern.
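The validation step can be sketched as follows. `INVOICE_SCHEMA` is a hypothetical target schema, and `validate` is a deliberately small stand-in that checks only required fields and flat types; it is not a full JSON Schema validator, and a production pipeline would likely use one.

```python
import json

INVOICE_SCHEMA = {   # hypothetical target schema, for illustration only
    "type": "object",
    "required": ["invoice_number", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "signatory": {"type": "string"},
    },
}

_PY_TYPES = {"string": str, "number": (int, float)}


def validate(raw_json: str, schema: dict) -> tuple:
    """Return (data, errors); covers only required fields and flat types."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top level must be an object"]
    errors = [f"missing field: {name}"
              for name in schema["required"] if name not in data]
    for name, value in data.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return data, errors
```

Validating after decoding, even when constrained decoding is on, catches the case where the model returns a well-formed object with the wrong content, such as a total quoted as a string.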

VLM-based extraction (pass the page image directly to a multimodal model with a schema prompt) skips Stage A but is costlier per page and harder to debug when fields are missed.

Hard cases to test: multi-column layouts, merged table cells, rotated or skewed scans, handwritten annotations, and forms with no consistent field label placement.

For high-stakes fields (financial, medical, legal), cross-validate extracted values against OCR confidence scores and route low-confidence results to a human review queue.
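A sketch of that routing step, assuming each extracted field carries an OCR confidence on a 0.0 to 1.0 scale (some services report 0 to 100, so normalize first). `ExtractedField`, `route`, and the 0.90 cutoff are illustrative assumptions, not values from any tool above.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    ocr_confidence: float   # assumed scale: 0.0 to 1.0


REVIEW_THRESHOLD = 0.90     # assumed cutoff; tune per field and document type


def route(fields: list, threshold: float = REVIEW_THRESHOLD) -> tuple:
    """Split fields into (accepted, needs_review) by OCR confidence."""
    accepted, needs_review = [], []
    for f in fields:
        if f.ocr_confidence >= threshold:
            accepted.append(f)
        else:
            needs_review.append(f)
    return accepted, needs_review


fields = [
    ExtractedField("invoice_total", "1234.50", 0.98),
    ExtractedField("iban", "DE89 3704 0044 0532 0130 00", 0.71),
]
accepted, needs_review = route(fields)
```

Here the total is accepted and the low-confidence IBAN goes to the review queue. A stricter threshold for financial identifiers than for free-text fields is a reasonable refinement.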

The agent integration angle

Parsed document output (Markdown or JSON) feeds directly into the RAG layer. Clean structure at parse time is often the biggest lever on retrieval quality, because chunk boundaries can then follow the headings and tables the parser preserved. See /resources/rag-retrieval-for-agents for chunking strategies and embedding choices, and /resources/embeddings-vector-search for index selection.
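To illustrate why parse-time structure matters, here is a minimal heading-aware chunker for parsed Markdown. It is a sketch under simple assumptions: ATX headings (`#` to `######`) only, and no handling of `#` lines inside fenced code. The function name and the output shape are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list:
    """Split Markdown into chunks, each tagged with its heading path."""
    chunks, path, buffer = [], [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            flush()                      # close the chunk under the old heading
            level = len(m.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(m.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading path, so a retrieved passage such as "row data" arrives labeled "Guide > Tables" rather than as an orphaned fragment.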

Document content is an untrusted surface: a malicious PDF can embed prompt-injection payloads in OCR-readable text. Strip or escape instruction-like patterns from parsed output before inserting it into agent context. See /resources/agentic-security-checklist, section 4 (untrusted content handling).
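A sketch of the screening step. The pattern list is an assumed, non-exhaustive example, and pattern matching is only a partial mitigation: it catches crude payloads, not paraphrased ones. Both function names are hypothetical.

```python
import re

# Assumed, non-exhaustive patterns; a real deployment needs a maintained list.
SUSPICIOUS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the )?(system|previous) (prompt|instructions)", re.I),
    re.compile(r"you are now\b", re.I),
]

MARKER = "[removed: instruction-like text]"


def screen_parsed_text(text: str) -> tuple:
    """Replace suspicious spans with a marker and report how many were hit."""
    hits = 0
    for pattern in SUSPICIOUS:
        text, n = pattern.subn(MARKER, text)
        hits += n
    return text, hits


def wrap_untrusted(text: str) -> str:
    """Delimit document content so the prompt can treat it as data only."""
    return "<document>\n" + text.replace("</document>", "") + "\n</document>"
```

A nonzero hit count is also a useful signal on its own: logging it or routing the document to review may be safer than silently passing the cleaned text along.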

Verified sources

Docling (IBM, MIT): https://github.com/docling-project/docling
Marker (Datalab, GPL-3.0): https://github.com/datalab-to/marker
Surya OCR (Datalab, Apache 2.0): https://github.com/datalab-to/surya
Unstructured (Apache 2.0): https://github.com/Unstructured-IO/unstructured
MinerU (OpenDataLab): https://github.com/opendatalab/MinerU
LlamaParse docs: https://developers.llamaindex.ai/python/cloud/llamaparse/tiers/
AWS Textract docs: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
Azure AI Document Intelligence overview: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview
Google Document AI overview: https://cloud.google.com/document-ai/docs/overview
Mistral OCR 3 docs: https://docs.mistral.ai/models/ocr-3-25-12
Reducto agentic document platform: https://reducto.ai/
