Multimodal Agents: Vision, Documents, and Screens

Guide · updated 2026-06-16 · Markdown variant

How agents perceive and reason over images: VLM mechanics, image-input APIs across major providers, open-weight VLM families, grounding/pointing, failure modes, and practical guidance for agent builders.

Multimodal agents extend the standard text-in / text-out loop with image input. A vision-language model (VLM) accepts one or more images alongside the text prompt and reasons over both jointly. This unlocks agent use cases that are impossible with text alone: reading UI screenshots, interpreting charts and diagrams, parsing scanned documents, and answering visual questions about photos or camera feeds.

This guide covers how image input works at the API level, what VLMs can and cannot do, the verified open-weight families, and practical guidance for building reliable pipelines. Computer-use (acting on UIs via screenshots) is covered at /resources/computer-use-browser-automation; document OCR pipelines are at /resources/document-extraction-for-agents.

What "multimodal" means for agents

A VLM fuses a vision encoder (typically a ViT-based model that converts an image to patch embeddings) with a language model backbone. The image embeddings are projected into the same token space as text, so the LLM sees a mixed sequence of vision tokens and text tokens and attends over both simultaneously.

Common agent use cases for image input:

UI / screenshot reading — parsing a rendered screen to identify elements, extract text, or determine what to click next. See /resources/computer-use-browser-automation.
Chart and diagram interpretation — reading bar charts, line graphs, flowcharts, or architectural diagrams embedded in reports.
Document and PDF parsing — passing page images to a VLM instead of (or alongside) an OCR step. See /resources/document-extraction-for-agents.
Photo and scene understanding — answering questions about real-world images: "Is the product label intact?", "How many items are on the shelf?"
Visual QA in agentic loops — using an image as evidence that a prior action succeeded (e.g., a confirmation screenshot after form submission).

How image input works at the API level

OpenAI (GPT-4o and variants)

Images are passed inside the messages array as content blocks of type image_url. Two delivery methods: a fully qualified HTTPS URL, or a base64-encoded data URL (data:image/png;base64,...). An optional detail parameter accepts "low", "high", or "auto". Low detail processes a fixed low-resolution version of the image (cheaper); high detail tiles the image into 512×512 pixel segments and processes each tile separately, paying more tokens proportional to image area. The auto setting lets the model choose. Images can also be referenced via the Files API using a file ID. Supported formats: JPEG, PNG, GIF, WebP. Source: platform.openai.com/docs/guides/images-vision.

Anthropic Claude (3/4 families)

Images are passed as image content blocks inside the messages array. Three delivery methods: base64-encoded image data with an explicit media_type (image/png, image/jpeg, image/gif, image/webp), a URL reference to a hosted image, or a file ID from the Files API. The API accepts up to 100 images per request (up to 20 on claude.ai). Images larger than 8000×8000 px are rejected. Large images are resized to fit the model's native resolution before processing; resolution is padded to a multiple of 28 pixels. Source: docs.anthropic.com/en/docs/build-with-claude/vision.

Google Gemini

Images are passed as inline data parts (base64 with MIME type) or via the File API (recommended for files larger than 20 MB or for reusing images across requests). Gemini models were designed multimodal from the ground up; all current Gemini models accept image input alongside text in a unified context. The File API also supports video, audio, and PDF inputs. Source: ai.google.dev/gemini-api/docs/image-understanding.

Token-cost impact

Image tokens are significantly more expensive than text tokens per pixel of information. Qualitatively: larger images, higher detail settings, and tile-based processing each increase the token count (and cost) proportionally. Downscaling images to the minimum resolution needed for the task is the primary cost lever. Exact token formulas differ by provider and model version — check provider pricing pages for current figures before building cost models.

Capabilities and limits

What VLMs can do well

Native image understanding — reading text in images (OCR-class), describing scenes, answering "what is in this image?" questions.
Chart and diagram comprehension — extracting values from bar charts, understanding flowcharts, describing architectural diagrams.
Multi-image reasoning — comparing two images, spotting differences, tracking changes across a sequence of screenshots.
Grounding and pointing — some models can return bounding-box coordinates or pixel points in response to "where is X in this image?". Molmo 2 (AI2) and Qwen3-VL both support returning spatial coordinates. This enables downstream click-targeting without a separate detection model.
Video frames — most major provider models accept video frames or short clips; check current documentation for frame-count and duration limits.

Known failure modes

Failure mode	Description
Small or low-contrast text	Dense fine-print, watermarks, or text on complex backgrounds frequently mis-read. Combine OCR + VLM for precision-critical fields.
Dense tables	Complex spanning cells and merged headers are prone to misalignment or omission. Verify extracted table content against source.
Spatial reasoning	Counting objects, estimating distances, or reasoning about exact pixel positions is unreliable without explicit grounding support.
Hallucinated visual details	VLMs sometimes describe objects or text that are not present. Treat image-derived facts as unverified until cross-checked.
Prompt injection via images	Adversarial instructions embedded as text within an image enter the model's attention pathway and may override system-prompt constraints. See /resources/agentic-security-checklist.

Verified open-weight VLM families

All families below are confirmed to exist and have open weights on Hugging Face. No benchmark rankings are published here — rankings shift frequently and depend heavily on task and resolution.

Qwen3-VL (Alibaba/QwenLM, Apache 2.0) — 2B/4B/8B/32B dense plus MoE variants (30B-A3B, 235B-A22B). Supports native image and video input; returns spatial coordinates for grounding tasks. Hugging Face: huggingface.co/Qwen/Qwen3-VL-8B-Instruct (and other sizes).
Llama 3.2 Vision (Meta, Llama Community License) — 11B and 90B vision-language models released September 2024, built on the Llama 3.1 backbone. Hugging Face: huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct.
Molmo 2 (Allen Institute for AI / AI2, permissive open license) — 4B and 8B models (Qwen 3 backbone) plus a 7B (OLMo backbone). Specializes in precise spatial grounding and pointing: the model can return pixel-level coordinates indicating where an object or element is located in an image. Video and multi-image understanding built in. Released December 2025. Hugging Face: huggingface.co/allenai/Molmo2-8B.
Pixtral (Mistral AI, model weights available) — Pixtral Large is a 124B open-weight multimodal model built on Mistral Large 2. Strong image understanding; weights released under Mistral's license. Mistral AI: mistral.ai/news/pixtral-large/ (note: listed as deprecated on Mistral's site as of mid-2026 — verify before using in new projects).
InternVL3 (Shanghai AI Lab / OpenGVLab, MIT license) — successor to InternVL 2 and 2.5; sizes from 1B to 78B. Supports multi-image input, document understanding, and tool use. Hugging Face: huggingface.co/OpenGVLab/InternVL3-78B.
Gemma 3 / Gemma 4 (Google) — Gemma 3 (1B/4B/12B/27B, multimodal with image input, released March 2025) under the custom Gemma Terms of Use; Gemma 4 (multimodal, released April 2026) is the first Gemma released under true Apache 2.0 (see /resources/open-weight-models-for-agents). Available on Hugging Face under the Google organization.

Practical guidance for agent builders

Control image size and cost. Downscale images to the minimum resolution that preserves the information your task needs before sending them to the API. A screenshot for UI parsing rarely needs more than 1280px on the long edge. Unnecessary resolution multiplies token cost with no quality gain.

Prefer structured prompts for extraction tasks. Instead of "describe this image," use prompts like "List every line item, quantity, and unit price visible in this invoice table." Specificity reduces hallucinated detail and improves parseable output.

Combine OCR + VLM for precision-critical fields. For financial, medical, or legal documents where small text or dense tables matter, run a dedicated OCR pass (see /resources/document-extraction-for-agents) and feed the OCR text alongside the page image. The VLM then reasons over both, using the OCR text as a ground-truth anchor while the image provides layout context.

Validate extracted facts. Treat all image-derived values as unverified until cross-checked against a schema or a second pass. A VLM hallucinating a dollar amount or a part number is a real production failure mode.

Use structured outputs for downstream consumption. Pair VLM calls with strict JSON Schema constraints (see /resources/reliable-tool-calling) so that extracted fields arrive in a parseable, typed form rather than free text.

Treat image content as untrusted. Text visible in an image — from a rendered web page, a user-uploaded screenshot, or a scanned document — can contain adversarial prompt-injection payloads. The model processes that text as part of its context and may act on embedded instructions. Apply the same untrusted-input mitigations as for text from the web. See /resources/agentic-security-checklist (section 1, prompt injection).

Cross-links: /resources/computer-use-browser-automation (acting on UI screenshots) · /resources/document-extraction-for-agents (OCR + document parsing pipelines) · /resources/reliable-tool-calling (structured output from VLM extraction) · /resources/agentic-security-checklist (prompt injection via images)

Verified sources

OpenAI images and vision docs: https://platform.openai.com/docs/guides/images-vision
Anthropic Claude vision docs: https://docs.anthropic.com/en/docs/build-with-claude/vision
Google Gemini image understanding docs: https://ai.google.dev/gemini-api/docs/image-understanding
Qwen3-VL-8B-Instruct (Hugging Face): https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
Llama 3.2 Vision (Hugging Face, Meta): https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct
Molmo 2 announcement (AI2): https://allenai.org/blog/molmo2
Molmo2-8B (Hugging Face, AI2): https://huggingface.co/allenai/Molmo2-8B
Pixtral Large announcement (Mistral AI): https://mistral.ai/news/pixtral-large/
InternVL3-78B (Hugging Face, OpenGVLab): https://huggingface.co/OpenGVLab/InternVL3-78B
Gemma 3 multimodal guide (Roboflow): https://blog.roboflow.com/gemma-3/
Image-based prompt injection research (CSA, 2026): https://labs.cloudsecurityalliance.org/research/csa-research-note-image-prompt-injection-multimodal-llm-2026/

#multimodal #vision #vlm #images #ocr #grounding #agents #open-weight

Category: Guide