

Evaluating AI Agents: Benchmarks and Methods

Reference · updated 2026-06-15 · Markdown variant

Why agent eval differs from single-turn LLM eval, a verified benchmark reference table (SWE-bench, GAIA, BFCL, tau-bench, WebArena, AgentBench, MLE-bench, OSWorld), and practical evaluation methods for agent builders.


Evaluating a multi-step agent is fundamentally different from evaluating a single-turn LLM response. The gap matters: a model that scores 90% on a chat benchmark may fail badly as an agent if it cannot recover from tool errors, maintain state across steps, or complete long-horizon tasks reliably.

Why agent eval differs from single-turn LLM eval

Benchmark reference table

| Benchmark | Measures | Task domain | Metric | Maintainer |
| --- | --- | --- | --- | --- |
| SWE-bench | Resolving real GitHub issues in Python codebases end-to-end | Software engineering (12 open-source Python repos, 2,294 tasks) | % resolved (pass/fail per task) | Princeton NLP / SWE-bench org (github.com/swe-bench/SWE-bench) |
| SWE-bench Verified | Same as SWE-bench but on 500 human-validated tasks (93 developers removed 68% of originals for bad harnesses, vague specs, or environment deps) | Software engineering | % resolved | SWE-bench org + OpenAI Preparedness team (swebench.com) |
| GAIA | Generalist assistants on 466 real-world Q&A tasks requiring multi-tool chaining (web browse, file read, code execute, reasoning) across 3 difficulty levels | Generalist / cross-tool | Exact-match accuracy; human baseline ~92% vs best agents ~50–55% | Meta AI, HuggingFace, AutoGPT team (huggingface.co/spaces/gaia-benchmark/leaderboard) |
| BFCL (Berkeley Function Calling Leaderboard) | Tool/function-call correctness: single, parallel, and multi-turn calls across programming languages, evaluated via AST matching | Function / tool calling | Overall accuracy; separate live vs non-live scores | Gorilla LLM, UC Berkeley (gorilla.cs.berkeley.edu); V4 (2025) adds agentic multi-hop and error recovery |
| tau-bench | Agent–tool–user interaction: simulated customer conversations requiring API tool use and policy adherence; measures consistency across multiple trials | Customer service (airline, retail) | pass^k: probability all k i.i.d. trials succeed; highlights reliability variance | Sierra Research (github.com/sierra-research/tau-bench) |
| tau2-bench | Dual-control extension of tau-bench: agent must guide a simulated user toward a shared goal; adds telecom domain and voice/duplex evaluation | Customer service (airline, retail, telecom) | pass^k; voice and text variants | Sierra Research (github.com/sierra-research/tau2-bench) |
| WebArena | Web-browsing agents on 812 realistic tasks across self-hosted shopping, forum, GitLab, and CMS sites | Web navigation | Task success rate | CMU / web-arena-x org (github.com/web-arena-x/webarena) |
| VisualWebArena | Multimodal web agents on 910 visually grounded tasks (image + text inputs) across shopping, classifieds, Reddit | Visual web navigation | Task success rate; human baseline ~88.7%, best agents ~16–30% | CMU (github.com/web-arena-x/visualwebarena) |
| AgentBench | LLM-as-agent across 8 diverse environments: OS, database, knowledge graph, web shopping, web browsing, card game, lateral puzzles, household | Multi-environment / generalist | Overall score (normalized per environment) | THUDM / Tsinghua University (github.com/THUDM/AgentBench), ICLR 2024 |
| MLE-bench | ML engineering: agents solve 75 Kaggle competitions end-to-end (train models, prepare data, run experiments) | Machine learning engineering | % competitions earning any medal vs Kaggle human leaderboard | OpenAI (github.com/openai/mle-bench), ICLR 2025 |
| OSWorld | Computer-use agents on 369 desktop tasks across Ubuntu, Windows, macOS, using real apps (LibreOffice, VS Code, GIMP, Chrome, Thunderbird) in a VM | Desktop / GUI computer use | Task success rate (execution-based in VM snapshot) | OSWorld team (os-world.github.io) |

Evaluation methods beyond benchmarks

Trajectory / step-level evaluation

Instead of measuring only final task success, score each intermediate step: did the agent call the right tool? Were arguments correct? Did it recover from an error? Step-level scoring gives partial credit and pinpoints failure modes (bad tool selection vs bad argument generation vs bad recovery logic).
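
A minimal sketch of step-level scoring, assuming each trajectory is a list of tool calls and that a reference trajectory exists for the task. The `Step` schema, the `score_trajectory` name, and the positional alignment of steps are illustrative assumptions, not part of any benchmark harness.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    """One tool call in a trajectory (hypothetical schema for illustration)."""
    tool: str
    args: dict[str, Any] = field(default_factory=dict)
    errored: bool = False  # True if the tool returned an error for this call


def score_trajectory(actual: list[Step], expected: list[Step]) -> dict[str, float]:
    """Return per-category rates so failures can be attributed to a cause."""
    if not expected:
        raise ValueError("expected trajectory must contain at least one step")

    # Align only the calls that went through; errored calls are scored
    # separately under "recovery" so a retry does not shift the alignment.
    successful = [s for s in actual if not s.errored]
    pairs = list(zip(successful, expected))

    tool_hits = sum(a.tool == e.tool for a, e in pairs)
    # Arguments only count when the tool itself was the right one.
    arg_hits = sum(a.tool == e.tool and a.args == e.args for a, e in pairs)

    # Recovery: an errored call counts as recovered if the very next call succeeds.
    error_idxs = [i for i, s in enumerate(actual) if s.errored]
    recovered = sum(
        1 for i in error_idxs
        if i + 1 < len(actual) and not actual[i + 1].errored
    )

    n = len(expected)
    return {
        "tool_selection": tool_hits / n,
        "arguments": arg_hits / n,
        "recovery": recovered / len(error_idxs) if error_idxs else 1.0,
    }
```

Dividing by the length of the reference trajectory means a run that stops early loses credit for the missing steps. Real trajectories often admit more than one valid ordering, so positional alignment is the simplest choice here rather than the only reasonable one.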

LLM-as-judge

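An LLM evaluates another LLM's output using a rubric. This scales better than human annotation for large eval sets. Known failure modes include verbosity bias (favoring longer answers) and position bias (favoring an answer because of where it appears in the prompt).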

Use LLM-as-judge for open-ended subjective quality; use exact-match or execution-based eval for tool calls and code.
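
One common way to reduce position bias in a pairwise judge is to ask twice with the answers swapped and accept a verdict only when both orderings agree. The sketch below assumes a hypothetical `judge` callable that wraps whatever model API is in use and returns the string "first" or "second"; no real client library is implied.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")

    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"

    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped with the ordering, so treat it as position-driven.
    return "tie"
```

A judge that always picks whichever answer is shown first yields "tie" on every pair, which makes the bias visible in aggregate. This doubles judge cost and does nothing about verbosity bias, which needs a separate control such as length-matched answer pairs.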

Rubric / criteria-based grading

Define explicit scoring criteria before running the eval (correctness, safety, tool-use efficiency, instruction following). Apply the rubric consistently — either by a human or an LLM judge prompted with the rubric. Rubric-based grading reduces judge variance and produces auditable scores.
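
As an illustration, a rubric can be fixed as data before the eval runs and applied identically to every sample. The criteria below mirror the four named above; the weights, the 0 to 5 point scale, and the `grade` function are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float    # relative importance (illustrative values)
    max_points: int  # top of the scale the grader uses


# Defined once, before any outputs are graded.
RUBRIC = (
    Criterion("correctness", 0.40, 5),
    Criterion("safety", 0.30, 5),
    Criterion("tool_use_efficiency", 0.15, 5),
    Criterion("instruction_following", 0.15, 5),
)


def grade(points: dict[str, int], rubric: tuple[Criterion, ...] = RUBRIC) -> float:
    """Combine per-criterion points (from a human or LLM judge) into a 0..1 score."""
    missing = [c.name for c in rubric if c.name not in points]
    if missing:
        raise ValueError(f"grader skipped criteria: {missing}")

    total_weight = sum(c.weight for c in rubric)
    score = 0.0
    for c in rubric:
        p = points[c.name]
        if not 0 <= p <= c.max_points:
            raise ValueError(f"{c.name}: {p} is outside 0..{c.max_points}")
        score += c.weight * p / c.max_points
    return score / total_weight
```

Keeping the per-criterion points alongside the combined score is what makes the result auditable: a reviewer can see which criterion cost the sample its points.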

Ground-truth / exact-match for tool calls

For tool-call evaluation, compare the agent's emitted call against a reference: tool name match, argument schema match, and optionally argument value match. BFCL uses AST matching for this. Exact-match is the most reliable method for function-call correctness; it does not require a judge and is not subject to verbosity or position bias.
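
The three checks can be sketched as a plain dictionary comparison. This is a simplified stand-in, not BFCL's AST matcher; the call format (`name` plus `arguments`) and the `match_tool_call` function are assumptions for illustration.

```python
from typing import Any


def match_tool_call(
    emitted: dict[str, Any],
    reference: dict[str, Any],
    check_values: bool = True,
) -> dict[str, bool]:
    """Compare one emitted call to its reference: name, schema, optionally values."""
    e_args = emitted.get("arguments", {})
    r_args = reference.get("arguments", {})

    result = {
        "name": emitted.get("name") == reference.get("name"),
        # Schema match: same argument names and the same Python type for each.
        "schema": e_args.keys() == r_args.keys()
        and all(type(e_args[k]) is type(r_args[k]) for k in r_args),
    }
    if check_values:
        result["values"] = e_args == r_args

    # Overall verdict is computed from the checks above before being added.
    result["pass"] = all(result.values())
    return result
```

Setting `check_values=False` is useful when several argument values are acceptable (for example free-text search queries) and only the tool choice and argument shape should be graded.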

pass@k vs pass^k
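
pass@k is conventionally the probability that at least one of k attempts succeeds, while pass^k (the tau-bench metric in the table above) is the probability that all k attempts succeed. A minimal sketch of the usual combinatorial estimators follows, assuming n independent trials of one task with c observed successes; the function names are illustrative.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds) from c successes in n trials."""
    _check(n, c, k)
    # Chance that a random size-k subset of the n trials contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) from c successes in n trials."""
    _check(n, c, k)
    # Chance that a random size-k subset of the n trials contains only successes.
    return comb(c, k) / comb(n, k)
```

With 6 successes in 8 trials, both metrics equal 0.75 at k=1, but at k=4 pass@k is 1.0 while pass^k is 15/70, about 0.21. The same agent looks perfect under one metric and unreliable under the other, which is why reliability-focused evals report pass^k.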

Online vs offline eval

Human eval

Human evaluation is the ground truth for open-ended tasks. Use it to calibrate automated evals, not as the primary scalable signal. Human baseline scores on GAIA (~92%), WebArena (~78%), and VisualWebArena (~88.7%) mark a ceiling that agents on most public benchmarks still fall well short of.

Practical guidance

Verified sources
