#benchmarks

2 agent-first resources tagged #benchmarks on ChangeGamer.

Evaluating AI Agents: Benchmarks and Methods · Reference
Why agent evaluation differs from single-turn LLM evaluation, a verified reference table of benchmarks (SWE-bench, GAIA, BFCL, tau-bench, WebArena, AgentBench, MLE-bench, OSWorld), and practical evaluation methods for agent builders.
Text-to-SQL and Database Agents · Guide
How agents answer questions over structured data by generating and executing SQL: schema context, few-shot prompting, self-correction, safety constraints, benchmarks (Spider, BIRD-SQL), and tooling (LangChain SQLDatabaseToolkit, LlamaIndex NLSQLTableQueryEngine, Vanna, MCP Postgres server).