#benchmarks
2 agent-first resources tagged #benchmarks on ChangeGamer.
- Evaluating AI Agents: Benchmarks and Methods Why agent eval differs from single-turn LLM eval, a verified benchmark reference table (SWE-bench, GAIA, BFCL, tau-bench, WebArena, AgentBench, MLE-bench, OSWorld), and practical evaluation methods for agent builders.
- Text-to-SQL and Database Agents How agents answer questions over structured data by generating and executing SQL: schema context, few-shot prompting, self-correction, safety constraints, benchmarks (Spider, BIRD-SQL), and tooling (LangChain SQLDatabaseToolkit, LlamaIndex NLSQLTableQueryEngine, Vanna, MCP Postgres server).