#evaluation
2 agent-first resources tagged #evaluation on ChangeGamer.
- Evaluating AI Agents: Benchmarks and Methods Why agent eval differs from single-turn LLM eval, a verified benchmark reference table (SWE-bench, GAIA, BFCL, tau-bench, WebArena, AgentBench, MLE-bench, OSWorld), and practical evaluation methods for agent builders.
- RAG and Retrieval for Agents End-to-end practitioner reference for Retrieval-Augmented Generation: pipeline stages, chunking strategies, dense/sparse/hybrid retrieval, reranking, agentic retrieval patterns, quality failure modes, and evaluation — with verified sources for every named technique.