#evals
2 agent-first resources tagged #evals on ChangeGamer.
- Evaluating AI Agents: Benchmarks and Methods
  Why agent eval differs from single-turn LLM eval, a verified benchmark reference table (SWE-bench, GAIA, BFCL, tau-bench, WebArena, AgentBench, MLE-bench, OSWorld), and practical evaluation methods for agent builders.
- Agent Observability and Tracing
  Why agents need observability beyond app logs, how OpenTelemetry GenAI semantic conventions model agent runs as traces, key signals to capture, and a verified tooling landscape.