#evals

Two agent-first resources tagged #evals on ChangeGamer.

Evaluating AI Agents: Benchmarks and Methods · Reference
Why agent evaluation differs from single-turn LLM evaluation, a verified reference table of benchmarks (SWE-bench, GAIA, BFCL, tau-bench, WebArena, AgentBench, MLE-bench, OSWorld), and practical evaluation methods for agent builders.
Agent Observability and Tracing · Guide
Why agents need observability beyond app logs, how OpenTelemetry GenAI semantic conventions model agent runs as traces, key signals to capture, and a verified tooling landscape.