Agent Evals: Finally, With The Map

3.1K views · Feb 22, 2025 · 13:31 min

Takeaway

A complete agent eval program covers both semantic and behavioral dimensions and treats the LLM-judge layer (EvalOps) as a first-class optimization target.

Summary

Ari Heljakka (Root Signals) presents a map of agent evaluation split into semantic quality (representations vs reality) and behavioral quality (actions vs goals).
Semantic axis: single-turn virtues (coherence, safety, policy adherence) and multi-turn virtues (chat history consistency, reasoning trace evaluation); RAG faithfulness grounds outputs in reference data.
Behavioral axis: single-step checks (instruction following, tool selection, output format, error handling) and multi-step checks (action convergence, plan consistency); goal achievement is the ultimate utility metric.
Practical considerations: cost/latency optimization, tracing/debugging, offline vs online testing, and tool-specific metrics implementable as traditional API tests.
Coins 'EvalOps', a special case of LLMOps, to handle the 'double tier' problem: the LLM-as-judge layer's own cost, latency, and uncertainty also need to be optimized.
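To make two of the points above concrete, here is a minimal Python sketch. It is not taken from the talk: the `AgentStep` and `JudgeCall` record formats, the function names, and the use of score spread as a stand-in for judge uncertainty are all illustrative assumptions. The first half shows single-step behavioral checks (tool selection, output format) written as ordinary deterministic tests; the second half shows the kind of bookkeeping an EvalOps layer might keep about the judge itself.

```python
import json
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AgentStep:
    """One recorded agent step (hypothetical trace format)."""
    tool_name: str   # tool the agent chose for this step
    raw_output: str  # raw text the agent emitted for this step


def check_tool_selection(step: AgentStep, expected_tool: str) -> bool:
    """Single-step behavioral check: did the agent pick the expected tool?"""
    return step.tool_name == expected_tool


def check_output_format(step: AgentStep, required_keys: set[str]) -> bool:
    """Single-step behavioral check: is the output a JSON object
    containing every required key? No LLM judge is needed here."""
    try:
        payload = json.loads(step.raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_keys.issubset(payload.keys())


@dataclass
class JudgeCall:
    """One LLM-as-judge invocation (hypothetical record format)."""
    score: float      # judge score, assumed to lie in [0, 1]
    latency_s: float  # wall-clock latency of the judge call, in seconds
    cost_usd: float   # cost of the judge call, in US dollars


def summarize_judge(calls: list[JudgeCall]) -> dict[str, float]:
    """Aggregate the judge layer's own cost, latency, and score spread.
    The population standard deviation of repeated scores is used as a
    crude proxy for judge uncertainty (an assumption of this sketch)."""
    scores = [c.score for c in calls]
    return {
        "mean_score": mean(scores),
        "score_spread": pstdev(scores),
        "mean_latency_s": mean([c.latency_s for c in calls]),
        "total_cost_usd": sum(c.cost_usd for c in calls),
    }


if __name__ == "__main__":
    step = AgentStep(tool_name="search", raw_output='{"query": "x", "top_k": 3}')
    print(check_tool_selection(step, "search"))
    print(check_output_format(step, {"query", "top_k"}))
    # The same item judged twice, to expose disagreement between runs.
    print(summarize_judge([JudgeCall(0.8, 1.0, 0.25), JudgeCall(0.6, 3.0, 0.25)]))
```

Keeping the deterministic checks separate from the judge statistics mirrors the split in the summary: tool-specific metrics can run as cheap API-style tests, while anything scored by an LLM judge carries its own cost, latency, and variance that can be tracked and tuned.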

evals · agents · llm-as-judge

Original description

A systematic and principled map of the key aspects of AI Agent Evaluation is presented. Agent Evals are often approached as a laundry list of ad hoc metrics, making it hard to plan ahead towards a comprehensive quality assurance for your agents. In contrast, this presentation directly provides you with a solid foundation for your agent evaluation roadmap, towards making your agents reliable, effective and safe. 
https://rootsignals.ai/agentevals