
Agent Evals: Finally, With The Map

3.1K views · Feb 22, 2025 · 13:31 min
Takeaway

A complete agent eval program covers both semantic and behavioral dimensions and treats the LLM-judge layer (EvalOps) as a first-class optimization target.

Summary

  • Ari Heljakka (Root Signals) presents a map of agent evaluation split into semantic quality (representations vs reality) and behavioral quality (actions vs goals).
  • Semantic axis: single-turn qualities (coherence, safety, policy adherence) and multi-turn qualities (chat history consistency, reasoning trace evaluation); RAG faithfulness grounds outputs in reference data.
  • Behavioral axis: single-step checks (instruction following, tool selection, output format, error handling) and multi-step checks (action convergence, plan consistency); goal achievement is the ultimate utility metric.
  • Practical considerations: cost/latency optimization, tracing/debugging, offline vs online testing, and tool-specific metrics implementable as traditional API tests.
  • Coins 'EvalOps' — a special case of LLMOps — to handle the 'double tier' problem of also optimizing the LLM-as-judge layer's cost, latency, and uncertainty.
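
The summary notes that tool-specific metrics can be implemented as traditional API tests. A minimal sketch of what that might look like for two of the single-step behavioral checks (tool selection and output format) follows. It is not taken from the talk: the `ToolCall` record, the tool name `search_flights`, and the required argument keys are all hypothetical, and the checks are deterministic assertions with no LLM judge involved.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one agent step: the tool chosen and its raw arguments."""
    name: str
    arguments: str  # raw JSON string as emitted by the agent


def check_tool_selection(call: ToolCall, expected_tool: str) -> bool:
    """Single-step behavioral check: did the agent pick the expected tool?"""
    return call.name == expected_tool


def check_output_format(call: ToolCall, required_keys: set[str]) -> bool:
    """Single-step behavioral check: are the arguments a JSON object with the required keys?"""
    try:
        args = json.loads(call.arguments)
    except json.JSONDecodeError:
        return False
    return isinstance(args, dict) and required_keys.issubset(args)


def pass_rate(results: list[bool]) -> float:
    """Fraction of checks that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    # Hypothetical recorded steps for the same user request.
    calls = [
        ToolCall("search_flights", '{"origin": "HEL", "destination": "SFO"}'),
        ToolCall("search_hotels", '{"city": "SFO"}'),   # wrong tool
        ToolCall("search_flights", "origin=HEL"),       # right tool, malformed arguments
    ]
    required = {"origin", "destination"}
    results = [
        check_tool_selection(c, "search_flights") and check_output_format(c, required)
        for c in calls
    ]
    print(f"pass rate: {pass_rate(results):.2f}")
```

Checks of this kind are cheap enough to run on every trace, which leaves the LLM-judge budget for the semantic dimensions that cannot be asserted directly.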
evals · agents · llm-as-judge
Original description
A systematic and principled map of the key aspects of AI Agent Evaluation is presented. Agent Evals are often approached as a laundry list of ad hoc metrics, making it hard to plan ahead towards a comprehensive quality assurance for your agents. In contrast, this presentation directly provides you with a solid foundation for your agent evaluation roadmap, towards making your agents reliable, effective and safe. 
https://rootsignals.ai/agentevals