Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

2.3K views · May 14, 2026 · 124:18 min
Takeaway

Evals are just testing for non-deterministic systems: capture traces, read them, then write code evals, LLM-as-judge evals, and meta-evals before tuning prompts or swapping models.

Summary

  • Laurie Voss (Arize/Phoenix, ex-npm) runs a hands-on workshop: capture traces and spans, look at the actual data, then write code evals, built-in Phoenix evals, LLM-as-judge evals, and meta-evals.
  • Tests fail because users use vocabulary the agent doesn't expect; basic string matching doesn't work because the space of correct outputs is large.
  • Faithfulness evals catch hallucination drift introduced by prompt changes; eval suites enable safe model upgrades (Sonnet 4.5 → 4.6) without re-testing everything manually.
  • Real shippers (Descript, Bolt, Anthropic's Claude Code) all started fast, then institutionalized eval suites; covers datasets, experiments, pairwise eval, reliability scoring, and the data flywheel.
evals · agents · tracing
Original description
Most agents get tested by running a few queries and checking if it looks right. Laurie calls this the vibes problem: it doesn't catch regressions, doesn't run in CI, and doesn't tell you whether a prompt fix broke three other things. This workshop builds a complete eval pipeline from scratch on a financial analysis agent: tracing with Phoenix, reading traces before writing a single eval, categorizing failures by root cause, then building code evals, built-in LLM-as-a-judge evals, and a custom rubric with labeled examples.
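As a rough illustration of what a "code eval" can look like, here is a minimal sketch in plain Python. It checks a property of the answer (every requested ticker is mentioned) instead of comparing against one exact string. The function name `covers_required_tickers`, the result fields, and the sample answers are all hypothetical; this is not the workshop's code and does not use the Phoenix API.

```python
import re


def covers_required_tickers(answer: str, required: list[str]) -> dict:
    """Hypothetical code eval: pass only if every required ticker appears
    in the answer as a whole word (case-insensitive)."""
    missing = [
        ticker
        for ticker in required
        if not re.search(rf"\b{re.escape(ticker)}\b", answer, re.IGNORECASE)
    ]
    return {
        "label": "pass" if not missing else "fail",
        "score": float(not missing),  # 1.0 when nothing is missing, else 0.0
        "missing": missing,
    }


# Made-up agent answers, for illustration only.
good = covers_required_tickers(
    "AAPL revenue grew while Nvidia (NVDA) margins held.", ["AAPL", "NVDA"]
)
bad = covers_required_tickers("Apple revenue grew.", ["AAPL"])
print(good["label"], bad["label"], bad["missing"])
```

A check like this is deterministic and cheap, so it can run in CI on every prompt change. It only covers properties that can be expressed in code, which is why the workshop pairs it with LLM-as-a-judge evals.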

The sharpest lesson: choosing the right eval matters more than tuning it. A correctness eval scored 0 out of 13 on the same agent that a faithfulness eval scored 13 out of 13, because the model doesn't know what year it is and can't verify forward-looking financial data. The workshop closes on the thing most eval content skips — experiments that let you prove a prompt change actually worked, rather than eyeballing it and calling it a win.
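The idea of an experiment that proves a prompt change worked can be sketched as a per-example comparison of eval results from two runs over the same dataset. The sketch below assumes each run has already been reduced to a mapping from example ID to pass/fail; `compare_runs`, the example IDs, and the results are hypothetical and stand in for whatever the real tooling records.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare pass/fail eval results for the examples present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no examples in common")
    return {
        "baseline_pass_rate": sum(baseline[k] for k in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[k] for k in shared) / len(shared),
        # Failed before, passes now.
        "fixed": [k for k in shared if not baseline[k] and candidate[k]],
        # Passed before, fails now: the cases eyeballing tends to miss.
        "regressed": [k for k in shared if baseline[k] and not candidate[k]],
    }


# Made-up results for four dataset examples under two prompt versions.
baseline_run = {"q1": True, "q2": False, "q3": True, "q4": False}
candidate_run = {"q1": True, "q2": True, "q3": False, "q4": True}

report = compare_runs(baseline_run, candidate_run)
print(report)
```

Reporting regressions separately from the overall pass rate matters: a candidate prompt can raise the aggregate score while still breaking an example that used to pass.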

Speaker info:
- https://x.com/seldo
- https://www.linkedin.com/in/seldo/
- https://github.com/seldo