Everything You Need To Know About Agent Observability — Danny Gollapalli & Zubin Koticha, Raindrop
Takeaway
Move from offline eval sets to production monitoring with implicit/explicit signals and live experiments — agents fail in ways unit tests can't anticipate.
Summary
- Raindrop founders argue evals (golden test sets) don't cut it for agents whose tool/sub-agent/memory combinatorics produce undefined behavior; monitoring is the new primary discipline.
- Two signal types: explicit (error rate, latency, regenerations, cost) and implicit (classifiers for refusals, task failure, and user frustration, plus regex keyword matching as in Claude Code's leaked user_prompt_keywords.ts).
- They recommend binary issue classifiers over 1-10 LLM-as-judge scoring; live experiments that compare a control cohort against a treatment cohort surface the real impact of prompt, model, or tool changes.
- Case: shipping prompt v2.4 dropped user-frustration rate from 37% to 9% while tool-use count rose — useful diagnostic data.
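The signal and experiment ideas in the summary above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Raindrop's implementation: the keyword pattern, the function names (`is_frustrated`, `frustration_rate`, `two_proportion_z`), and the cohort size of 200 sessions per arm are all hypothetical. Only the 37% and 9% rates come from the talk. The comparison uses a standard pooled two-proportion z-test, which is one common choice and not necessarily what the speakers used.

```python
import math
import re

# Hypothetical keyword pattern; not taken from any real product's keyword list.
FRUSTRATION_PATTERN = re.compile(
    r"\b(that'?s wrong|not what i asked|still broken|useless)\b",
    re.IGNORECASE,
)


def is_frustrated(message: str) -> bool:
    """Binary implicit signal: True if the user message matches the pattern."""
    return FRUSTRATION_PATTERN.search(message) is not None


def frustration_rate(messages: list[str]) -> float:
    """Fraction of messages flagged as frustrated (0.0 for an empty list)."""
    if not messages:
        return 0.0
    return sum(is_frustrated(m) for m in messages) / len(messages)


def two_proportion_z(flagged_a: int, n_a: int, flagged_b: int, n_b: int):
    """Pooled two-proportion z-test.

    Returns (rate_a, rate_b, z, two_sided_p_value).
    """
    rate_a = flagged_a / n_a
    rate_b = flagged_b / n_b
    pooled = (flagged_a + flagged_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both cohorts are all-flagged or all-clean: no measurable difference.
        return rate_a, rate_b, 0.0, 1.0
    z = (rate_a - rate_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


if __name__ == "__main__":
    # Assumed cohort sizes: 200 sessions each (hypothetical).
    # 74/200 = 37% (control), 18/200 = 9% (treatment, e.g. prompt v2.4).
    rate_a, rate_b, z, p = two_proportion_z(74, 200, 18, 200)
    print(f"control={rate_a:.0%} treatment={rate_b:.0%} z={z:.2f} p={p:.2g}")
```

A binary flag per message keeps the metric easy to aggregate into a rate per cohort, which is what makes the control-versus-treatment comparison straightforward. With much smaller cohorts the same 37% versus 9% gap could fail to reach significance, so the cohort size matters as much as the headline rates.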
observability, agents, evals
Original description
Agent failures do not look like normal software failures. In this workshop, the Raindrop team breaks down what it actually takes to monitor production agents, from explicit signals like tool errors, latency, and cost to fuzzier signals like user frustration, refusals, task failure, and capability gaps.
The session covers how to move beyond evals toward real production observability, how to use classifiers, regex, and experiments to catch regressions, and how to instrument self-diagnostics so agents can report their own failures and strange behavior. If you're running agents in production, this is a practical framework for understanding what is going wrong and how to catch it early.
Speaker info:
- https://x.com/benhylak
- https://www.linkedin.com/in/zkoticha
- https://www.linkedin.com/in/joseph-daniel-gollapalli-a371a4138/
Timestamps
- 0:14 Introduction and the problem of agent failures
- 1:48 Moving from evals to production monitoring
- 3:33 The two types of signals: explicit and implicit
- 4:47 Using classifier signals for observability
- 6:38 Leveraging regex for signal detection
- 7:30 Using experiments to validate improvements
- 9:42 Q&A session: Statistical relevance and experimental design
- 16:07 Introduction to self-diagnostics
- 20:15 Workshop: Coding agent demonstration
- 24:01 Live demo: Triggering and handling tool failure
- 30:26 Best practices for self-diagnostic implementation
- 32:20 Q&A: Real-world use cases and triage
- 40:02 Q&A: Managing fast-paced experimentation
- 44:21 Q&A: Trace visualization and data export
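The description mentions instrumenting self-diagnostics so agents can report their own failures and strange behavior, and the timestamps include a demo of triggering and handling a tool failure. The session's actual implementation is not described on this page, so the following is only a rough sketch of the general idea. Every name in it (`DiagnosticLog`, `DiagnosticEvent`, `call_tool`, the category list, the `trace_id` field) is a hypothetical choice made for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field

# Hypothetical categories, loosely based on the failure types named above.
ALLOWED_CATEGORIES = {"tool_failure", "capability_gap", "unexpected_behavior"}


@dataclass
class DiagnosticEvent:
    category: str
    detail: str
    trace_id: str
    timestamp: float = field(default_factory=time.time)


class DiagnosticLog:
    """In-memory sink for self-reported issues (a stand-in for a real backend)."""

    def __init__(self) -> None:
        self.events: list[DiagnosticEvent] = []

    def report(self, category: str, detail: str, trace_id: str) -> str:
        """Record one event and return it as JSON.

        Could be exposed to the agent as a tool so it can report issues itself.
        """
        if category not in ALLOWED_CATEGORIES:
            # Unknown labels are kept, but filed under a catch-all category.
            category = "unexpected_behavior"
        event = DiagnosticEvent(category, detail, trace_id)
        self.events.append(event)
        return json.dumps(asdict(event))


def call_tool(tool, args: dict, log: DiagnosticLog, trace_id: str):
    """Run a tool; on any exception, log a tool_failure event and return None."""
    try:
        return tool(**args)
    except Exception as exc:
        log.report("tool_failure", f"{tool.__name__}: {exc}", trace_id)
        return None


if __name__ == "__main__":
    def flaky_search(query: str) -> str:
        raise TimeoutError("upstream timed out")

    log = DiagnosticLog()
    call_tool(flaky_search, {"query": "agent observability"}, log, trace_id="t-1")
    print(len(log.events), log.events[0].category)
```

Restricting reports to a small fixed set of categories keeps the resulting events countable, so they can feed the same rate-based monitoring as other signals.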