Everything You Need To Know About Agent Observability — Danny Gollapalli & Zubin Koticha, Raindrop

4.9K views · May 07, 2026 · 50:25 min

Takeaway

Move from offline eval sets to production monitoring built on explicit and implicit signals plus live experiments, because agents fail in ways unit tests can't anticipate.

Summary

Raindrop founders argue that evals (golden test sets) fall short for agents, whose combinations of tools, sub-agents, and memory produce undefined behavior; monitoring becomes the primary discipline.
Two signal types: explicit (error rate, latency, regenerations, cost) and implicit (classifiers for refusals, task failure, and user frustration, plus regex matching such as Claude Code's leaked user_prompt_keywords.ts).
They recommend binary issue classifiers over 1-10 LLM-as-judge scoring, and live experiments that compare a control cohort against a treatment cohort to surface the real impact of prompt, model, or tool changes.
Case: shipping prompt v2.4 dropped the user-frustration rate from 37% to 9% while tool-use count rose, which is useful diagnostic data.
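
The regex signal, binary issue flag, and cohort comparison described above can be sketched roughly as follows. This is a minimal illustration, not Raindrop's implementation: the patterns, the function names (is_frustrated, issue_rate, two_proportion_z), and the sample messages are all hypothetical. The two-proportion z-test is a standard statistical check swapped in here; the summary does not say which test the speakers use. Cohort sizes of 100 are assumed, since the summary reports only the rates.

```python
import math
import re

# Hypothetical patterns for illustration; not taken from any real product's keyword list.
FRUSTRATION_PATTERNS = [
    re.compile(r"\bnot what i (asked|wanted)\b", re.IGNORECASE),
    re.compile(r"\bstill (broken|wrong|failing|not working)\b", re.IGNORECASE),
    re.compile(r"\b(wtf|ugh|useless)\b", re.IGNORECASE),
]


def is_frustrated(message: str) -> bool:
    """Binary implicit signal: True if any frustration pattern matches."""
    return any(p.search(message) for p in FRUSTRATION_PATTERNS)


def issue_rate(messages: list[str]) -> float:
    """Fraction of messages flagged by the binary signal."""
    if not messages:
        return 0.0
    return sum(is_frustrated(m) for m in messages) / len(messages)


def two_proportion_z(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """z statistic for the difference between two issue rates (pooled variance)."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both cohorts all-flagged or all-clean: no measurable difference
    return (hits_a / n_a - hits_b / n_b) / se


# Hypothetical sample messages for a control and a treatment cohort.
control = ["This is still broken", "ugh, not what I asked for", "thanks, that works"]
treatment = ["thanks, that works", "looks good", "still wrong"]
print(issue_rate(control), issue_rate(treatment))

# Assumed cohort sizes of 100 each; only the 37% and 9% rates come from the summary.
print(two_proportion_z(37, 100, 9, 100))
```

A binary flag per message keeps the rate directly comparable across cohorts, which is harder with a 1-10 score whose meaning can drift between runs. With real traffic the cohort sizes would come from the experiment itself, and a larger absolute z value means the gap is less likely to be noise.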

observability · agents · evals

Original description

Agent failures do not look like normal software failures. In this workshop, the Raindrop team breaks down what it actually takes to monitor production agents, from explicit signals like tool errors, latency, and cost to fuzzier signals like user frustration, refusals, task failure, and capability gaps.

The session covers how to move beyond evals toward real production observability, how to use classifiers, regex, and experiments to catch regressions, and how to instrument self-diagnostics so agents can report their own failures and strange behavior. If you're running agents in production, this is a practical framework for understanding what is going wrong and how to catch it early.
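
One way self-diagnostics could be wired up is to give the agent a dedicated reporting tool and record each call as a structured event. The sketch below is an assumption-heavy illustration, not the workshop's code: the report_issue tool, its JSON-schema-style definition, the category names, and the DiagnosticsLog class are all hypothetical, and the in-memory list stands in for whatever monitoring backend is actually used.

```python
import time
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical categories, loosely following the failure types the description lists.
ALLOWED_CATEGORIES = {"tool_failure", "task_failure", "capability_gap", "strange_behavior"}

# Hypothetical JSON-schema-style tool definition the agent could be given.
REPORT_ISSUE_TOOL = {
    "name": "report_issue",
    "description": "Call this when a tool fails, you cannot do the task, or you notice odd behavior.",
    "parameters": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": sorted(ALLOWED_CATEGORIES)},
            "detail": {"type": "string"},
        },
        "required": ["category", "detail"],
    },
}


@dataclass
class DiagnosticReport:
    trace_id: str
    category: str
    detail: str
    timestamp: float = field(default_factory=time.time)


class DiagnosticsLog:
    """In-memory sink for self-reported issues (stand-in for a real backend)."""

    def __init__(self) -> None:
        self.reports: list[DiagnosticReport] = []

    def report_issue(self, trace_id: str, category: str, detail: str) -> DiagnosticReport:
        # Reject unknown categories so the counts stay comparable over time.
        if category not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        report = DiagnosticReport(trace_id, category, detail)
        self.reports.append(report)
        return report

    def counts_by_category(self) -> Counter:
        return Counter(r.category for r in self.reports)


# Hypothetical reports, as if the agent had called the tool during two traces.
log = DiagnosticsLog()
log.report_issue("trace-1", "tool_failure", "file write returned permission denied")
log.report_issue("trace-2", "capability_gap", "user asked for image editing; no such tool")
print(log.counts_by_category())
```

Restricting reports to a fixed set of categories makes them countable like any other binary signal, so a spike in one category after a prompt or tool change can be spotted early.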

Speaker info:
- https://x.com/benhylak
- https://www.linkedin.com/in/zkoticha
- https://www.linkedin.com/in/joseph-daniel-gollapalli-a371a4138/

Timestamps

0:14 Introduction and the problem of agent failures
1:48 Moving from evals to production monitoring
3:33 The two types of signals: explicit and implicit
4:47 Using classifier signals for observability
6:38 Leveraging regex for signal detection
7:30 Using experiments to validate improvements
9:42 Q&A session: Statistical relevance and experimental design
16:07 Introduction to self-diagnostics
20:15 Workshop: Coding agent demonstration
24:01 Live demo: Triggering and handling tool failure
30:26 Best practices for self-diagnostic implementation
32:20 Q&A: Real-world use cases and triage
40:02 Q&A: Managing fast-paced experimentation
44:21 Q&A: Trace visualization and data export