Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

1.3K views · May 14, 2026 · 80:07 min

Takeaway

Treat agent observability as a three-phase loop (evaluate → monitor → optimize) built on OpenTelemetry tracing and agent-specific evaluators. Because agents are non-deterministic, this is a continuous practice, not a one-time eval.

Summary

Microsoft Foundry frames agent observability as three phases: evaluate (early, during build), monitor (in production), and optimize (continuous improvement); the "gap" of the title is the one between requirements and live behavior
OpenTelemetry-based tracing in Foundry captures tool calls, messages, and workflow steps; it also works for agents built outside Foundry and brought in for observability
Built-in evaluators cover quality, safety/risk, and agent-specific metrics (tool selection accuracy, task adherence, intent resolution), with support for custom evaluators
Agents are non-deterministic, so reliability comes from ongoing evaluation, monitoring, and optimization loops rather than one-off evals, especially as you scale to many multi-agent systems
The talk points to a GitHub repo of evaluator examples and an AI Engineer Discord channel for community follow-up
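
To make "tool selection accuracy" concrete, here is a minimal sketch of what a custom evaluator for that metric could look like. It is not Foundry's built-in evaluator or its API: the `ToolCallCase` type, the `tool_selection_accuracy` function, and the sample tool names are all hypothetical, chosen only to illustrate the idea of comparing the tool an agent called against the tool it was expected to call.

```python
from dataclasses import dataclass


@dataclass
class ToolCallCase:
    """One evaluation row (hypothetical shape): expected vs. actual tool."""
    query: str
    expected_tool: str
    actual_tool: str


def tool_selection_accuracy(cases: list[ToolCallCase]) -> float:
    """Fraction of cases where the agent called the expected tool.

    Returns 0.0 for an empty dataset so callers never divide by zero.
    """
    if not cases:
        return 0.0
    correct = sum(1 for c in cases if c.actual_tool == c.expected_tool)
    return correct / len(cases)


# Illustrative data only; a real dataset would come from recorded traces.
cases = [
    ToolCallCase("What's the weather in Paris?", "get_weather", "get_weather"),
    ToolCallCase("Book a table for two", "book_restaurant", "search_web"),
    ToolCallCase("Convert 10 USD to EUR", "convert_currency", "convert_currency"),
    ToolCallCase("Email the report to Sam", "send_email", "send_email"),
]

score = tool_selection_accuracy(cases)
print(f"tool selection accuracy: {score:.2f}")
```

In practice the `actual_tool` values would be read from trace data rather than typed by hand, which is one reason tracing and evaluation are presented together.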

observability · evals · azure-foundry

Original description

Agents drift. Models change, prompts get tweaked, edge cases accumulate, and the gap between what your agent does and what you need it to do widens without you noticing. Amy and Nitya walk through Microsoft Foundry's observability stack: tracing built on OpenTelemetry, built-in evaluators for quality, safety, and agentic metrics like intent resolution and task adherence, and red teaming where a second AI attacks your agent with adversarial prompts to find vulnerabilities before your users do.
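
One simple way to notice the kind of drift described above is to compare recent evaluation scores against a recorded baseline. The sketch below assumes you already have per-run scores between 0 and 1; the `has_drifted` function, the `tolerance` parameter, and the numbers are hypothetical and are not taken from the talk or from Foundry.

```python
from statistics import mean


def has_drifted(baseline: float, recent_scores: list[float],
                tolerance: float = 0.05) -> bool:
    """True when the recent mean score is more than `tolerance` below baseline.

    With no recent scores there is nothing to compare, so report no drift.
    """
    if not recent_scores:
        return False
    return baseline - mean(recent_scores) > tolerance


# Illustrative numbers only.
stable = has_drifted(0.90, [0.88, 0.91, 0.89])
degraded = has_drifted(0.90, [0.80, 0.78, 0.82])
print(f"stable run drifted: {stable}, degraded run drifted: {degraded}")
```

A fixed tolerance is the crudest possible check; because agent outputs are non-deterministic, a real monitor would likely average over enough runs to separate noise from a genuine regression.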

The piece worth watching for is the observe skill demo. You point it at an agent with no eval dataset, no baselines, nothing. It generates the dataset, runs batch evaluations, optimizes the prompt, compares versions, and rolls back to the best one... all from a single prompt to a coding agent. The skill shows its reasoning at each step, which is where the real value is: it surfaces the failures you didn't know to look for.
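
The "compare versions and roll back to the best one" step can be pictured as picking the prompt version with the highest mean evaluation score. This is only a sketch of that one step under assumed inputs: the `best_version` function, the version labels, and the scores are hypothetical, and the observe skill's actual selection logic is not described in the source.

```python
from statistics import mean


def best_version(scores_by_version: dict[str, list[float]]) -> str:
    """Return the version with the highest mean score.

    On a tie, the version inserted first into the dict wins, because
    max() keeps the first maximum it sees and dicts preserve order.
    """
    return max(scores_by_version, key=lambda v: mean(scores_by_version[v]))


# Illustrative batch-evaluation results per prompt version.
scores_by_version = {
    "v1": [0.60, 0.70, 0.80],
    "v2": [0.90, 0.80, 0.85],
    "v3": [0.70, 0.90, 0.80],
}

deployed = "v3"
best = best_version(scores_by_version)
if best != deployed:
    print(f"rolling back from {deployed} to {best}")
else:
    print(f"keeping {deployed}")
```

Selecting on a single averaged score is a simplification; a version that wins on quality but regresses on a safety evaluator would need a stricter rule than a plain maximum.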

Speaker info:
- https://x.com/NityaNarasimhan
- https://www.linkedin.com/in/nityan/
- https://x.com/AmyKateNicho
- https://www.linkedin.com/in/amykatenicho/