Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft
Takeaway
Use the AI Toolkit for VS Code and the Azure AI Evaluation SDK to spot-check models, then scale up to dataset-driven evals, treating evaluation as a layered, application-level concern.
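The spot-check-then-scale progression can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SDK's API: `exact_match_evaluator`, `fake_agent`, `spot_check`, `batch_eval`, and the `query`/`ground_truth` field names are all hypothetical, and exact-match scoring stands in for the SDK's built-in quality evaluators. The point is that the hand-checked single case and the batch run share one scoring function, so the batch score means the same thing as the spot check.

```python
from typing import Callable


# Hypothetical evaluator: scores one response against a reference answer.
def exact_match_evaluator(*, response: str, ground_truth: str) -> dict:
    # Normalize case and surrounding whitespace before comparing.
    passed = response.strip().lower() == ground_truth.strip().lower()
    return {"exact_match": 1.0 if passed else 0.0}


# Hypothetical stand-in for a real agent or model call.
def fake_agent(query: str) -> str:
    answers = {"capital of france?": "Paris", "2 + 2?": "4"}
    return answers.get(query.lower(), "I don't know")


def spot_check(agent: Callable[[str], str], query: str, ground_truth: str) -> dict:
    # One case, inspected by hand: the manual, one-prompt-at-a-time step.
    response = agent(query)
    score = exact_match_evaluator(response=response, ground_truth=ground_truth)
    return {"query": query, "response": response, **score}


def batch_eval(agent: Callable[[str], str], dataset: list[dict]) -> dict:
    # Same evaluator applied to every row; returns per-row results and the mean score.
    rows = [spot_check(agent, row["query"], row["ground_truth"]) for row in dataset]
    mean = sum(r["exact_match"] for r in rows) / len(rows) if rows else 0.0
    return {"rows": rows, "exact_match_mean": mean}


if __name__ == "__main__":
    print(spot_check(fake_agent, "Capital of France?", "Paris"))
    dataset = [
        {"query": "Capital of France?", "ground_truth": "Paris"},
        {"query": "2 + 2?", "ground_truth": "4"},
        {"query": "Largest ocean?", "ground_truth": "Pacific"},
    ]
    print(batch_eval(fake_agent, dataset)["exact_match_mean"])
```

Writing the evaluator as a plain function that takes named fields and returns a dict roughly mirrors how custom code-based evaluators are described in the Azure AI Evaluation SDK documentation, so a function like this could later be handed to the SDK's `evaluate()`; check the docs for the exact column-mapping options before relying on that.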
Summary
- Cedric Vidal frames evaluation around four layers: the model, the safety system, the system message and grounding, and the user experience; real safety comes from layering application-level mitigations on top of the model.
- Demos VS Code's AI Toolkit extension for side-by-side model comparison (GPT-4.1 vs. GPT-4o on a panna cotta recipe), agent creation with the Playwright MCP server, and dataset-driven batch evaluation.
- Builds a Luma event-page extraction agent that combines information from Reactor and Luma pages; its outputs are rated thumbs up/down, and the ratings are exported as JSONL for automated evaluation harnesses.
- Recommends evaluating from the very start of a project rather than after the agent is built; the Azure AI Evaluation SDK then lets you scale beyond local batch runs.
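The thumbs-up/down-to-JSONL loop from the Luma extraction demo can be sketched with the standard library. The record fields (`query`, `response`, `rating`), the sample rows, and the helper names `export_jsonl` and `load_regression_set` are hypothetical and do not reproduce the AI Toolkit's actual export format. The sketch writes one JSON object per line, then reads the file back and keeps thumbs-up rows as reference answers for a regression set.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical feedback records: one agent output plus a human thumbs-up/down rating.
feedback = [
    {"query": "Extract the event title", "response": "Example AI Meetup", "rating": "up"},
    {"query": "Extract the venue", "response": "Unknown", "rating": "down"},
    {"query": "Extract the speaker", "response": "Example Speaker", "rating": "up"},
]


def export_jsonl(records: list[dict], path: Path) -> None:
    # JSON Lines: one JSON object per line, no enclosing array.
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_regression_set(path: Path) -> list[dict]:
    # Keep thumbs-up rows; their responses become reference answers ("ground_truth").
    rows = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            if rec.get("rating") == "up":
                rows.append({"query": rec["query"], "ground_truth": rec["response"]})
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "feedback.jsonl"
        export_jsonl(feedback, path)
        print(load_regression_set(path))
```

Thumbs-down rows are dropped here only to keep the sketch short; in practice they are often the most valuable cases, since adding a corrected reference answer turns each one into a regression test. The Azure AI Evaluation SDK's `evaluate()` accepts JSONL datasets, so a file like this could feed it once the column names line up with the evaluators being used.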
Tags: evals, azure, agents
Original description
As AI agents transition from experimental assistants to critical components of enterprise workflows, reliably evaluating their performance becomes essential. But how do you systematically measure an AI agent's capabilities, contextual understanding, and accuracy across diverse scenarios? In this talk, we'll dive deep into the Azure AI Evaluation SDK, an innovative tool designed to rigorously assess agentic applications. Learn how to create powerful evaluations using structured test plans, scenarios, and advanced analytics that pinpoint strengths and expose hidden weaknesses. Through practical examples and real-world case studies, you'll discover how companies are already leveraging this SDK to enhance agent trustworthiness, reliability, and performance. Whether you're developing conversational agents, data-driven decision-makers, or autonomous workflow orchestrators, this session equips you with the techniques and insights needed to ensure your AI solutions deliver exceptional value and exceed user expectations.
About Cedric Vidal
Cedric Vidal is a Principal AI Advocate at Microsoft, specializing in Generative AI 🤖 and in the startup 🚀 and research 🔬 ecosystems. He is dedicated to promoting AI in startups and helping research and startup products reach the market. If you're an AI startup founder or engineer, I'd like to feature your work, so come talk to me.
Before his current role, Cedric spent 4 years as an Engineering Manager in the AI data-labeling space for the self-driving 🚕 industry at Argo AI (now re-spawned as Latitude AI). He also served as CTO of the fintech AI SaaS startup Quicksign and, for 10 years, worked as a software engineering services consultant for major fintech enterprises.
Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter