
Evaluating AI Search: A Practical Framework for Augmented AI Systems — Quotient AI + Tavily

3.1K views · Jul 29, 2025 · 20:33 min
Takeaway

Evaluating production AI search needs live LLM-judge monitoring on real traffic, not just static benchmarks like SimpleQA, because both the web and user intent keep moving.

Summary

  • Tavily processes hundreds of millions of agent search requests per month; Quotient AI built an eval framework to monitor live AI search agents without waiting on ground-truth labels or human feedback.
  • Two principles: the web (and thus ground truth) is constantly shifting, and correctness is often subjective, depending on source, timing, and user intent.
  • Offline eval starts with static datasets like OpenAI's SimpleQA but must be complemented by live monitoring because users ask malformed questions with hidden context.
  • Quotient's expert-evaluator LLMs detect objective failures (hallucinations, retrieval misses, reasoning errors) in production agent traces.
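
To make the idea of reference-free monitoring concrete, here is a minimal sketch of how a production trace could be flagged without ground-truth labels. It is not Quotient's or Tavily's implementation: the `Trace` structure, the `flag_trace` function, the flag names, and the 0.5 threshold are all hypothetical, and the `overlap_judge` function is a toy lexical stand-in for what would be an LLM-judge call in a real system.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    """One logged agent interaction (hypothetical shape)."""
    query: str
    sources: List[str]
    answer: str


# A judge scores how well `evidence` supports `claim`, in [0, 1].
# Assumption: in production this would wrap an LLM-judge call.
Judge = Callable[[str, str], float]


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_judge(claim: str, evidence: str) -> float:
    """Toy stand-in judge: fraction of claim tokens found in the evidence."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(evidence)) / len(claim_tokens)


def flag_trace(trace: Trace, judge: Judge = overlap_judge,
               threshold: float = 0.5) -> List[str]:
    """Return failure flags for a trace, using only the trace itself."""
    flags = []

    # Retrieval miss: no retrieved source looks relevant to the query.
    if not any(judge(trace.query, src) >= threshold for src in trace.sources):
        flags.append("retrieval_miss")

    # Possible hallucination: some answer sentence is unsupported by every source.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", trace.answer)
                 if s.strip()]
    for sentence in sentences:
        if not any(judge(sentence, src) >= threshold for src in trace.sources):
            flags.append("unsupported_claim")
            break

    return flags


if __name__ == "__main__":
    trace = Trace(
        query="capital of France",
        sources=["Paris is the capital of France."],
        answer="Paris is the capital of France. It has nine million bakeries.",
    )
    print(flag_trace(trace))
```

Because the judge is passed in as a plain callable, the lexical heuristic can be swapped for a model-backed scorer without changing the flagging logic, which is the property that lets checks like these run on live traffic.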
Tags: evals, search, tavily
Original description
AI search is becoming the front door to information, whether through Retrieval-Augmented Generation (RAG), Search-Augmented Generation (SAG), or custom agents that synthesize answers on top of indexed content. As users rely more heavily on these systems, evaluating their quality becomes mission-critical. But traditional metrics like precision and recall don’t capture the full picture.

In this talk, we introduce a practical evaluation framework for AI-powered search, across three dimensions:
- Are the retrieved sources relevant to the query?
- Are the sources faithfully used in the generated answer?
- And is the final answer complete?

We’ll share lessons from working with search companies and present early findings from a new benchmark evaluating popular augmented AI systems across these dimensions. Rather than ranking winners and losers, we explore where different systems excel or break down, and how these tradeoffs inform product decisions.

This talk is for AI engineers and product teams who want to build trusted, high-quality AI search experiences, and need a way to measure if it’s actually working.

About Julia Neagu
Julia is the co-founder and CEO of Quotient AI, which provides intelligent observability for AI apps by automatically detecting failures, uncovering root causes, and recommending improvements. Before Quotient, she was the Director of Data for Copilot, GitHub's AI pair programmer, where her team built the systems evaluating the large language models behind Copilot. Previously, she was the Director of Analytics at Tamr and led end-to-end quantitative modeling at Aon's Intellectual Property Solutions group. Julia has a PhD and MA in Physics from Harvard and an AB in Physics from Princeton.

About Deanna Emery
Deanna is the Founding AI Researcher at Quotient AI, where she leads research on the evaluation of Large Language Models in real-world products and applications. Before Quotient, Deanna was a Principal Data Scientist at Aon, where she led the team building language models for valuation of intellectual property assets. She began her career as a researcher at the Harvard-Smithsonian Center for Astrophysics and Caltech LIGO. Deanna has an MS in Machine Learning from UC Berkeley and a BA in Physics from Harvard University. She is passionate about diversity and inclusion in STEM: she has conducted research on diversity in named patent inventors, working with companies to measure and address diversity gaps, and she is an active board member at a STEM education non-profit.

About Maitar Asher
Maitar Asher is a founding member and Head of Engineering at Tavily, a New York–based startup developing a web infrastructure layer for AI agents.

She leads the technology build and has architected core systems—including Tavily’s intelligent caching layer and enhanced search retrieval—to power the industry’s premier search engine for large language models.

Prior to Tavily, she developed deep learning tools for PET/CT image segmentation as a Machine Learning Research Engineer at Stanford University. She holds a B.S. in Computer Science (Machine Learning) from Columbia University.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter