[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)
Takeaway
Eval is a methodology discipline (calibrate metrics with humans, expand to many signals) not a one-shot benchmark — learn from Google Search's 300-metric setup.
Summary
- David Karam and his co-presenter (both spent more than 10 years at Google Search and are now at Pi Labs) run a hands-on eval methodology workshop drawing on Google's quality-evaluation tradition.
- Pushes back on the typical '4 metrics' setup — Google Search ran 300 metrics; encourages dramatically expanding scope and calibrating metrics against human raters and user data.
- Eval methodology pipeline: vibe testing → human evals (expensive, often skipped) → code-based evals → LLM-as-judge for natural-language criteria; emphasizes scoring systems with correlated signals rather than a comprehensive metric suite from day one.
- LLM-as-judge works against the creativity of decoder models: a judge shouldn't be 'creative', so it needs careful prompt and scoring design.
- Frames evals as continuous, going all the way from offline tests through online feedback loops (thumbs up/down user data), not a one-time benchmarking task.
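As a rough illustration of the "code-based evals calibrated against human raters" idea in the summary above, here is a minimal Python sketch. The signal functions (`has_citation`, `is_concise`), the example responses, and the human labels are all hypothetical and made up for illustration; the workshop's own tooling is not shown here. The sketch scores each response with two code-based signals and then checks how well each signal correlates with the human ratings.

```python
from statistics import mean

# Hypothetical code-based signals; each maps a response to a score in [0, 1].
def has_citation(response: str) -> float:
    return 1.0 if "http" in response else 0.0

def is_concise(response: str, max_words: int = 50) -> float:
    return 1.0 if len(response.split()) <= max_words else 0.0

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

# Made-up (response, human rating) pairs standing in for a human eval set.
rated = [
    ("See https://example.com for details.", 1.0),
    ("I think it is probably fine.", 0.0),
    ("Source: http://example.org/report", 1.0),
    ("No idea.", 0.0),
]
responses = [r for r, _ in rated]
human = [h for _, h in rated]

signals = {"has_citation": has_citation, "is_concise": is_concise}

# Agreement of each signal with the human ratings on this small set.
agreement = {
    name: pearson([fn(r) for r in responses], human)
    for name, fn in signals.items()
}

for name, corr in agreement.items():
    print(f"{name}: correlation with human ratings = {corr:.2f}")
```

On this toy data, `is_concise` returns the same value for every response, so it carries no information about the human ratings. A signal like that is a candidate for redesign or removal.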
evals, metrics, methodology
Original description
One of the biggest challenges in building evals you can trust is building metrics that reliably measure goodness in your application; metrics that are highly accurate, rapid fast, and tunable to ground truth rater and user behavior. This workshop is inspired by decades of AI and machine learning development in Google Search, reinvented for the modern LLM stack by the Pi team over the past year.

In this workshop you will learn how to:
1. Brainstorm and design custom metrics tailored to your specific application needs.
2. Identify which types of signals (natural language, code, other models) work best for your use case through rapid trial and error.
3. Combine & calibrate your metrics against ground truth data using real examples from your domain.
4. Use simple tools like Google Sheets for visualizing and analyzing your inputs and outputs with those metrics.
5. Integrate your scoring models into both online workflows like agent control and offline ones like model comparison and training evaluation.

About David Karam
I'm David K. I love straddling the line between deep tech research and application development. I’ve spent a decade at Google as Product Director working on Search’s core AI and NLU systems, helping Search’s own version of “AI Engineers” develop magical applications. Around a year ago I left with my cofounder to start Pi Labs where we’re trying to bring that same spirit to the rest of the industry. Outside work I love to read, cook, and spend time in nature.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter
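The "combine & calibrate your metrics against ground truth data" step in the description can be sketched as fitting weights over several signals so that the combined score tracks human labels. The example below is only an illustration under stated assumptions: the two signal columns and the labels are made up, and plain logistic regression trained by gradient descent is a stand-in chosen here, not necessarily the method used in the workshop or by Pi Labs.

```python
import math

# Made-up rows of [signal_1, signal_2] scores and human labels (1 = good).
# In this toy data, signal_1 tracks the label and signal_2 does not.
features = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 1]]
labels = [1, 1, 0, 0, 1, 0]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit(features, labels, lr=0.5, steps=2000):
    """Fit logistic-regression weights and bias with batch gradient descent."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

w, b = fit(features, labels)

def calibrated_score(x):
    """Combined score in (0, 1), calibrated against the human labels."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

print("weights:", [round(wi, 2) for wi in w], "bias:", round(b, 2))
print("score for [1, 0]:", round(calibrated_score([1, 0]), 2))
print("score for [0, 1]:", round(calibrated_score([0, 1]), 2))
```

The same `calibrated_score` function could be used offline to compare model versions, or online as a threshold check before an agent acts on a response. The threshold itself would need to be chosen from real data.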