Judging LLMs: Alex Volkov

2.3K views · Sep 09, 2024 · 18:39 min · Watch on YouTube ↗

Takeaway

Production LLM apps must log/trace everything from day one, and use a layered eval stack (programmatic + human + LLM-judge) rather than skipping straight to fine-tuning.

Summary

Alex Volkov (W&B AI evangelist, ThursdAI host) frames the talk as a courtroom drama judging hypothetical AI engineers, then teaches LLM eval best practices.
Three eval methods: programmatic (numerical comparison, assertions like HumanEval — cheap, scales, misses multi-turn); human-in-the-loop (vibe testing during development); LLM-as-judge (most scalable for natural language).
Case 'turbulence on production' references the real Air Canada chatbot lawsuit — programmatic assertions can't catch customer-facing reasoning failures; you need human-in-the-loop evals.
Case 'premature fine-tuning' warns against jumping to fine-tuning before iterating on prompts/chain-of-thought/flow engineering/SPI/MOA — W&B Models integrates with OpenAI, Mistral, Together, Axolotl, HF Trainer fine-tuning.
Promotes W&B Weave for one-line tracing/logging of LLM calls (decorator), tracking inputs/outputs, system prompts, temperature, multi-turn conversations.

evalsweights-biasestracing

Original description

All rise! The honorable LLM Judge presiding. On the docket today, many cases of AI Engineers building with LLMs, without the ability to iterate, evaluate and improve their products.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Alex
Alex Volkov is an AI Evangelist at Weights & Biases as well as the founder and host of ThursdAI, a weekly newsletter and podcast that explores the latest innovations in AI, their practical applications, and the open-source AI community. Alex is an AI startup founder with 20 years of full-stack software engineering experience, offering a deep well of insights into AI innovation. He’s celebrated for his ability to clarify and summarize the complexities of the rapid AI advances and advocating for its beneficial uses.