How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh

19.9K views · Sep 19, 2024 · 18:44 min

Takeaway

Domain-specific eval systems are built bottom-up from unit tests, trace logging, and frictionless human review, not by buying generic tools or jumping straight to LLM-as-judge.

Summary

Emil Sedgh (CTO of Rechat) describes shipping 'Lucy', an AI assistant for real estate agents originally prototyped with GPT-3.5 and ReAct. Vibe checks were enough for the MVP, but progress stalled once the product had to move past the demo stage.
Hamel Husain prescribes a layered eval recipe: start with cheap unit-test assertions on observed failure modes (e.g., emails not sent, invalid placeholders), run them in CI, and log results to whatever you already have (Rechat used Metabase) rather than buying a tool.
Trace logging plus human review is non-negotiable. Rechat built a custom Gradio/Shiny-style data viewer with domain-specific filters because off-the-shelf trace tools had too much friction: 'fight as hard as you can to remove all friction from looking at data.'
Bootstrap test cases by having an LLM role-play as a real estate agent to synthetically generate inputs across features/tools for coverage.
The eval framework unlocks fine-tuning data curation 'almost for free' — failed traces become labeled training data and a feedback loop drives continuous quality improvement.
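
As a rough illustration of the first layer described above, cheap assertions on observed failure modes can be plain functions run over logged traces. The sketch below is not Rechat's code. The trace fields (`output`, `expects_email`, `tool_calls`), the `send_email` tool name, and the placeholder syntaxes matched by the regular expression are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder syntaxes such as "{{first_name}}" or "[Client Name]".
# The talk does not say which template format Lucy uses.
PLACEHOLDER_PATTERN = re.compile(r"\{\{\s*\w+\s*\}\}|\[[A-Z][A-Za-z ]*\]")


def check_no_placeholders(trace: dict) -> bool:
    """Pass when the assistant output contains no unfilled template placeholder."""
    return PLACEHOLDER_PATTERN.search(trace["output"]) is None


def check_email_sent(trace: dict) -> bool:
    """Pass when a requested email was actually sent via a tool call."""
    if not trace.get("expects_email", False):
        return True  # the check does not apply to this trace
    return "send_email" in trace.get("tool_calls", [])


# One entry per observed failure mode; new checks are added as failures are found.
CHECKS = {
    "no_placeholders": check_no_placeholders,
    "email_sent": check_email_sent,
}


def run_checks(traces: list[dict]) -> dict[str, float]:
    """Return the pass rate of every check across the given traces."""
    if not traces:
        return {}
    passed = Counter()
    for trace in traces:
        for name, check in CHECKS.items():
            passed[name] += int(check(trace))
    return {name: passed[name] / len(traces) for name in CHECKS}


if __name__ == "__main__":
    sample_traces = [
        {"output": "Hi {{first_name}}, your showing is confirmed."},
        {"output": "Email sent to Dana.", "expects_email": True,
         "tool_calls": ["send_email"]},
        {"output": "Done.", "expects_email": True, "tool_calls": []},
    ]
    for name, rate in run_checks(sample_traces).items():
        print(f"{name}: {rate:.0%}")
```

Checks like these can run in CI on every change, and the per-check pass rates can be written as rows to whatever dashboard already exists, which is the role Metabase played at Rechat.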

evals · domain-specific · human-review

Original description

Many failed AI products share a common root cause: a failure to create robust evaluation systems. Evaluation systems allow you to improve your AI quickly in a systematic way and unlock superpowers like the ability to curate data for fine-tuning. However, many practitioners struggle with how to construct evaluation systems that are specific to their problems.

In this talk, we will walk through a detailed example of how to construct domain-specific evaluation systems.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule

About Hamel
Hamel Husain started working with language models five years ago when he led the team that created CodeSearchNet, a precursor to GitHub Copilot. Since then, he has seen many successful and unsuccessful approaches to building LLM products. Hamel is also an active open source maintainer and contributor to a wide range of ML/AI projects. Hamel is currently an independent consultant.

About Emil
Emil is CTO at Rechat, where he leads the development of Lucy, an AI personal assistant designed to support real estate agents.