How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh
Takeaway
Domain-specific eval systems are built bottom-up from unit tests, trace logging, and frictionless human review, not by buying generic tools or jumping straight to LLM-as-judge.
Summary
- Emil Sedgh (CTO of Rechat) describes shipping 'Lucy', an AI assistant for real estate agents, originally prototyped with GPT-3.5 and ReAct; vibe checks were enough for the MVP, but progress stalled once the product had to move past the demo stage.
- Hamel Husain prescribes a layered eval recipe: start with cheap unit-test assertions on observed failure modes (e.g., emails not sent, invalid placeholders), run them in CI, and log results to whatever you already have (Rechat used Metabase) rather than buying a tool.
- Trace logging plus human review is non-negotiable; Rechat built a custom Gradio/Shiny-style data viewer with domain-specific filters because off-the-shelf trace tools had too much friction: 'fight as hard as you can to remove all friction from looking at data.'
- Bootstrap test cases by having an LLM role-play as a real estate agent to synthetically generate inputs across features/tools for coverage.
- The eval framework unlocks fine-tuning data curation 'almost for free': failed traces become labeled training data, and the resulting feedback loop drives continuous quality improvement.
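The unit-test layer from the summary can be sketched roughly as follows. This is a minimal illustration, not Rechat's actual code: the Trace record, the send_email tool name, the {{...}} placeholder syntax, and both check functions are hypothetical stand-ins for the two failure modes mentioned above (emails not sent, invalid placeholders).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical trace record: one user request plus the assistant's response."""
    trace_id: str
    user_input: str
    output: str
    tool_calls: list = field(default_factory=list)


# Assumed placeholder syntax: unresolved template slots such as {{first_name}}.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def no_unresolved_placeholders(trace):
    """Pass when the output contains no leftover template placeholders."""
    return PLACEHOLDER.search(trace.output) is None


def email_sent_when_requested(trace):
    """Pass when an email request triggered the (hypothetical) send_email tool."""
    if "email" not in trace.user_input.lower():
        return True  # the check does not apply to this trace
    return "send_email" in trace.tool_calls


# One cheap assertion per failure mode observed while reading traces.
ASSERTIONS = {
    "no_unresolved_placeholders": no_unresolved_placeholders,
    "email_sent_when_requested": email_sent_when_requested,
}


def run_assertions(traces):
    """Return one row per trace listing the names of the assertions it failed."""
    rows = []
    for trace in traces:
        failed = [name for name, check in ASSERTIONS.items() if not check(trace)]
        rows.append({"trace_id": trace.trace_id, "failed": failed})
    return rows


def pass_rate(rows):
    """Fraction of traces that passed every assertion (0.0 for an empty run)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if not row["failed"]) / len(rows)


traces = [
    Trace("t1", "Email the listing to Sam", "Hi Sam, here is the listing.", ["send_email"]),
    Trace("t2", "Email the listing to Sam", "Hi {{first_name}}, here is the listing."),
    Trace("t3", "Find listings in Austin", "Here are three listings in Austin."),
]

results = run_assertions(traces)
# Failing rows are the ones worth sending to human review and later curation.
needs_review = [row for row in results if row["failed"]]
print(f"pass rate: {pass_rate(results):.2f}, traces to review: {len(needs_review)}")
```

The rows returned by run_assertions are plain dictionaries, so in a CI job they could be written to whatever database or dashboard is already in place, and the failing rows give reviewers a short queue of traces to inspect first.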
evals, domain-specific, human-review
Original description
Many failed AI products share a common root cause: a failure to create robust evaluation systems. Evaluation systems allow you to improve your AI quickly in a systematic way and unlock superpowers like the ability to curate data for fine-tuning. However, many practitioners struggle with how to construct evaluation systems that are specific to their problems. In this talk, we will walk through a detailed example of how to construct domain-specific evaluation systems.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025
About Hamel
Hamel Husain started working with language models five years ago when he led the team that created CodeSearchNet, a precursor to GitHub CoPilot. Since then, he has seen many successful and unsuccessful approaches to building LLM products. Hamel is also an active open source maintainer and contributor of a wide range of ML/AI projects. Hamel is currently an independent consultant.
About Emil
Emil is CTO at Rechat, where he leads the development of Lucy, an AI personal assistant designed to support real estate agents.