How to evaluate a model for your use case: Emmanuel Turlay

248 views · Feb 05, 2025 · 7:32 min

Takeaway

Use LLM-as-judge with task-specific rubrics and visualize score distributions — generic NLP benchmarks won't tell you which model fits your application.

Summary

Sematic/AirTrain CEO Emmanuel Turlay covers why generic LLM benchmarks (GLUE, HellaSwag, TriviaQA, ARC) and surface metrics (BLEU, ROUGE, density, coverage) don't tell you how a model performs on your task (e.g., symptom extraction, recipe-to-ingredients, schema-to-API payload).
He recommends LLM-as-judge: feed the candidate model's outputs, together with a task description and grading criteria on a 1-10 scale, into a stronger scoring model, which yields a distribution of per-example scores.
He notes that GPT-4 is the strongest judge but expensive, while Flan-T5 offers a good speed/correctness trade-off for large eval sets.
AirTrain (his company's product) lets you upload a dataset, pick models to compare (Llama 2, Falcon, Flan-T5, or your own), define properties to measure, and visualize the metric distributions side by side.
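
The LLM-as-judge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not AirTrain's implementation or anything shown in the talk: the prompt wording, the function names, and the `judge` callable are all assumptions, and `fake_judge` is a stand-in so the example runs without calling a real model. In practice `judge` would wrap whatever scoring model you choose.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical prompt template; the wording is an assumption, not from the talk.
JUDGE_TEMPLATE = """You are grading a model's output for the task below.

Task: {task}
Grading criteria (score 1-10): {criteria}

Input:
{source}

Model output:
{output}

Reply with a single integer from 1 to 10."""


def build_prompt(task: str, criteria: str, source: str, output: str) -> str:
    """Combine task description, rubric, and one example into a judge prompt."""
    return JUDGE_TEMPLATE.format(
        task=task, criteria=criteria, source=source, output=output
    )


def parse_score(reply: str) -> Optional[int]:
    """Return the first integer between 1 and 10 in the reply, or None."""
    for match in re.finditer(r"\d+", reply):
        value = int(match.group())
        if 1 <= value <= 10:
            return value
    return None


def score_outputs(
    judge: Callable[[str], str],
    task: str,
    criteria: str,
    examples: Iterable[Tuple[str, str]],
) -> List[int]:
    """Score each (input, candidate output) pair; skip unparseable replies."""
    scores = []
    for source, output in examples:
        reply = judge(build_prompt(task, criteria, source, output))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    return scores


def histogram(scores: Iterable[int]) -> str:
    """Render the score distribution as a ten-row text histogram."""
    counts = Counter(scores)
    return "\n".join(f"{s:>2} | {'#' * counts[s]}" for s in range(1, 11))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real scoring model (assumption, for demonstration only)."""
    return "Score: 8/10" if "flour" in prompt else "Score: 2/10"


if __name__ == "__main__":
    task = "List the ingredients mentioned in the recipe."
    criteria = "10 = every ingredient listed and nothing extra; 1 = unusable."
    examples = [
        ("Mix flour and eggs.", "flour, eggs"),
        ("Boil the rice in salted water.", "water"),
    ]
    scores = score_outputs(fake_judge, task, criteria, examples)
    print(histogram(scores))
```

Running the same examples through `score_outputs` once per candidate model and printing the histograms one after another gives a rough text version of the side-by-side comparison. The score parser is deliberately simple and would need hardening for judges that reply with free-form text.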

evals · llm-as-judge · airtrain

Original description

Fine-tuning LLMs requires a lot of resources, both memory and GPU, which are notoriously costly. In this talk, I will describe five ways to minimize resource usage, and to find the cheapest resources out there to fine-tune LLMs with a tight budget.

Recorded & streamed live for the AI Engineer Summit 2023. See the full schedule of talks at https://ai.engineer/summit/schedule & join us at the AI Engineer World's Fair in 2024! Get your tickets today at https://ai.engineer/worlds-fair

About Emmanuel
I started my career in academia 15 years ago doing particle physics research at CERN. I moved to the US in 2014 and joined Instacart as a Sr. Software Engineer. I led teams working on payments, orders, and MLOps. In 2018 I joined Cruise, where I started the ML Infrastructure team, which grew to about 80 engineers. In 2022, I founded Sematic, an open-source ML Infrastructure company.