Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen

462 views · Feb 05, 2025 · 9:32 min

Takeaway

llmeval gives teams a local, Hydra-configurable CLI for testing prompt and model changes across providers, with multi-sample runs that keep the reported results stable.

Summary

Niklas Nielsen (CTO, log10) introduces llmeval, a CLI tool that scaffolds a prompts/tests/metrics folder structure and runs evals locally in four commands.
The tool is built on Meta's Hydra config framework, so test criteria, metrics, and model providers are all configurable; metrics are arbitrary Python functions that can themselves call LLMs.
It defaults to five samples per test for stability, can be overridden to run a single sample, and generates HTML-style reports that highlight which criteria pass or fail for each model.
The demo runs a 'what is a+b' test across Claude, GPT-4, and GPT-3.5 and shows how stripping leading whitespace (which Claude tends to add) and small prompt tweaks change the pass rates.
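As a rough illustration of the ideas above (a metric as a plain Python function, several samples per test, and the effect of stripping leading whitespace), here is a minimal sketch. It does not use llmeval's actual API: the names exact_match and pass_rate and the sample outputs are hypothetical, made up for this example.

```python
from typing import Callable, List


def exact_match(actual: str, expected: str, strip_leading: bool = True) -> bool:
    """Hypothetical metric: True when the model output equals the expected answer."""
    if strip_leading:
        # Some models prepend a space or newline; drop it before comparing.
        actual = actual.lstrip()
    return actual == expected


def pass_rate(
    outputs: List[str],
    expected: str,
    metric: Callable[..., bool] = exact_match,
    **metric_kwargs,
) -> float:
    """Fraction of sampled outputs for which the metric passes."""
    results = [metric(output, expected, **metric_kwargs) for output in outputs]
    return sum(results) / len(results)


# Five made-up samples for a "what is 2+3" style test.
# Two of them carry leading whitespace.
samples = [" 5", "5", "\n5", "5", "5"]

strict = pass_rate(samples, "5", strip_leading=False)
lenient = pass_rate(samples, "5", strip_leading=True)

print(f"without stripping: {strict:.0%}")
print(f"with stripping:    {lenient:.0%}")
```

Scoring five samples rather than one gives a pass rate instead of a single pass/fail, which is less sensitive to one unlucky completion.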

evals · llmeval · tools

Original description

Recorded & streamed live for the AI Engineer Summit 2023. See the full schedule of talks at https://ai.engineer/summit/schedule & join us at the AI Engineer World's Fair in 2024! Get your tickets today at https://ai.engineer/worlds-fair

About Niklas Nielsen
Nik was most recently Head of Product at MosaicML (acq. Databricks for $1.3B). Prior to that he worked at Intel and Mesosphere on building Distributed Systems, and at Adobe on the Virtual Machines and Compilers team. He co-founded CustomerDB, a startup applying AI to product management.