Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen
Takeaway
llmeval gives teams a local, Hydra-configurable CLI for testing prompt and model changes across providers, with each test sampled multiple times so reported pass rates stay stable.
Summary
- Niklas Nielsen (CTO, log10) introduces llmeval, a CLI tool that scaffolds a prompts/tests/metrics folder structure and runs evals locally in four commands.
- Built on Meta's Hydra config framework so test criteria, metrics, and model providers are configurable; metrics are arbitrary Python functions that can themselves call LLMs.
- Defaults to five samples per test for stability, supports overriding to single-sample runs, and generates HTML-style reports highlighting pass/fail criteria across models.
- Demos a 'what is a+b' test across Claude, GPT-4, and GPT-3.5, showing how stripping leading whitespace (Claude tends to add it) and prompt tweaks change pass rates.
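The metric and sampling ideas in the summary can be illustrated with a small, self-contained Python sketch. It does not use the tool's actual API: the function names `exact_match` and `pass_rate`, their parameters, and the sample outputs are all hypothetical, made up here to show how a plain Python metric, a whitespace-stripping option, and a five-sample pass rate might fit together.

```python
from typing import Callable, List


def exact_match(output: str, expected: str, strip: bool = True) -> bool:
    """Hypothetical metric: pass when the model output equals the expected answer.

    With strip=True, leading and trailing whitespace is removed first, which
    mirrors the whitespace fix described in the demo.
    """
    candidate = output.strip() if strip else output
    return candidate == expected


def pass_rate(
    samples: List[str],
    expected: str,
    metric: Callable[..., bool] = exact_match,
    **metric_kwargs,
) -> float:
    """Fraction of sampled outputs that satisfy the metric."""
    if not samples:
        raise ValueError("at least one sample is required")
    passed = sum(1 for s in samples if metric(s, expected, **metric_kwargs))
    return passed / len(samples)


# Five made-up outputs for a "what is 2+3" test (illustrative data, not real model output).
samples = [" 5", "5", " 5", "5\n", "five"]

strict = pass_rate(samples, "5", strip=False)   # only the exact "5" passes
lenient = pass_rate(samples, "5", strip=True)   # whitespace variants pass too

print(f"strict pass rate:  {strict:.1f}")
print(f"lenient pass rate: {lenient:.1f}")
```

In this made-up data, a single-sample run could land on either a passing or a failing output, which is why aggregating over several samples gives a steadier picture of a prompt or model change.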
evals, llmeval, tools
Original description
Recorded & streamed live for the AI Engineer Summit 2023. See the full schedule of talks at https://ai.engineer/summit/schedule & join us at the AI Engineer World's Fair in 2024! Get your tickets today at https://ai.engineer/worlds-fair
About Niklas Nielsen
Nik was most recently Head of Product at MosaicML (acq. Databricks for $1.3B). Prior to that he worked at Intel and Mesosphere on building Distributed Systems, and at Adobe on the Virtual Machines and Compilers team. He co-founded CustomerDB, a startup applying AI to product management.