Evals Are Not Unit Tests — Ido Pesok, Vercel v0
Takeaway
Treat application evals as statistical measurements over real user traffic, not unit tests — prompt tweaks alone never close the demo-to-prod gap.
Summary
- Ido Pesok (Vercel v0) uses a 'fruit letter counter' app to show why prompt engineering alone never reaches reliability — GPT-4.1 will pass 10 manual tests then fail on user input #11.
- v0 just crossed 100M messages and recently shipped GitHub sync (push generated code, pull changes, switch branches, open PRs).
- Argues application-layer evals are not unit tests: they're probabilistic, must reflect real user data, and need continuous monitoring rather than pass/fail gates.
- Demonstrates the demo-to-prod gap: AI apps look great in demos but hallucinate once real users hit them, so evals must be data-driven and adversarial.
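One way to picture "statistical measurement instead of pass/fail" is a pass rate computed over repeated runs of a fixed dataset. The sketch below is a minimal illustration in Python, not the harness shown in the talk. `flaky_counter` is a hypothetical stand-in for a real model call, the `CASES` list is invented for illustration (real evals would sample actual user queries), and the 20% error rate is an arbitrary assumption.

```python
import random
from typing import Callable, NamedTuple


class Case(NamedTuple):
    word: str      # fruit name the user asks about
    letter: str    # letter to count
    expected: int  # ground-truth count


# Constants live in the data: inputs plus expected answers.
# Illustrative only; a real dataset would come from user traffic.
CASES = [
    Case("strawberry", "r", 3),
    Case("banana", "a", 3),
    Case("pineapple", "p", 3),
    Case("cherry", "r", 2),
]


def flaky_counter(word: str, letter: str, rng: random.Random, error_rate: float) -> int:
    """Hypothetical stand-in for an LLM call: off by one with probability error_rate."""
    truth = word.count(letter)
    return truth + 1 if rng.random() < error_rate else truth


def pass_rate(task: Callable[[str, str], int], cases: list[Case], trials: int) -> float:
    """Run every case `trials` times and return the fraction of correct answers."""
    passed = 0
    for case in cases:
        for _ in range(trials):
            if task(case.word, case.letter) == case.expected:
                passed += 1
    return passed / (len(cases) * trials)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the simulation is repeatable

    # Variables live in the task: swap the model or prompt here, keep the data fixed.
    def task(word: str, letter: str) -> int:
        return flaky_counter(word, letter, rng, error_rate=0.2)

    print(f"pass rate over {len(CASES)} cases x 25 trials: {pass_rate(task, CASES, 25):.2%}")
```

A single run of each case could easily come back all green here, which is the "passes 10 manual tests, fails on input #11" trap. Tracking the rate over many trials, and over time, is what turns the check into a measurement.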
evals, v0, llm-apps
Original description
How to think about evaluating a non-deterministic system, and how to actually succeed at it.

About Ido Pesok
Ido Pesok is an engineer and researcher at Vercel, working on the AI behind v0 and focused on building reliable and intuitive AI systems.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

Timestamps:
00:00 Introduction to Vercel's V0 and its growth
01:00 The problem with AI unreliability
02:44 The "Fruit Letter Counter" app example of AI failure
03:33 Introducing "evals" and the basketball court analogy
05:09 Defining the "court": understanding the domain of user queries
07:53 Data collection for evals
09:13 Structuring evals: constants in data, variables in task
10:45 Scoring evals
12:35 Integrating evals into CI/CD
13:40 The benefits of using evals