Evals Are Not Unit Tests — Ido Pesok, Vercel v0

13.9K views · Aug 06, 2025 · 15:21 min

Takeaway

Treat application evals as statistical measurements over real user traffic, not unit tests — prompt tweaks alone never close the demo-to-prod gap.

Summary

Ido Pesok (Vercel v0) uses a "fruit letter counter" app to show why prompt engineering alone does not make an AI app reliable: GPT-4.1 can pass 10 manual tests and then fail on user input #11.
v0 just crossed 100M messages and recently shipped GitHub sync (push generated code, pull changes, switch branches, open PRs).
Argues application-layer evals are not unit tests: they're probabilistic, must reflect real user data, and need continuous monitoring rather than pass/fail gates.
Illustrates the demo-to-prod gap: AI apps look great in demos and start hallucinating once real users hit them, so evals must be data-driven and adversarial.
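
To make the "statistical measurement, not unit test" idea concrete, here is a minimal sketch that reports a pass rate with a confidence interval instead of a single pass/fail result. Everything in it is an assumption for illustration: the cases, the `fake_model` stand-in (which deliberately undercounts doubled letters so the sketch runs offline), and the choice of a Wilson score interval are not taken from the talk.

```python
import math

# Hypothetical eval cases: (fruit, letter, expected count). Illustrative only.
CASES = [
    ("strawberry", "r", 3),
    ("banana", "a", 3),
    ("apple", "p", 2),
    ("cherry", "r", 2),
    ("kiwi", "i", 2),
    ("mango", "z", 0),
]


def fake_model(fruit: str, letter: str) -> int:
    """Stand-in for an LLM call: counts a run of adjacent repeats only once."""
    count = 0
    prev = ""
    for ch in fruit:
        if ch == letter and prev != letter:
            count += 1
        prev = ch
    return count


def run_eval(cases, task):
    """Return (passes, total) rather than raising on the first failure."""
    passes = sum(1 for fruit, letter, expected in cases
                 if task(fruit, letter) == expected)
    return passes, len(cases)


def wilson_interval(passes: int, total: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total == 0:
        return 0.0, 1.0
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


passes, total = run_eval(CASES, fake_model)
low, high = wilson_interval(passes, total)
print(f"pass rate {passes}/{total} = {passes / total:.0%}, "
      f"95% interval {low:.2f} to {high:.2f}")
```

With only a handful of cases the interval is wide, which is one way to see why ten passing manual checks say little about input #11.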

evals · v0 · llm-apps

Original description

How to think about evaluating a non-deterministic system — and how to actually succeed at it.

About Ido Pesok
Ido Pesok is an engineer and researcher at Vercel, working on the AI behind v0 and focused on building reliable and intuitive AI systems.

Recorded at the AI Engineer World's Fair in San Francisco.

Timestamps:

00:00 Introduction to Vercel's v0 and its growth
01:00 The problem with AI unreliability
02:44 The "Fruit Letter Counter" app example of AI failure
03:33 Introducing "evals" and the basketball court analogy
05:09 Defining the "court": understanding the domain of user queries
07:53 Data collection for evals
09:13 Structuring evals: constants in data, variables in task
10:45 Scoring evals
12:35 Integrating evals into CI/CD
13:40 The benefits of using evals
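
The outline items on structuring, scoring, and CI can be read as the following minimal sketch: inputs and expected answers stay fixed in the data, while the things being tuned (here a system prompt) are bound in the task. This is one interpretation of those timestamps, not the speaker's code. `stub_model`, both prompts, and the last-token scorer are hypothetical; the stub only exists so the example runs without a real LLM.

```python
from typing import Callable

# Constants: inputs and expected answers live in the data and do not change
# between experiments. These cases are illustrative, not from the talk.
DATA = [
    {"input": {"fruit": "strawberry", "letter": "r"}, "expected": 3},
    {"input": {"fruit": "banana", "letter": "n"}, "expected": 2},
    {"input": {"fruit": "grape", "letter": "x"}, "expected": 0},
]


def make_task(system_prompt: str, model: Callable[[str, str], str]):
    """Variables (prompt, model) are bound here, so DATA stays untouched."""
    def task(case_input: dict) -> str:
        user = f"How many '{case_input['letter']}' are in '{case_input['fruit']}'?"
        return model(system_prompt, user)
    return task


def score(output: str, expected: int) -> float:
    """Deterministic scorer: 1.0 if the last token is exactly the expected number."""
    tokens = output.strip().split()
    return 1.0 if tokens and tokens[-1] == str(expected) else 0.0


def run(data, task) -> float:
    """Mean score over the dataset."""
    scores = [score(task(case["input"]), case["expected"]) for case in data]
    return sum(scores) / len(scores)


def stub_model(system_prompt: str, user: str) -> str:
    """Hypothetical stand-in for an LLM: always counts correctly, but only
    emits an easy-to-score format when the system prompt asks for it."""
    parts = user.split("'")
    letter, fruit = parts[1], parts[3]
    n = fruit.count(letter)
    if "end with the number" in system_prompt.lower():
        return f"Answer: {n}"
    return f"There are {n} of them."


# A CI job could print this comparison as a report on each change instead of
# using it as a hard pass/fail gate.
baseline = run(DATA, make_task("You count letters.", stub_model))
candidate = run(DATA, make_task("Count letters. End with the number only.", stub_model))
print(f"baseline {baseline:.0%} -> candidate {candidate:.0%} "
      f"({candidate - baseline:+.0%})")
```

Because the data is constant, any change in the score can be attributed to the variable that was changed in the task.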