Why building eval platforms is hard — Phil Hetzel, Braintrust

8.0K views · Apr 28, 2026 · 25:39 min

Takeaway

Eval platforms are multi-persona, multi-stage systems problems that quickly outgrow spreadsheets and homegrown loops once teams take agent quality seriously.

Summary

Braintrust positions itself as an 'agent quality' platform that treats pre-production evals and production observability as the same problem
Spreadsheet-based evals are a fine starting point but break down on cross-experiment comparison, analytics, and bringing in non-technical SMEs
Eval platforms are a multi-persona system problem (product/AI/systems engineers + SMEs), not a solo engineering exercise
'Vibe-code your own Braintrust' is a common but underestimated path — the iceberg of dataset management, scoring, looping, and team workflows becomes its own product
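
To make the "homegrown loop" concrete, here is a minimal sketch of the visible tip of that iceberg: a dataset, a scorer, a loop, and a cross-experiment comparison. It is not Braintrust's API or anything shown in the talk. Every name in it (`Case`, `exact_match`, `run_experiment`, `compare`, the toy tasks) is a hypothetical chosen for illustration, and the "models" are stubbed with lookup tables.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    """One dataset row: an input and the answer we consider 'good'."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Hypothetical scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(name: str, task: Callable[[str], str], dataset: list[Case],
                   scorer: Callable[[str, str], float] = exact_match) -> dict:
    """Run the task over every case and keep per-case scores plus the mean."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return {"name": name, "scores": scores, "mean": mean(scores)}


def compare(baseline: dict, candidate: dict) -> dict:
    """Compare two experiments case by case, not only by their averages."""
    pairs = list(zip(baseline["scores"], candidate["scores"]))
    return {
        "delta": candidate["mean"] - baseline["mean"],
        "regressions": [i for i, (b, c) in enumerate(pairs) if c < b],
        "improvements": [i for i, (b, c) in enumerate(pairs) if c > b],
    }


# Toy dataset and two stubbed "prompt versions" (assumptions, not real models).
dataset = [Case("2+2", "4"), Case("capital of France", "Paris"), Case("3*3", "9")]


def task_v1(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "paris"}.get(prompt, "unknown")


def task_v2(prompt: str) -> str:
    return {"2+2": "4", "3*3": "9"}.get(prompt, "unknown")


baseline = run_experiment("v1", task_v1, dataset)
candidate = run_experiment("v2", task_v2, dataset)
report = compare(baseline, candidate)
print(report)
```

In this toy run both versions have the same mean score, yet one case regresses and another improves. A single averaged column in a spreadsheet would hide that, which is the kind of cross-experiment comparison the summary says spreadsheets struggle with. The sketch also omits everything below the waterline: dataset versioning, labelling workflows, SME review, and persistence.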

evals · observability · braintrust

Original description

An eval platform is not just a test runner. You are building shared definitions of "good," reliable data pipelines, labelling workflows, versioning, and trust in results across many teams and model changes. This session breaks down the hidden complexity, the common failure modes, and the design principles that make evals credible and usable in day-to-day engineering.

Speaker info:
- https://www.linkedin.com/in/philliphetzel/