Why building eval platforms is hard — Phil Hetzel, Braintrust
Takeaway
Eval platforms are multi-persona, multi-stage systems problems that quickly outgrow spreadsheets and homegrown loops once teams take agent quality seriously.
Summary
- Braintrust positions itself as an 'agent quality' platform that treats pre-production evals and production observability as the same problem
- Spreadsheet-based evals are a fine starting point but break down on cross-experiment comparison, analytics, and bringing in non-technical SMEs
- Eval platforms are a multi-persona system problem (product/AI/systems engineers + SMEs), not a solo engineering exercise
- 'Vibe-code your own Braintrust' is a common but underestimated path: the iceberg of dataset management, scoring, looping, and team workflows becomes its own product
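
To make the "homegrown loop" concrete, here is a minimal sketch of what such a loop tends to look like on day one: a small dataset, a scorer, a run function, and a cross-experiment comparison. Everything in it (`Case`, `exact_match`, `run_experiment`, `find_regressions`, and the two toy tasks) is a hypothetical name invented for illustration; it is not Braintrust's API and not anything shown in the talk.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One dataset row: an input and the expected output (hypothetical schema)."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    task: Callable[[str], str],
    dataset: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> dict[str, float]:
    """Run the task over every case and return a score per input."""
    return {case.input: scorer(task(case.input), case.expected) for case in dataset}


def find_regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return the inputs whose score dropped between two experiments."""
    return [key for key in baseline if candidate.get(key, 0.0) < baseline[key]]


DATASET = [Case("2+2", "4"), Case("capital of France", "Paris")]


def task_v1(prompt: str) -> str:
    # Stand-in for a real model or agent call (assumption for the sketch).
    return {"2+2": "4", "capital of France": "paris"}.get(prompt, "")


def task_v2(prompt: str) -> str:
    # A "new version" that regresses on one case.
    return {"2+2": "4", "capital of France": "Lyon"}.get(prompt, "")


baseline = run_experiment(task_v1, DATASET)
candidate = run_experiment(task_v2, DATASET)
regressions = find_regressions(baseline, candidate)

print(f"baseline mean:  {mean(baseline.values()):.2f}")
print(f"candidate mean: {mean(candidate.values()):.2f}")
print(f"regressions:    {regressions}")
```

Even this toy version leaves out most of the iceberg named above: dataset versioning, persisted experiment history, non-exact scorers, and a way for SMEs to review or label results without reading code.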
evals, observability, braintrust
Original description
An eval platform is not just a test runner. You are building shared definitions of "good," reliable data pipelines, labelling workflows, versioning, and trust in results across many teams and model changes. This session breaks down the hidden complexity, the common failure modes, and the design principles that make evals credible and usable in day-to-day engineering.
Speaker info: https://www.linkedin.com/in/philliphetzel/