How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe
Takeaway
Pick the eval metric to match the application type (RAG vs code-gen vs agent) and treat evals as a continuous, data-centric, parallelizable engineering practice.
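One way to make "metric per application type" concrete is a small registry that maps each application type to its own scoring functions. The sketch below is illustrative only and is not from the talk: the record fields (`expected`, `actual`, `code`, `entry_point`, `cases`, `expected_steps`, `actual_steps`) and all function names are hypothetical, the similarity metric is a simple token-level Jaccard overlap standing in for whatever similarity measure a team actually uses, and `exec` is used for brevity where real generated code would need to run in a sandbox.

```python
from typing import Callable, Dict, List

Record = dict
Metric = Callable[[Record], float]


def exact_match(r: Record) -> float:
    """RAG-style accuracy: case-insensitive exact match."""
    return 1.0 if r["expected"].strip().lower() == r["actual"].strip().lower() else 0.0


def token_jaccard(r: Record) -> float:
    """RAG-style similarity: Jaccard overlap of lowercase tokens (a stand-in metric)."""
    a = set(r["expected"].lower().split())
    b = set(r["actual"].lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def functional_correctness(r: Record) -> float:
    """Code-gen: fraction of (args, expected) cases the generated function passes.

    Assumption: r["code"] defines a function named r["entry_point"].
    exec() is unsafe on untrusted code; a real harness would sandbox this.
    """
    namespace: dict = {}
    try:
        exec(r["code"], namespace)
        fn = namespace[r["entry_point"]]
    except Exception:
        return 0.0
    if not r["cases"]:
        return 0.0
    passed = 0
    for args, want in r["cases"]:
        try:
            if fn(*args) == want:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(r["cases"])


def trajectory_in_order(r: Record) -> float:
    """Agent: fraction of expected steps that appear, in order, in the actual trajectory."""
    expected, actual = r["expected_steps"], r["actual_steps"]
    if not expected:
        return 1.0
    matched = 0
    for step in actual:
        if matched < len(expected) and step == expected[matched]:
            matched += 1
    return matched / len(expected)


# Hypothetical registry: each application type gets its own metric set.
METRICS_BY_APP_TYPE: Dict[str, List[Metric]] = {
    "rag": [exact_match, token_jaccard],
    "codegen": [functional_correctness],
    "agent": [trajectory_in_order],
}


def evaluate(app_type: str, record: Record) -> Dict[str, float]:
    """Score one record with every metric registered for its application type."""
    return {m.__name__: m(record) for m in METRICS_BY_APP_TYPE[app_type]}
```

Calling `evaluate("agent", {...})` returns a dictionary keyed by metric name, so scores from different application types can be logged side by side without pretending they measure the same thing.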
Summary
- An applied-AI lead at Adobe argues that evals are now the central discipline for nondeterministic LLM apps, with 'eval-driven development' replacing test-driven development.
- Start with data: synthetic seeds, continuous refinement, labeled subsets, and multiple datasets per flow, because one dataset is never enough.
- Different application types need different metrics: RAG Q&A uses accuracy/similarity/usefulness; code-gen needs functional correctness and robustness; agents need trajectory evaluation and multi-turn simulation.
- Scale evals by caching intermediate results, orchestrating parallel runs, running evals frequently, and aggregating the results; balance human-in-the-loop fidelity against automation speed.
- Mantra: measure, monitor, analyze, iterate — process over tools.
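The scaling ideas in the summary (caching, parallelism, aggregation) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the speaker's implementation: `CachedRunner`, `run_eval`, the `version` key, and the dataset fields `prompt` and `expected` are all hypothetical names, the cache lives in memory rather than on disk, and `generate` stands in for whatever model or pipeline call is being evaluated.

```python
import hashlib
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List


class CachedRunner:
    """Wraps a generate(prompt) callable and caches outputs per (version, prompt)."""

    def __init__(self, generate: Callable[[str], str], version: str) -> None:
        self._generate = generate
        self._version = version  # bump this to invalidate cached outputs
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()
        self.calls = 0  # number of real (uncached) generate calls

    def _key(self, prompt: str) -> str:
        payload = json.dumps({"v": self._version, "p": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, prompt: str) -> str:
        key = self._key(prompt)
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        # Generation happens outside the lock so workers can run concurrently.
        # Two workers given the same uncached prompt may both generate; that is
        # wasted work but not incorrect for a deterministic generate().
        output = self._generate(prompt)
        with self._lock:
            self.calls += 1
            self._cache[key] = output
        return output


def run_eval(
    runner: CachedRunner,
    dataset: List[dict],
    score: Callable[[str, str], float],
    max_workers: int = 4,
) -> Dict[str, float]:
    """Score every example in parallel and aggregate into a small report."""

    def score_one(example: dict) -> float:
        return score(example["expected"], runner.run(example["prompt"]))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_one, dataset))
    return {"n": len(scores), "mean_score": mean(scores) if scores else 0.0}
```

Because outputs are keyed by version and prompt, rerunning the same dataset with a new scoring function reuses the cached generations, which is what makes frequent runs affordable. Threads suit I/O-bound model calls; a CPU-bound scorer would be a better fit for a process pool.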
evals, adobe, trajectory
Original description
https://www.linkedin.com/in/mukteshkrmishra/