How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe

787 views · Jul 22, 2025 · 9:24 min

Takeaway

Pick the eval metric to match the application type (RAG vs code-gen vs agent) and treat evals as a continuous, data-centric, parallelizable engineering practice.
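As a rough illustration of matching the metric to the application type, the sketch below keeps one scoring function per type in a small registry. Everything here is an assumption made for illustration and not something shown in the talk: the case field names (`expected`, `output`, `entry_point`, `tests`, `expected_steps`, `steps`), the three toy metrics, and the `score` helper. Real metrics would be richer than these.

```python
from typing import Callable, Dict

# Hypothetical eval case: a plain dict. All field names are assumptions.
Case = Dict[str, object]


def token_overlap(case: Case) -> float:
    """RAG-style similarity proxy: share of expected tokens found in the output."""
    expected = set(str(case["expected"]).lower().split())
    actual = set(str(case["output"]).lower().split())
    if not expected:
        return 0.0
    return len(expected & actual) / len(expected)


def functional_correctness(case: Case) -> float:
    """Code-gen: run the generated function against test cases, return pass rate.

    exec() on model output is unsafe; a real harness would sandbox this step.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(str(case["output"]), namespace)
        fn = namespace[str(case["entry_point"])]
        tests = list(case["tests"])
        passed = sum(1 for args, want in tests if fn(*args) == want)
        return passed / len(tests)
    except Exception:
        # Any crash (syntax error, missing function, bad call) scores zero.
        return 0.0


def trajectory_match(case: Case) -> float:
    """Agent: share of expected steps that appear, in order, in the actual steps."""
    expected = list(case["expected_steps"])
    remaining = iter(case["steps"])
    # `step in remaining` consumes the iterator, so order is respected.
    matched = sum(1 for step in expected if step in remaining)
    return matched / len(expected) if expected else 0.0


# One metric per application type; the keys are illustrative labels.
METRICS: Dict[str, Callable[[Case], float]] = {
    "rag": token_overlap,
    "codegen": functional_correctness,
    "agent": trajectory_match,
}


def score(app_type: str, case: Case) -> float:
    """Dispatch to the metric registered for this application type."""
    return METRICS[app_type](case)
```

A registry like this keeps the choice of metric explicit, so adding a new application type means adding one function and one entry instead of changing the runner.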

Summary

Muktesh Mishra, an applied-AI lead at Adobe, argues that evals are now the central discipline for nondeterministic LLM apps: 'eval-driven development' replaces test-driven development.
Start with data: seed it synthetically, refine it continuously, label subsets, and keep multiple datasets per flow; one dataset is never enough.
Different application types need different metrics: RAG Q&A uses accuracy/similarity/usefulness; code-gen needs functional correctness and robustness; agents need trajectory evaluation and multi-turn simulation.
Scale evals by caching intermediate results, orchestrating runs in parallel, and running often and aggregating the results; balance human-in-the-loop fidelity against automation speed.
Mantra: measure, monitor, analyze, iterate — process over tools.
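The scaling points above (cache intermediate results, run in parallel, aggregate) could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the speaker's setup: `CachedRunner`, `run_suite`, `fake_generate`, and `exact_match` are hypothetical names, the cache is an in-memory dict, and the "model" is a stand-in function.

```python
import hashlib
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List

Case = Dict[str, str]


class CachedRunner:
    """Wraps an expensive generate() call with an in-memory cache keyed by case content."""

    def __init__(self, generate: Callable[[Case], str]) -> None:
        self._generate = generate
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()
        self.misses = 0  # number of outputs actually generated and stored

    def __call__(self, case: Case) -> str:
        key = hashlib.sha256(json.dumps(case, sort_keys=True).encode("utf-8")).hexdigest()
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        # Generate outside the lock so different cases can run concurrently.
        output = self._generate(case)
        with self._lock:
            self._cache[key] = output
            self.misses += 1
        return output


def run_suite(cases: List[Case], runner: Callable[[Case], str],
              metric: Callable[[Case, str], float], workers: int = 4) -> Dict[str, float]:
    """Run all cases in parallel, score each output, and aggregate the scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(runner, cases))  # map preserves input order
    scores = [metric(case, out) for case, out in zip(cases, outputs)]
    return {"n": len(scores), "mean": mean(scores), "min": min(scores)}


# Stand-ins for a model call and a metric (assumptions for illustration only).
def fake_generate(case: Case) -> str:
    return case["question"].strip().lower()


def exact_match(case: Case, output: str) -> float:
    return 1.0 if output == case["expected"] else 0.0


cases = [
    {"question": " Hello ", "expected": "hello"},
    {"question": "World", "expected": "earth"},
]
runner = CachedRunner(fake_generate)
first = run_suite(cases, runner, exact_match)
second = run_suite(cases, runner, exact_match)  # served entirely from the cache
```

Here the second run reuses cached outputs, so frequent re-runs only pay for cases that changed. A persistent cache and a process pool or job queue would be the natural next step for larger suites.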

evals · adobe · trajectory

Original description

https://www.linkedin.com/in/mukteshkrmishra/