Mission-Critical Evals at Scale (Learnings from 100k medical decisions)

3.4K views · Feb 22, 2025 · 12:14 min

Takeaway

Real-time reference-free evals (LLM-as-judge + confidence) prioritize human review where it matters and let mission-critical AI scale beyond what clinicians could ever cover.

Summary

Anterior runs prior-authorization AI for insurance covering 50M American lives, processing >100k medical decisions/day with zero tolerance for error.
Human review doesn't scale: a 5% review rate on 100k decisions/day means 5,000 clinician reviews/day, and offline eval datasets lag behind real-world edge cases.
Solution: real-time, reference-free evals. An LLM-as-judge combined with logit-based confidence yields a confidence grade for each decision without needing ground truth.
These confidence grades are used to dynamically prioritize the highest-risk cases for human review and to predict aggregate performance live across all decisions.
Validating the validator: human reviews on flagged cases continuously recalibrate the eval system in a virtuous cycle.
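
The summary does not say how the logit-based confidence is computed, so the following is only a minimal sketch of one plausible reading, not Anterior's implementation. It assumes the judge model exposes log-probabilities for two verdict tokens ("correct" and "incorrect"), normalizes them into a confidence score, sends the lowest-confidence decisions to clinicians within a fixed review budget, and uses mean confidence as a live estimate of aggregate accuracy. All names here (JudgedDecision, confidence, prioritize_for_review, predicted_accuracy) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class JudgedDecision:
    """One AI decision plus the judge's verdict log-probabilities (assumed inputs)."""
    case_id: str
    logprob_correct: float    # judge log-probability of the "correct" verdict token
    logprob_incorrect: float  # judge log-probability of the "incorrect" verdict token


def confidence(d: JudgedDecision) -> float:
    """Normalize the two verdict log-probabilities into P(correct).

    Subtracting the max before exponentiating keeps the computation stable
    for very negative log-probabilities.
    """
    m = max(d.logprob_correct, d.logprob_incorrect)
    p_correct = math.exp(d.logprob_correct - m)
    p_incorrect = math.exp(d.logprob_incorrect - m)
    return p_correct / (p_correct + p_incorrect)


def prioritize_for_review(decisions: list[JudgedDecision],
                          review_budget: int) -> list[JudgedDecision]:
    """Return the `review_budget` decisions the judge is least confident in."""
    return sorted(decisions, key=confidence)[:review_budget]


def predicted_accuracy(decisions: list[JudgedDecision]) -> float:
    """Mean confidence: the expected fraction of correct decisions.

    This is only meaningful if the confidences are calibrated, which is what
    the human reviews on flagged cases would be used to check.
    """
    return sum(confidence(d) for d in decisions) / len(decisions)


if __name__ == "__main__":
    # Toy batch with made-up log-probabilities, for illustration only.
    batch = [
        JudgedDecision("case-a", math.log(0.90), math.log(0.10)),
        JudgedDecision("case-b", math.log(0.60), math.log(0.40)),
        JudgedDecision("case-c", math.log(0.99), math.log(0.01)),
    ]
    # The summary's 5% review rate on 100k decisions/day gives this budget.
    daily_budget = round(0.05 * 100_000)
    flagged = prioritize_for_review(batch, review_budget=1)
    print("daily review budget:", daily_budget)
    print("flagged for review:", [d.case_id for d in flagged])
    print("predicted accuracy:", predicted_accuracy(batch))
```

Sorting by confidence is the simplest prioritization rule. A production system would presumably also weigh the clinical risk of each case, and would recalibrate the scores against the clinician verdicts collected on flagged cases.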

evals · healthcare · llm-as-judge

Original description

So you've built your LLM product, have paying customers and your LLM throughput is increasing. Great! But scale introduces its own problems: it'll uncover new edge case user inputs and failure cases that your current evaluations don't capture.

And what if you just can't afford to make mistakes? (At Anterior, our product helps health insurers make decisions around approving medical treatment - this is mission-critical, with no room for error!)

The solution? A scalable and self-auditing reference-free evaluation system (rolls off the tongue, right?).

In this talk, we'll explain how to build one, why it should run real-time and how building this system provides company defensibility.

For further details and discussion, see: https://chrislovejoy.me/mission-critical-evals