Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear
Takeaway
Pick a single business-aligned reliability metric for your AI app and iterate the prompt, model and data against it — generic NLP metrics are noise.
Summary
- Multinear's Dmitry Kuchin: POCs reach 50% reliability easily, but getting the last 50% requires a data-science loop, not data-science metrics.
- Critique of generic metrics (groundedness, factuality, bias) — they don't tell you if the app helps users.
- Real example: customer support bot at Twix — best north-star metric was rate of escalation to humans, not factuality.
- Treat the AI app like a continuous experiment: track prompt, model, and data changes against task-specific business metrics.
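
The loop described in the last two bullets can be sketched in a few lines. This is a minimal illustration, not the speaker's tooling: the names `Variant`, `Conversation`, `escalation_rate`, and `compare_variants` are hypothetical, and it assumes each support conversation is logged with the prompt, model, and data version that served it plus a flag for whether it was escalated to a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One experiment configuration (all field names are assumptions)."""
    prompt_version: str
    model: str
    dataset_version: str


@dataclass
class Conversation:
    """A logged support conversation and its business outcome."""
    variant: Variant
    escalated: bool  # True if the bot handed off to a human


def escalation_rate(conversations):
    """Fraction of conversations escalated to a human (lower is better)."""
    if not conversations:
        raise ValueError("need at least one conversation")
    return sum(c.escalated for c in conversations) / len(conversations)


def compare_variants(conversations):
    """Group logged conversations by variant and score each group."""
    by_variant = {}
    for conv in conversations:
        by_variant.setdefault(conv.variant, []).append(conv)
    return {v: escalation_rate(convs) for v, convs in by_variant.items()}
```

In use, each change to the prompt, model, or data becomes a new `Variant`, and a change is kept only if the escalation rate on comparable traffic goes down. A real setup would also need enough conversations per variant for the difference to be meaningful.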
evals, llm-apps, reliability
Original description
[last round of Attendee-Led 10min lightning talks] Practical tactics to build reliable AI apps. Reverse engineering real-world evals with o3. Nobody does it this way. Companies pay me $500/h for this knowledge. I help them get from POC that works 50% of the time - to the solution they can trust to deploy to production. Recorded at the AI Engineer World's Fair in San Francisco.