Iterating on LLM apps at scale: Learnings from Discord: Ian Webster

3.8K views · Nov 22, 2024 · 18:26 min
Takeaway

At Discord scale, simple deterministic evals that run on every PR, like unit tests, beat elaborate LLM-graded eval pipelines for shipping safely.

Summary

  • Ian Webster led Clyde AI, Discord's chatbot, which reached 200M users; the hard part wasn't model quality but preventing failures at scale: bomb-making instructions reaching kids, harassment, and racism.
  • Treat evals like unit tests: small, fast, deterministic, local, no cloud dependency; example: 'output starts with a lowercase letter' worked >80% as well as an LLM grader for 'casual chat personality'.
  • Split test suites by sub-task (tool triggering vs static-context summarization); resist piling a fix for every failure mode into the prompt, because returns diminish and then turn negative.
  • Prompts are a form of vendor lock-in; calibrate per-model when switching from GPT to Claude/Llama.
  • Use existing observability (Discord used Datadog) rather than LLM-specific tools; use the Promptfoo CLI for declarative local evals that run on every PR.
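
The "evals as unit tests" idea can be sketched in a few lines of Python. This is a minimal illustration, not the code used at Discord: the function names, the sample outputs, and the way the pass rate is computed are all assumptions made for the example. Only the lowercase-first-letter check itself comes from the talk.

```python
from typing import Callable, List


def starts_lowercase(output: str) -> bool:
    """Deterministic proxy for 'casual chat personality' (check from the talk).

    Leading whitespace is ignored; empty outputs and outputs that start
    with a digit, emoji, or uppercase letter fail the check.
    """
    stripped = output.lstrip()
    return bool(stripped) and stripped[0].islower()


def pass_rate(outputs: List[str], check: Callable[[str], bool]) -> float:
    """Fraction of outputs that satisfy a deterministic check."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if check(o)) / len(outputs)


# Hypothetical model outputs. In a real suite these would come from
# running the prompt under test against a fixed set of inputs.
SAMPLE_OUTPUTS = [
    "hey! what's up",
    "lol yeah that works",
    "Certainly! Here is the information you requested.",
]

if __name__ == "__main__":
    rate = pass_rate(SAMPLE_OUTPUTS, starts_lowercase)
    print(f"casual-tone pass rate: {rate:.2f}")
```

Because the check is a pure function with no network or cloud dependency, it can run locally and on every PR alongside ordinary unit tests. One suite per sub-task keeps each check small and its failures easy to attribute.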
evals · discord · promptfoo
Original description
Discover best practices for rapid evaluation and iteration on LLM apps in large-scale applications, with a first-hand account from Discord's engineering team. This talk covers development workflow and evaluation methodology in order to measure model & prompt improvements, mitigate risks, and speed up development. We'll discuss the best practices that we refined and implemented internally, the tooling and automation that got us shipping improvements consistently, and some of the strange and wonderful things that happen with LLMs in the wild.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule

About Ian
Ian Webster is a Senior Staff Engineer at Discord and the maintainer of Promptfoo, a popular LLM evaluation tool. At Discord he leads teams that successfully scaled AI-based products to millions of users while navigating the many new challenges presented by LLMs.