Evaluating Domain Specific LLMs for Real World Finance — Waseem Alshikh, Writer

9.5K views · Apr 22, 2025 · 12:01 min

Takeaway

General LLMs collapse on noisy real-world finance queries and context; domain-specific models stay robust on the failure modes that matter in production.

Summary

Writer built FAIL, a finance evaluation benchmark covering real-world query and context failure modes: misspellings, incomplete queries, out-of-domain queries, and corrupted context.
It tests whether general models that score ~90% on standard benchmarks still beat Writer's Palmyra Fin and other domain-specific models when conditions degrade.
It covers two failure axes: query failure (user-side noise) and context failure (RAG/document-side noise). Domain models hold up better under both.
The results justify continued investment in domain-specific models even as frontier general models reach high average accuracy.
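
To make the two failure axes concrete, here is a minimal sketch of how noisy variants of a clean (query, context) pair might be generated for this kind of evaluation. It is not Writer's FAIL implementation: the function names (`misspell`, `truncate`, `corrupt_context`, `build_variants`) and the noise models (adjacent-letter swaps, keeping the first half of the words, deleting every fifth character) are all assumptions chosen for illustration. Out-of-domain queries are left out because they need new questions rather than a perturbation of an existing one.

```python
import random


def misspell(query: str, rate: float = 0.2, seed: int = 0) -> str:
    """Query failure: swap two adjacent inner letters in some words (assumed noise model)."""
    rng = random.Random(seed)
    words = []
    for word in query.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # keep first and last character in place
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


def truncate(query: str, keep: float = 0.5) -> str:
    """Query failure: keep only the leading fraction of words (incomplete query)."""
    words = query.split()
    n = max(1, int(len(words) * keep))
    return " ".join(words[:n])


def corrupt_context(context: str, step: int = 5) -> str:
    """Context failure: delete every `step`-th character (crude stand-in for corrupted text)."""
    return "".join(ch for i, ch in enumerate(context) if (i + 1) % step)


def build_variants(query: str, context: str) -> dict:
    """Return the clean pair plus one variant per hypothetical failure mode."""
    return {
        "clean": (query, context),
        "misspelled": (misspell(query, rate=1.0), context),
        "incomplete": (truncate(query), context),
        "corrupted_context": (query, corrupt_context(context)),
    }
```

Each variant would then be sent to the models under comparison and scored against the clean-pair answer, so that accuracy can be reported per failure mode rather than as a single average.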

evals · finance · domain-specific

Original description

In today's rapidly evolving financial landscape, AI applications and agents are transforming high-value workflows, like risk assessment, fraud detection, and customer service. As financial institutions increasingly integrate AI into their operations, the reliability and trustworthiness of these systems – and the underlying LLMs – are paramount. The consequences of inaccurate outputs in finance can be significant, ranging from financial losses to reputational damage, underscoring the need for models that are tailored for industry-specific challenges like nuanced terminology and complex regulatory requirements. In this session, Waseem AlShikh, CTO and Co-founder of Writer, will challenge the scaling narrative around general-purpose models and demonstrate how domain-specific LLMs, particularly within the high-stakes finance industry, are delivering state-of-the-art performance, without the need for endless pre-training budgets and resources. Waseem will also introduce a groundbreaking financial benchmark that evaluates the performance of AI systems under the complexities and pressures of the financial domain. Drawing on real-world financial applications, Waseem will demonstrate how to evaluate systems that deliver real ROI in complex, mission-critical scenarios.

Recorded live at the Leadership Track Session Day from the AI Engineer Summit 2025 in New York. Learn more at https://ai.engineer and purchase tickets to our next event, the AI Engineer World's Fair, in SF June 3 - 5 here: https://ti.to/software-3/ai-engineer-worlds-fair-2025

About Waseem

Waseem Alshikh is the Chief Technology Officer and Co-founder of Writer, the full-stack generative AI platform trusted by the world’s leading enterprises to solve mission-critical business challenges and unleash people’s best work.  An accomplished tech executive with deep expertise in artificial intelligence, machine learning, and natural language processing, Waseem has led Writer to become one of the fastest-growing companies in enterprise generative AI, chosen by hundreds of global leaders like Accenture, Intuit, L’Oreal, Uber, and Vanguard.  Under Waseem’s leadership, Writer has developed a fully integrated solution to build and deploy secure and reliable generative AI applications and agents across the enterprise. Writer’s Palmyra family of large language models (LLMs) is recognized for state-of-the-art performance, topping leaderboards in natural language understanding and generation, and Writer’s novel approach to graph-based RAG leads industry benchmarks.  Founded in 2020, Writer is backed by strategic investors, including ICONIQ Growth, Insight Partners, WndrCo, Balderton Capital, and Aspect Ventures. Waseem has been recognized for his contributions to the tech industry, earning numerous awards and accolades. He holds degrees in Electronics from Beirut Arab University and Damascus Polytechnic University.