RAG Evaluation Is Broken! Here's Why (And How to Fix It) - Yuval Belfer and Niv Granot
Takeaway
Standard chunk-retrieve-rerank RAG collapses on aggregative queries; structured RAG that builds an SQL schema for each subcorpus during ingestion is a practical fix.
Summary
- AI21 Labs argues most RAG benchmarks are 'local-question/local-answer' and don't test aggregation, counting, or all-of-X queries common in real corpora
- On a 22-document FIFA World Cup Wikipedia corpus, standard LangChain/LlamaIndex RAG scored 5% and OpenAI Responses scored 11% on aggregative questions
- Proposed fix: at ingestion, cluster documents into subcorpora, infer a schema per cluster, and populate an SQL database with the extracted values; at inference, route the query to the correct schema and answer it with text-to-SQL
- Caveats: not every corpus is relational, normalization is hard (West Germany vs Germany), and text-to-SQL on complex schemas remains a known challenge
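The ingestion-then-query idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speakers' implementation: the `world_cups` table, the `ALIASES` map, and the helper names are hypothetical, the records are hand-entered stand-ins for what an LLM extraction step would produce, and the SQL is written by hand where a real system would use text-to-SQL.

```python
import sqlite3

# Hand-entered records standing in for LLM extraction output at ingestion.
# In a real pipeline each record would come from one tournament document.
EXTRACTED = [
    {"year": 1954, "host": "Switzerland", "winner": "West Germany"},
    {"year": 1974, "host": "West Germany", "winner": "West Germany"},
    {"year": 1990, "host": "Italy", "winner": "West Germany"},
    {"year": 2014, "host": "Brazil", "winner": "Germany"},
    {"year": 2018, "host": "Russia", "winner": "France"},
]

# Hypothetical normalization map; building one reliably is the hard part.
ALIASES = {"West Germany": "Germany"}


def build_db(records, aliases=ALIASES):
    """Populate an in-memory SQLite table, normalizing entity names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE world_cups (year INTEGER PRIMARY KEY, host TEXT, winner TEXT)"
    )
    conn.executemany(
        "INSERT INTO world_cups VALUES (?, ?, ?)",
        [
            (
                r["year"],
                aliases.get(r["host"], r["host"]),
                aliases.get(r["winner"], r["winner"]),
            )
            for r in records
        ],
    )
    return conn


def count_wins(conn, team):
    """Answer 'how many times did <team> win?' with one aggregate query.

    The SQL here is hand-written; a text-to-SQL model would generate it.
    """
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM world_cups WHERE winner = ?", (team,)
    ).fetchone()
    return n
```

Calling `count_wins(build_db(EXTRACTED), "Germany")` counts across every ingested row rather than only the top-k retrieved chunks, which is the property aggregative questions need. Passing `aliases={}` to `build_db` shows the normalization caveat: the West Germany rows no longer match a query for Germany.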
rag, evals, structured-rag
Original description
Optimizing local benchmarks, chunking strategies, perfect retrieval scores. If you just nodded along, you're one of many developers building RAG systems optimized for metrics that don't matter in the real world. But what if our entire approach to evaluating retrieval-augmented generation is fundamentally flawed? The uncomfortable truth is that current RAG benchmarks reward systems that fail spectacularly on realistic information retrieval tasks. In this talk, I'll expose the critical gaps in how we evaluate RAG systems today, from the chunking catch-22 to the myth of perfectly contained information. Using examples like the "Seinfeld Test," we'll explore why high benchmark scores often lead to disappointed users. You'll learn practical strategies for meaningful RAG evaluation that reflects how information actually works in the wild, helping you build systems that impress not just benchmark leaderboards, but actual humans. To learn more, check out the full episode on RAG evaluation on YAAP: https://youtu.be/RsSkwpTmn8o?si=9gIR6EeIzPgbqY4O