open-rag-eval: RAG Evaluation without "golden" answers — Ofer Mendelevitch, Vectara

1.8K views · Jun 03, 2025 · 5:02 min

Takeaway

open-rag-eval scores RAG quality without golden datasets, using nugget-based evaluation for generation and UMBRELA relevance scoring for retrieval.

Summary

Open-source RAG evaluation framework (Vectara + University of Waterloo's Jimmy Lin lab) that needs no golden answers or golden chunks.
The UMBRELA metric scores each retrieved chunk from 0 to 3 for relevance to the query; per Waterloo research, these scores correlate well with human judgment.
AutoNuggetizer evaluates generation: it creates atomic 'nuggets', labels each as vital or okay, keeps the top 20, then has an LLM judge verify that each nugget is supported by the response.
Two additional metrics: citation faithfulness (full, partial, or no support) and hallucination detection using Vectara's HHEM model.
Connectors for Vectara, LangChain, and LlamaIndex; results are viewable in a drag-and-drop UI at open-evaluation.ai.
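
The nugget-based generation check summarized above can be illustrated with a minimal sketch. This is not the open-rag-eval API: the `Nugget` class, `select_top`, `nugget_score`, and the toy `keyword_judge` are hypothetical names introduced here. The final score as a plain fraction of supported nuggets is an assumption, and the keyword matcher only stands in for the LLM judge the talk describes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Nugget:
    """One atomic fact a good answer is expected to contain (hypothetical type)."""
    text: str
    importance: str  # "vital" or "okay", as in the talk


def select_top(nuggets: List[Nugget], k: int = 20) -> List[Nugget]:
    """Keep at most k nuggets, vital ones first (stable sort keeps input order)."""
    return sorted(nuggets, key=lambda n: n.importance != "vital")[:k]


def nugget_score(
    nuggets: List[Nugget],
    response: str,
    judge: Callable[[str, str], bool],
    k: int = 20,
) -> float:
    """Fraction of selected nuggets the judge finds supported by the response.

    Assumption: a simple unweighted fraction; the real framework may
    weight vital and okay nuggets differently.
    """
    chosen = select_top(nuggets, k)
    if not chosen:
        return 0.0
    supported = sum(1 for n in chosen if judge(n.text, response))
    return supported / len(chosen)


def keyword_judge(nugget_text: str, response: str) -> bool:
    """Toy stand-in for an LLM judge: every nugget word must appear in the response."""
    haystack = response.lower()
    return all(word in haystack for word in nugget_text.lower().split())


nuggets = [
    Nugget("paris capital", "okay"),
    Nugget("eiffel tower", "vital"),
    Nugget("louvre museum", "okay"),
]
response = "Paris is the capital of France and home to the Eiffel Tower."
score = nugget_score(nuggets, response, keyword_judge)
print(f"nugget score: {score:.2f}")
```

In practice `judge` would wrap a call to an LLM that returns whether the response supports the nugget; passing it in as a callable keeps the scoring logic independent of any particular model or provider.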

rag-eval · golden-free · open-source

Original description

Open-RAG-Eval is an open-source framework that revolutionizes RAG evaluation by harnessing the power of LLM judges for scalable, automated evaluation without the need for golden answers or golden chunks. Building on pioneering research from the University of Waterloo, this framework integrates innovative tools like UMBRELA for reference-free relevance scoring and AutoNuggetizer for automated fact-checking. Designed with a flexible connectors architecture, it seamlessly plugs into any RAG pipeline while delivering fast, transparent, and interpretable metrics on retrieval, generation, and hallucination in RAG.