The RAG Stack We Landed On After 37 Fails - Jonathan Fernandes

10.8K views · Jun 03, 2025 · 18:52 min

Takeaway

A reliable on-prem RAG stack is LlamaIndex + Qdrant + open BAAI/NVIDIA embeddings + Llama/Qwen served via Ollama or TGI, with tracing built in.

Summary

Jonathan Fernandes shares his production-tested RAG stack distilled from 37 failed attempts, mostly in financial services where on-prem is required.
Orchestration: LlamaIndex for both prototyping and production, with LangGraph as a prototyping option. Vector database: Qdrant, chosen for its scale-out behavior.
Embeddings: BAAI or NVIDIA open models for production; closed APIs for prototyping speed.
LLM serving: Ollama or Hugging Face TGI in Docker for on-prem deployment of Llama 3.2 or Qwen 3 (4B). The stack also includes tracing/observability for troubleshooting.
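
To make the shape of this stack concrete, here is a minimal, dependency-free sketch of the flow the summary describes: embed, retrieve, build a prompt, send it to a locally served model, and trace each step. It is not the code from the talk and does not use the LlamaIndex or Qdrant APIs. The bag-of-words `embed` function and the brute-force `retrieve` function are toy stand-ins for a real embedding model and vector database, and `traced`, `DOCS`, and the model tag `llama3.2` are illustrative assumptions. The request targets Ollama's `/api/generate` endpoint on its default local port.

```python
import json
import math
import re
import time
import urllib.request
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: bag-of-words token counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    # Toy stand-in for a vector database: brute-force top-k search.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query, contexts):
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def ollama_request(prompt, model="llama3.2", host="http://localhost:11434"):
    # Builds (but does not send) a non-streaming request to Ollama.
    # The model tag and host are assumptions; adjust to your deployment.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    # Sends the request; needs a running Ollama server, so the demo
    # below does not call it.
    with urllib.request.urlopen(ollama_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]


def traced(name, trace, fn, *args):
    # Hypothetical tracing helper: records the duration of each step.
    start = time.perf_counter()
    result = fn(*args)
    trace.append({"step": name, "seconds": time.perf_counter() - start})
    return result


DOCS = [
    "Qdrant is a vector database that stores embeddings.",
    "Ollama serves open models such as Llama on local hardware.",
    "Tracing records each pipeline step for troubleshooting.",
]

trace = []
query = "Which vector database stores embeddings?"
hits = traced("retrieve", trace, retrieve, query, DOCS, 1)
prompt = traced("build_prompt", trace, build_prompt, query, hits)
```

With an Ollama server running locally, `generate(prompt)` would return the model's answer. In a real deployment the stand-ins would be swapped for the actual embedding model, vector database, and tracing tool, while the step boundaries recorded in `trace` stay the same.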

rag · llamaindex · qdrant

Original description

Retrieval returning irrelevant results? Can't deploy solutions in the cloud? If these questions keep you up at night, you're likely experiencing the common frustrations of building an effective RAG system. But what if we could systematically optimise each component of the pipeline? 

In this talk, I'll share the insights gained from 37 failed attempts, demonstrating live, with documents from a knowledge base, how each optimisation impacts the end result. You'll walk away understanding how to diagnose the weaknesses in your RAG pipeline and apply targeted improvements that dramatically boost performance in real-world applications.