The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked

25.8K views · May 05, 2026 · 18:29 min
Takeaway

Small-model inference for retrieval is a distinct infra problem from LLM serving, and existing engines underserve it.

Summary

  • Superlinked open-sourced an inference engine specifically for small models used in AI search and document processing — embeddings, rerankers, classifiers — tested with Chroma, Qdrant, Weaviate, LanceDB.
  • Argues the inference ecosystem (vLLM, TGI) is built for large generative LLMs but ignores the throughput patterns of small embedding/reranker models used in retrieval pipelines.
  • Frames model inference as a 'yin and yang': batched, throughput-optimized serving combined with low-latency request routing for retrieval workloads.
  • Targets the infrastructure gap between research models and production retrieval/agent pipelines.
inference · embeddings · rag
Original description
Most embedding infrastructure assumes you know exactly which model you want ahead of time. This talk starts where that assumption breaks. Filip Makraduli walks through the real profiling mistakes, infrastructure gaps, and production constraints that led to building an embedding inference engine designed for dynamic model loading, hot-swapping, and memory-aware eviction instead of brittle one-model-per-container deployments.

If you're working on small-model inference, embeddings, or GPU infrastructure, this is a practical look at what breaks in the real world and how to design around it.
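
The description names dynamic model loading and memory-aware eviction without showing how they fit together. The following is a minimal sketch of the general idea, not Superlinked's implementation: models load on first request, and when a new model would exceed a memory budget, the least recently used ones are evicted first. The `ModelCache` class, the `fake_loader` function, the model names, and the sizes are all hypothetical, and least-recently-used is only one possible eviction policy.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple


class ModelCache:
    """Keeps loaded models under a memory budget, evicting the least recently used."""

    def __init__(self, budget_mb: int, loader: Callable[[str], Tuple[object, int]]):
        self.budget_mb = budget_mb
        self.loader = loader  # hypothetical loader: returns (model, size_mb)
        self._models: "OrderedDict[str, Tuple[object, int]]" = OrderedDict()
        self.evicted: List[str] = []  # record of evictions, for inspection

    @property
    def used_mb(self) -> int:
        return sum(size for _, size in self._models.values())

    def get(self, name: str) -> object:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name][0]
        model, size_mb = self.loader(name)  # load on first request
        if size_mb > self.budget_mb:
            raise MemoryError(f"{name} ({size_mb} MB) exceeds the budget")
        # Evict least recently used models until the new one fits.
        while self.used_mb + size_mb > self.budget_mb:
            old_name, _ = self._models.popitem(last=False)
            self.evicted.append(old_name)
        self._models[name] = (model, size_mb)
        return model

    def loaded(self) -> List[str]:
        return list(self._models)


# Assumed sizes for three made-up models.
SIZES_MB = {"embedder-a": 400, "reranker-b": 500, "classifier-c": 300}


def fake_loader(name: str) -> Tuple[object, int]:
    # Stand-in for real weight loading; returns a placeholder object.
    return (f"<{name}>", SIZES_MB[name])


cache = ModelCache(budget_mb=1000, loader=fake_loader)
cache.get("embedder-a")
cache.get("reranker-b")
cache.get("embedder-a")    # refreshes embedder-a, so reranker-b is now oldest
cache.get("classifier-c")  # does not fit alongside both, so reranker-b is evicted
print(cache.loaded())      # ['embedder-a', 'classifier-c']
```

A real engine would measure GPU memory rather than trust declared sizes, and would have to avoid evicting a model that is serving in-flight requests; the sketch leaves both out.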

Speaker info:
- https://www.linkedin.com/in/filipmakraduli/

Timestamps:
0:00 Introduction and the gap in small model inference
0:53 Moving from research to building inference infrastructure
2:54 Introduction of the Superlinked inference engine
4:34 The importance of context management for agents
7:03 Misconceptions: Why more GPUs isn't the only answer
9:33 The "Yin and Yang" of inference: Model support and infrastructure
10:43 The challenge of supporting diverse model architectures
14:33 Deep dive into infrastructure and scalability
16:10 Conclusion and the open-source launch of SAI