Moondream: how does a tiny vision model slap so hard? — Vikhyat Korrapati
Takeaway
A tiny VLM can punch above its weight when scope is narrow and synthetic data is meticulously engineered rather than scraped from larger models.
Summary
- Moondream is a sub-2B-parameter Apache 2.0 vision-language model that matches LLaVA 1.5 7B on VQAv2/GQA; built by fusing Google's SigLIP encoder with Microsoft's Phi-1.5 text model.
- Scope is deliberately constrained to a developer tool (captioning, VQA, object localization). Solving MathVista is an explicit non-goal, and that discipline drives data and benchmark choices.
- Training quality dominated by ~35M-image synthetic dataset; avoids GPT-4 distillation to prevent hallucinated detail; uses Localized Narratives + ~20 LLM calls per image for high-quality captions.
- Spent 1-2 orders of magnitude more compute generating training data than training the model itself.
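As a rough illustration of what "fusing" a vision encoder with a text model can mean, here is a minimal sketch in plain Python. It assumes the common LLaVA-style recipe: patch embeddings from the vision encoder pass through a projection into the text model's embedding space and are then prepended to the prompt's token embeddings. The summary does not say how Moondream connects the two, so the linear projection, the toy dimensions, and the names `make_projection`, `project`, and `fuse` are all hypothetical.

```python
import random

# Hypothetical toy dimensions; real encoders use far larger widths.
VISION_DIM = 8  # width of each image-patch embedding from the vision encoder
TEXT_DIM = 6    # width of the text model's token embeddings


def make_projection(in_dim, out_dim, seed=0):
    """Build a random in_dim x out_dim matrix (stand-in for a learned projector)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(patches, weights):
    """Map each patch embedding into the text embedding space with a linear layer."""
    out_dim = len(weights[0])
    return [
        [sum(p[i] * weights[i][j] for i in range(len(p))) for j in range(out_dim)]
        for p in patches
    ]


def fuse(patch_embeddings, token_embeddings, weights):
    """Prepend projected image patches to the prompt's token embeddings."""
    return project(patch_embeddings, weights) + token_embeddings


rng = random.Random(1)
patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]  # 4 image patches
tokens = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(3)]     # 3 prompt tokens

# The text model would consume this single sequence: 4 image slots, then 3 text slots.
sequence = fuse(patches, tokens, make_projection(VISION_DIM, TEXT_DIM))
print(len(sequence), len(sequence[0]))  # prints "7 6"
```

In this sketch the text model never sees pixels, only a sequence of vectors in its own embedding width, which is why a small text model can be reused largely unchanged.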
vision-language-models, synthetic-data, open-source
Original description
Psst! Wanna learn how to build an AI model that punches way above its weight? Beats models 4 times its size and competes with the models from Meta, Google and OpenAI? In this talk I’ll spill the beans on how I pulled it off, and how you can too. I’ll share my unexpected journey that led to the creation of Moondream, a tiny open-source vision language model that kicks ass. I’ll share my journey and the technical hurdles that I faced along the way. I’ll also explain why small models are the future of AI. Join me for a story of accidental innovation, the democratization of AI, and how sometimes, thinking small can lead to big results.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025
About Vikhyat
Vik's work focuses on developing efficient AI models that can run on resource-constrained devices without sacrificing performance. His mission is to democratize AI technology, making advanced computer vision accessible to developers and businesses of all sizes. Prior to his current endeavors, Vik spent 9 years at AWS, gaining valuable experience in large-scale computing systems.