🎨 Multimodal
Vision-language models, video understanding, image generation, multimodal agents. GPT-4V, Claude vision, Gemini, open-source VLMs.
The workflow
flowchart LR
A[Image / Video<br/>input] --> B[Vision encoder<br/>ViT / CLIP]
B --> C[Cross-attention<br/>to LLM tokens]
C --> D[LLM generates<br/>text or actions]
D --> E{Output mode}
E -->|Text| F[Description /<br/>Q&A]
E -->|Image| G[Diffusion<br/>decoder]
E -->|Action| H[Robot / GUI<br/>agent]
The interesting frontier (UIs, robots, video) opens up once the vision encoder can reliably ground language.
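The middle step of the flowchart, where LLM tokens cross-attend to vision-encoder outputs, can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the vectors are random stand-ins for real ViT/CLIP patch embeddings and LLM hidden states, the learned query/key/value projections are omitted, and the names `cross_attention`, `image_patches`, and `text_tokens` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(text_tokens, image_patches):
    """Each text token (query) attends over all image patches.

    Assumption: patches serve as both keys and values, and the learned
    Q/K/V projection matrices of a real model are left out for brevity.
    """
    dim = len(image_patches[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in text_tokens:
        # Scaled dot-product score between this token and every patch.
        scores = [dot(q, k) * scale for k in image_patches]
        w = softmax(scores)
        # Weighted average of patch vectors: the visual context for this token.
        mixed = [
            sum(w_i * patch[d] for w_i, patch in zip(w, image_patches))
            for d in range(dim)
        ]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights


# Toy data: 4 image patches and 3 text tokens, all 8-dimensional.
rng = random.Random(0)
image_patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]

fused, attn = cross_attention(text_tokens, image_patches)
print(len(fused), len(fused[0]))  # one fused vector per text token
```

Each row of `attn` is a distribution over patches, which is what lets a text token "look at" the relevant image region. Note that some VLMs skip cross-attention and instead project image features directly into the LLM's input sequence; the sketch follows the flowchart's framing.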
Key takeaways
Videos (18)
Moondream: how does a tiny vision model slap so hard? — Vikhyat Korrapati
A tiny VLM can punch above its weight when scope is narrow and synthetic data is meticulously engineered rather than scraped from larger models.
Vision AI in 2025 — Peter Robicheaux, Roboflow
Vision needs DINOv2-style self-supervised pre-training feeding into transformer detectors to unlock language-style scaling, and harder benchmarks like RF100-VL to measure it.
From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet
GPT-4o collapses audio+vision+text into one omni-model, enabling natural real-time human-computer interaction at half the price of GPT-4 Turbo.
Waymo's EMMA: Teaching Cars to Think - Jyh Jing Hwang, Waymo
Multimodal LLMs like Gemini turn driving into a generalist reasoning task, and Waymo's EMMA shows camera-only end-to-end planning can beat heavy specialized stacks.
Veo 3 for Developers — Paige Bailey, Google DeepMind
Veo 3 + Imagen 4 + Lyria 2 collapse video, image and music generation into one developer-accessible stack with native multimodal control and SynthID safeguards.
Real-time Experiments with an AI Co-Scientist - Stefania Druga, fmr. Google Deepmind
AI co-scientists work best when they observe real-world experiments live via sensor multimodality, not just retrospective paper analysis.
FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs
Black Forest Labs is pushing open-source visual AI toward 'visual intelligence' — combined generation, editing, and multi-reference composition — with Flux 2 as the latest base.
Building Generative Image & Video models at Scale - Sander Dieleman, Google DeepMind
Training frontier image/video diffusion models like Veo is mostly a data-curation, sampling, and distillation problem, not a modeling problem.
ComfyUI Full Workshop — first workshop from ComfyAnonymous himself!
ComfyUI's node-based architecture plus embedded-metadata sharability turned it into the de-facto open visual-gen platform with millions of users and instant cross-model interoperability.
The Multimodal Future of Education: Stefania Druga
Multimodal AI tutors are inevitable; building child-facing tools like Cognimates teaches AI literacy and critical thinking before mental models calcify.
AI Music Generation, From Prompt to Production: Phlo Young
AI music tools like Suno and Udio reward unintuitive prompt craft and pen-name experimentation more than musical training.
See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman
Today you can chain GPT-4V and DALL-E 3 through text to mimic unified multimodal reasoning; tomorrow that bridging code disappears.
Robots as professional Chefs - Nikhil Abraham, CloudChef
Generic dual-arm robots plus cooking-specific multimodal embeddings can replicate professional chef labor in real kitchens at below the cost of human labor.
Google Photos Magic Editor: GenAI Under the Hood of a Billion-User App - Kelvin Ma, Google Photos
A billion-user GenAI app is really a system of on-device models with bespoke benchmarks, hardware co-design, and model management — generative quality alone isn't enough.
The State of Generative Media - Gorkem Yurtseven, FAL
Generative media is a brand-new market — ads, e-commerce try-on, and (especially) video are the early product-market fits driving its explosion.
120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson
CLIP's text/image embedding similarity enables open-ended multimodal games and apps with a fraction of the code traditional CV required.
Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD
Multi-modal apps win on perceived responsiveness via parallelism, structured streaming, and early playback while later assets still generate.
Multi model multimodal and multi agent innovations in Azure AI: Cedric Vidal
Azure AI Studio centralizes a vast multi-modal multi-vendor model catalog with end-to-end multimodal demos like menu reasoning and voice-preserving video translation.
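The CLIP entry in the list above rests on one idea: text and images are embedded into a shared space, and matching reduces to cosine similarity. The sketch below shows only that scoring step. The three-dimensional vectors are made-up stand-ins for real CLIP embeddings, and `best_caption` is a hypothetical helper, not part of any CLIP library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    scores = {text: cosine(image_vec, vec) for text, vec in caption_vecs.items()}
    return max(scores, key=scores.get), scores


# Assumption: in a real app these vectors would come from CLIP's image and
# text encoders; here they are hand-written toy values.
image_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "a drawing of a cat": [1.0, 0.0, 0.0],
    "a drawing of a car": [0.0, 1.0, 0.0],
    "a drawing of a tree": [0.0, 0.0, 1.0],
}

label, scores = best_caption(image_vec, caption_vecs)
print(label)
```

Because the caption set is just a dictionary of strings, adding a new category means adding a new prompt rather than retraining a classifier, which is the "fraction of the code" point that summary makes.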