🎨 Multimodal

Vision-language models, video understanding, image generation, multimodal agents. GPT-4V, Claude vision, Gemini, open-source VLMs.

18 videos · multimodal · video-generation · veo · robotics · diffusion · vision-language-models

The workflow

flowchart LR
    A[Image / Video<br/>input] --> B[Vision encoder<br/>ViT / CLIP]
    B --> C[Cross-attention<br/>to LLM tokens]
    C --> D[LLM generates<br/>text or actions]
    D --> E{Output mode}
    E -->|Text| F[Description /<br/>Q&A]
    E -->|Image| G[Diffusion<br/>decoder]
    E -->|Action| H[Robot / GUI<br/>agent]

The interesting frontier (UIs, robots, video) opens up once the vision encoder can reliably ground language.
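
As a rough illustration of the cross-attention step in the diagram, the sketch below lets each text-token vector attend over a few image-patch vectors with plain scaled dot-product attention. It is a minimal pure-Python sketch under stated assumptions: there are no learned query/key/value projections, only one head, and the function names and toy vectors are made up for illustration rather than taken from any model listed here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, image_patches):
    """Each text token (query) attends over image patches (keys and values).

    Hypothetical simplification: no learned projections, single head.
    """
    dim = len(image_patches[0])
    outputs, all_weights = [], []
    for query in text_tokens:
        # Scaled dot product between the query and every patch.
        scores = [
            sum(q * k for q, k in zip(query, patch)) / math.sqrt(dim)
            for patch in image_patches
        ]
        weights = softmax(scores)
        # Weighted average of the patch vectors, one value per dimension.
        mixed = [
            sum(w * patch[d] for w, patch in zip(weights, image_patches))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data: 2 text tokens, 3 image patches, 4 dimensions each.
patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
tokens = [[4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
fused, attn = cross_attention(tokens, patches)
```

Here the first token puts most of its attention weight on the first patch and the second token on the third, which is the sense in which language tokens get "grounded" in specific image regions.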

Key takeaways

A tiny VLM can punch above its weight when scope is narrow and synthetic data is meticulously engineered rather than scraped from larger models.
Vision needs DINOv2-style self-supervised pre-training feeding into transformer detectors to unlock language-style scaling, and harder benchmarks like RF100-VL to measure it.
GPT-4o collapses audio+vision+text into one omni-model, enabling natural real-time human-computer interaction at half the price of GPT-4 Turbo.
Multimodal LLMs like Gemini turn driving into a generalist reasoning task, and Waymo's EMMA shows camera-only end-to-end planning can beat heavy specialized stacks.
Veo 3 + Imagen 4 + Lyria 2 collapse video, image and music generation into one developer-accessible stack with native multimodal control and SynthID safeguards.
AI co-scientists work best when they observe real-world experiments live via sensor multimodality, not just retrospective paper analysis.

Videos (18)

Moondream: how does a tiny vision model slap so hard? — Vikhyat Korrapati

A tiny VLM can punch above its weight when scope is narrow and synthetic data is meticulously engineered rather than scraped from larger models.

17.9K views · Nov 14, 2024

Vision AI in 2025 — Peter Robicheaux, Roboflow

Vision needs DINOv2-style self-supervised pre-training feeding into transformer detectors to unlock language-style scaling, and harder benchmarks like RF100-VL to measure it.

11.9K views · Aug 03, 2025

From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet

GPT-4o collapses audio+vision+text into one omni-model, enabling natural real-time human-computer interaction at half the price of GPT-4 Turbo.

10.9K views · Jul 10, 2024

Waymo's EMMA: Teaching Cars to Think - Jyh Jing Hwang, Waymo

Multimodal LLMs like Gemini turn driving into a generalist reasoning task, and Waymo's EMMA shows camera-only end-to-end planning can beat heavy specialized stacks.

6.4K views · Jul 26, 2025

Veo 3 for Developers — Paige Bailey, Google DeepMind

Veo 3 + Imagen 4 + Lyria 2 collapse video, image and music generation into one developer-accessible stack with native multimodal control and SynthID safeguards.

4.8K views · Jun 21, 2025

Real-time Experiments with an AI Co-Scientist - Stefania Druga, fmr. Google Deepmind

AI co-scientists work best when they observe real-world experiments live via sensor multimodality, not just retrospective paper analysis.

4.4K views · Jul 28, 2025

FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs

Black Forest Labs is pushing open-source visual AI toward 'visual intelligence' — combined generation, editing, and multi-reference composition — with Flux 2 as the latest base.

4.1K views · May 08, 2026

Building Generative Image & Video models at Scale - Sander Dieleman, Google DeepMind

Training frontier image/video diffusion models like Veo is mostly a data-curation, sampling, and distillation problem, not a modeling problem.

3.8K views · Apr 21, 2026

ComfyUI Full Workshop — first workshop from ComfyAnonymous himself!

ComfyUI's node-based architecture plus embedded-metadata sharability turned it into the de-facto open visual-gen platform with millions of users and instant cross-model interoperability.

3.5K views · Jul 19, 2025

The Multimodal Future of Education: Stefania Druga

Multimodal AI tutors are inevitable; building child-facing tools like Cognimates teaches AI literacy and critical thinking before mental models calcify.

2.9K views · Oct 24, 2024

AI Music Generation, From Prompt to Production: Phlo Young

AI music tools like Suno and Udio reward unintuitive prompt craft and pen-name experimentation more than musical training.

2.6K views · Feb 11, 2025

See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman

Today you can chain GPT-4V and DALL-E 3 through text to mimic unified multimodal reasoning; tomorrow that bridging code disappears.

2.4K views · Oct 24, 2023

Robots as professional Chefs - Nikhil Abraham, CloudChef

Generic dual-arm robots plus cooking-specific multimodal embeddings can replicate professional chef labor at sub-human cost in real kitchens.

2.3K views · Jul 20, 2025

Google Photos Magic Editor: GenAI Under the Hood of a Billion-User App - Kelvin Ma, Google Photos

A billion-user GenAI app is really a system of on-device models with bespoke benchmarks, hardware co-design, and model management — generative quality alone isn't enough.

2.3K views · Jul 19, 2025

The State of Generative Media - Gorkem Yurtseven, FAL

Generative media is a brand-new market — ads, e-commerce try-on, and (especially) video are the early product-market fits driving its explosion.

1.5K views · Jul 16, 2025

120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson

CLIP's text/image embedding similarity enables open-ended multimodal games and apps with a fraction of the code traditional CV required.

1.3K views · Nov 22, 2023
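
The core mechanic behind that summary can be sketched as ranking candidate captions by cosine similarity to an image embedding. The vectors below are made-up stand-ins: in a real app they would come from CLIP's image and text encoders, which this sketch does not call, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_embedding, caption_embeddings):
    """Return captions sorted from most to least similar to the image."""
    scored = [
        (cosine_similarity(image_embedding, vec), caption)
        for caption, vec in caption_embeddings.items()
    ]
    return [caption for _, caption in sorted(scored, reverse=True)]


# Toy 3-dimensional embeddings (assumed values, not real CLIP outputs).
image_vec = [0.9, 0.1, 0.2]
captions = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
    "a drawing of a tree": [0.2, 0.1, 0.9],
}
ranking = rank_captions(image_vec, captions)
```

Because the captions are just text, the set of "classes" can change at runtime without retraining, which is what makes open-ended games practical.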

Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD

Multi-modal apps win on perceived responsiveness via parallelism, structured streaming, and early playback while later assets still generate.

1.2K views · Jan 23, 2024
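
A minimal sketch of the parallelism and early-playback idea, written in Python with `asyncio` rather than the TypeScript and ModelFusion stack the talk uses. The `generate_asset` function is a hypothetical stand-in that only sleeps; a real app would call text, image, and speech models there.

```python
import asyncio


async def generate_asset(name, seconds):
    """Stand-in for a slow model call (hypothetical; it just sleeps)."""
    await asyncio.sleep(seconds)
    return name


async def build_story():
    """Start all generations at once and hand over each asset as soon as it is ready."""
    jobs = [
        generate_asset("title", 0.02),
        generate_asset("narration audio", 0.2),
        generate_asset("cover image", 0.1),
    ]
    played = []
    for finished in asyncio.as_completed(jobs):
        asset = await finished
        # A real app would stream this asset to the client here.
        played.append(asset)
    return played


playback_order = asyncio.run(build_story())
```

The fast assets reach the user while the slow ones are still generating, so total wait is bounded by the slowest job instead of the sum of all jobs.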

Multi model multimodal and multi agent innovations in Azure AI: Cedric Vidal

Azure AI Studio centralizes a vast multi-modal multi-vendor model catalog with end-to-end multimodal demos like menu reasoning and voice-preserving video translation.

475 views · Feb 06, 2025