Building Generative Image & Video models at Scale - Sander Dieleman, Google DeepMind

3.8K views · Apr 21, 2026 · 40:46 min

Takeaway

Training frontier image/video diffusion models like VEO is mostly a data-curation, sampling, and distillation problem — not a modeling problem.

Summary

Sander Dieleman (Google DeepMind, on the VEO / Nano Banana team) gives a behind-the-scenes tour of training diffusion models for audio-visual generation at scale.
Covers eight stages: data curation, latent representations, diffusion modeling, architecture, training infra, sampling, distillation (fewer steps, not smaller models), and control signals.
Stresses data curation as the highest-leverage investment, often a better use of time than model tweaks; treating the dataset as fixed is a habit from academic benchmark culture that researchers must unlearn.
Diffusion's flexible sampling regime gives audio-visual generation knobs that auto-regressive language models lack.
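
To make the "knobs" concrete, here is a minimal toy sketch, not the method used for VEO. It assumes that "guidance" refers to classifier-free guidance, a widely used technique, and that sampling is a simple Euler loop over a velocity prediction. The 1-D `toy_velocity` function, the targets, and all parameter names are hypothetical stand-ins for a learned network.

```python
import random


def toy_velocity(x, t, target):
    """Exact velocity for data concentrated at `target`, assuming the
    interpolation x_t = (1 - t) * x0 + t * noise (hypothetical toy model)."""
    return (x - target) / t


def sample(num_steps, guidance_scale, cond_target=2.0, uncond_target=0.0, seed=0):
    """Euler sampler from t = 1 (pure noise) down to t = 0 (data).

    num_steps and guidance_scale are the two sampling-time knobs;
    neither requires retraining the (here, analytic) model.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps, 0, -1):
        t = i / num_steps
        v_cond = toy_velocity(x, t, cond_target)      # prediction with the prompt
        v_uncond = toy_velocity(x, t, uncond_target)  # prediction without it
        # Classifier-free guidance: extrapolate from unconditional to conditional.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x - dt * v
    return x


if __name__ == "__main__":
    for scale in (0.0, 1.0, 3.0):
        print(scale, sample(num_steps=8, guidance_scale=scale))
```

In this toy, a guidance scale of 1 ends at the conditional target, and larger scales push the sample past it, away from the unconditional target. The step count does not change the endpoint here because the velocity is exact; with a learned model it trades compute against quality, which is the cost that step-reducing distillation targets.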

diffusion · video-generation · veo

Original description

https://sander.ai/2025/04/15/latents.html

Speaker info:
- https://sander.ai/
- https://github.com/benanne
- https://www.linkedin.com/in/sanderdieleman
- https://x.com/sedielem

Timestamps
0:00 Introduction
2:55 Data Curation
4:02 Representation
9:39 Modeling: Diffusion Mechanism
20:01 Network Architecture
22:25 Training at Scale
23:33 Sampling & Guidance
28:03 Distillation
30:03 Control Signals