Building Generative Image & Video models at Scale - Sander Dieleman, Google DeepMind
Takeaway
Training frontier image/video diffusion models like Veo is mostly a data-curation, sampling, and distillation problem, not a modeling problem.
Summary
- Sander Dieleman (Google DeepMind, on the Veo / Nano Banana team) gives a behind-the-scenes tour of training diffusion models for audio-visual generation at scale.
- Covers eight stages: data curation, latent representations, diffusion modeling, network architecture, training infrastructure, sampling, distillation (fewer sampling steps, not smaller models), and control signals.
- Stresses that data curation is the highest-leverage activity, often a better use of time than model tweaks, and notes that prioritizing it means unlearning habits from academic benchmark culture.
- Diffusion's flexible sampling regime gives audio-visual generation knobs that auto-regressive language models lack.
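To make the "knobs" in the last point concrete, here is a minimal sketch of two that are commonly exposed at sampling time: the number of sampling steps and a guidance scale. It assumes classifier-free guidance and a simple Euler sampler over a linear noise schedule; the summary does not say which guidance method or sampler the talk describes. The toy denoisers, `COND_TARGET`, `UNCOND_TARGET`, and every function name are hypothetical stand-ins, not anything from Veo.

```python
import numpy as np

# Hypothetical toy setup: the "data" is a single point, so the ideal denoiser
# always returns that point regardless of the noisy input. A real model would
# be a trained neural network.
COND_TARGET = np.array([1.0, -1.0])   # what the conditional denoiser predicts
UNCOND_TARGET = np.zeros(2)           # what the unconditional denoiser predicts


def denoise(x, sigma, conditional):
    """Toy denoiser (assumption): ignores x and sigma, returns a fixed target."""
    return COND_TARGET if conditional else UNCOND_TARGET


def guided_denoise(x, sigma, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction towards the conditional one."""
    d_uncond = denoise(x, sigma, conditional=False)
    d_cond = denoise(x, sigma, conditional=True)
    return d_uncond + guidance_scale * (d_cond - d_uncond)


def sample(num_steps, guidance_scale, sigma_max=10.0, seed=0):
    """Euler sampler over a linear noise schedule from sigma_max down to 0.

    num_steps and guidance_scale are chosen at sampling time, with no
    retraining needed.
    """
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_max, 0.0, num_steps + 1)
    x = sigma_max * rng.standard_normal(2)  # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = guided_denoise(x, sigma, guidance_scale)
        # Step along the direction from the denoised estimate to x.
        x = x + (sigma_next - sigma) * (x - d) / sigma
    return x


few_steps = sample(num_steps=4, guidance_scale=1.0)
strong_guidance = sample(num_steps=50, guidance_scale=2.0)
```

In this toy setting a guidance scale of 1 recovers the conditional target, while a scale of 2 overshoots it, which mirrors how larger scales push samples harder towards the conditioning signal. With a real model, more steps usually trade compute for quality; that trade-off does not show up here because the toy denoiser is exact.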
Tags: diffusion, video-generation, veo
Original description
https://sander.ai/2025/04/15/latents.html

Speaker info:
- https://sander.ai/
- https://github.com/benanne
- https://www.linkedin.com/in/sanderdieleman
- https://x.com/sedielem

Timestamps
0:00 Introduction
2:55 Data Curation
4:02 Representation
9:39 Modeling: Diffusion Mechanism
20:01 Network Architecture
22:25 Training at Scale
23:33 Sampling & Guidance
28:03 Distillation
30:03 Control Signals