What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha

9.3K views · Jul 28, 2025 · 17:47 min

Takeaway

Humanoid foundation models combine internet video + simulation + scarce teleop data with a fast/slow dual-system architecture to generalize across robot embodiments.

Summary

NVIDIA's GR00T N1 is a 2B-parameter open-source humanoid foundation model, announced at GTC and built for fine-tuning across robot embodiments.
Robotics lacks internet-scale training data, so the team uses a data pyramid: a small amount of real teleoperation data, synthetic simulation data (Omniverse), and unstructured human video, with DreamGen world foundation models used to multiply trajectories.
The architecture follows a Kahneman-inspired System 1 / System 2 split: System 2 plans slowly, while System 1 executes at ~120 Hz, turning image, state, and language inputs into continuous robot-joint action vectors.
The three-computer NVIDIA stack: OVX for simulation, DGX for training, and AGX (Jetson) for edge deployment.
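
To make the fast/slow split concrete, here is a minimal Python sketch of a dual-rate control loop. It is a toy illustration, not GR00T N1's implementation: the 10 Hz System 2 rate, the 4-joint action size, and the names `Observation`, `system2_plan`, `system1_act`, and `run` are all assumptions made for the example. Only the ~120 Hz System 1 rate and the image + state + language inputs come from the summary above.

```python
from dataclasses import dataclass
from typing import List, Tuple

FAST_HZ = 120   # System 1 rate quoted in the summary (approximate)
SLOW_HZ = 10    # ASSUMED System 2 rate, chosen only for illustration
ACTION_DIM = 4  # hypothetical number of joints


@dataclass
class Observation:
    image: List[float]   # stand-in for camera pixels
    state: List[float]   # stand-in for joint positions
    instruction: str     # natural-language command


def system2_plan(obs: Observation) -> List[float]:
    """Slow path (toy logic): summarize image + language into a context vector."""
    brightness = sum(obs.image) / len(obs.image)
    return [brightness, float(len(obs.instruction.split()))]


def system1_act(obs: Observation, context: List[float]) -> List[float]:
    """Fast path (toy logic): map state + latest context to a continuous joint action."""
    gain = 0.01 * context[0]
    return [gain * (1.0 - s) for s in obs.state[:ACTION_DIM]]


def run(seconds: float, obs: Observation) -> Tuple[List[List[float]], int]:
    """Simulate the loop tick by tick; returns (actions, number of System 2 calls)."""
    ticks_per_plan = FAST_HZ // SLOW_HZ
    context: List[float] = []
    replans = 0
    actions: List[List[float]] = []
    for tick in range(int(seconds * FAST_HZ)):
        if tick % ticks_per_plan == 0:       # System 2 refreshes the context
            context = system2_plan(obs)
            replans += 1
        action = system1_act(obs, context)   # System 1 runs on every tick
        obs.state = [s + a for s, a in zip(obs.state, action)]  # toy state update
        actions.append(action)
    return actions, replans


if __name__ == "__main__":
    obs = Observation(image=[0.5] * 16, state=[0.0] * ACTION_DIM,
                      instruction="pick up the cup")
    actions, replans = run(1.0, obs)
    print(len(actions), replans)  # 120 10
```

The point of the sketch is the scheduling: the fast policy emits an action on every tick and reuses the most recent slow-path context, so the slow planner never blocks motor control.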

robotics · humanoid · foundation-models

Original description

Foundation models don’t just write or draw anymore—they’re starting to move.

GR00T N1 is NVIDIA’s open Vision-Language-Action (VLA) foundation model for humanoid robots. Built with a dual-system architecture, it combines a System 2 module for high-level reasoning with a System 1 module for real-time, fluid motor control. It’s trained end-to-end on an impressive mix of data, from human videos to robot trajectories to synthetic simulations, and deployed on a full-sized humanoid robot performing bimanual manipulation tasks in the real world.

This talk is a high-level, beginner-friendly overview of GR00T N1:
- What makes a robot foundation model different from an LLM or vision model
- How GR00T’s architecture is inspired by cognitive systems
- Why grounding language, vision, and action together unlocks new generalist capabilities

If you’ve ever wondered how large-scale AI is crossing over into the physical world, this session will get you up to speed—no robotics PhD required.

About Annika Brundyn
Annika Brundyn is a Senior Solutions Architect at NVIDIA focused on deploying generative AI systems in the real world. She works at the intersection of inference infrastructure, reasoning models, and retrieval pipelines, and has contributed to flagship projects like NVIDIA’s NeMo Retriever and the GR00T vision-language-action model. Her experience spans frontier model research and enterprise-grade deployment. She spends a lot of time helping models make fewer “creative” mistakes in production.

About Aastha Jhunjhunwala
Aastha Jhunjhunwala is a Solutions Architect at NVIDIA, focused on building optimized generative AI applications across industries. She works at the intersection of large-scale LLM pretraining, large language model inference, and NVIDIA’s full-stack generative AI infrastructure. Aastha has helped enterprises scale LLM workflows—from training models with billions of parameters to serving them efficiently with high-throughput inference. When she’s not working with language models, you’ll find her deep in the mountains, trading tokens for trail markers.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter