
Gemma 4 Deep Dive — Cassidy Hardin, Researcher, Google DeepMind

33.3K views · Apr 27, 2026 · 19:02 min
Takeaway

Gemma 4 brings frontier-tier reasoning, MoE efficiency, and 256K context to fully open Apache 2.0 weights that run on consumer hardware.

Summary

  • Google DeepMind's Cassidy Hardin walks through Gemma 4's four sizes: a 31B dense, a 26B MoE (3.9B active params, 128 experts, 8 active), and two on-device 'effective' models.
  • The 31B dense model ranks #3 on the LMSYS arena and beats models 20x its size. It has a 256K context window and native support for function calling, structured JSON output, and thinking.
  • The family ships under the Apache 2.0 license and is purpose-built for on-device deployment on phones, iPads, and laptops, with audio support.
  • The 26B MoE is the first MoE in the Gemma line, optimized for cheap inference while preserving large-model performance.
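
The "128 experts, 8 active" figure describes sparse routing: a router scores every expert for each token and only the top-scoring few are run. The sketch below illustrates that idea in plain Python. Only the numbers 128 and 8 come from the summary; the toy experts, the random router logits, and the choice to softmax over the selected logits are assumptions made for illustration, not a description of Gemma 4's actual router.

```python
import math
import random

NUM_EXPERTS = 128  # from the talk summary
TOP_K = 8          # experts active per token, from the talk summary


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, experts, router_logits):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    out = [0.0] * len(token)
    for idx, w in route_token(router_logits):
        y = experts[idx](token)
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Hypothetical stand-ins for feed-forward experts: expert i scales its input by i + 1.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

# Random logits stand in for a learned router (illustrative only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits)
output = moe_layer([1.0, 2.0], experts, logits)
print(f"{len(selected)} of {NUM_EXPERTS} experts ran for this token")
```

Per token, 8 of 128 experts is 6.25% of the expert pool, while 3.9B of 26B parameters is about 15%. The gap is presumably the parameters every token uses regardless of routing, such as attention and embeddings.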
Tags: gemma, open-models, google-deepmind
Original description
Open models are getting smaller, faster, and far more capable. In this talk, Cassidy Hardin walks through the latest advances in the Gemma family, with a focus on Gemma 4 and what it enables for developers building on-device and open-weight AI systems. She covers the architecture behind Gemma’s dense, effective, and mixture-of-experts models, including improvements to attention, multimodal support for text, vision, and audio, and the design decisions that make strong reasoning, coding, and agentic workflows possible at practical sizes.

Speaker info:
- https://uk.linkedin.com/in/cassidyhardin

Timestamps:
00:00:28 - Introduction to the Gemma 4 model family and its four size categories
00:01:54 - Shift to Apache 2.0 licensing for developer accessibility
00:02:25 - Deep dive into the 31B dense reasoning and 26B mixture-of-experts (MoE) models
00:03:30 - Overview of on-device effective models (2B and 4B) with multimodal support
00:04:21 - Architectural updates: interleaved local/global attention and grouped query attention
00:06:51 - Explanation of the new MoE architecture (128 experts, 8 active)
00:07:44 - Implementation of Per Layer Embeddings (PLE) to optimize on-device memory
00:11:06 - Multimodal advances: variable aspect ratios and resolutions for vision encoders
00:16:31 - Audio processing enhancements via conformer architecture and audio tokenizers
00:18:07 - Getting started: self-hosting (Hugging Face, Ollama) and cloud deployment (Vertex AI)
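
The 00:04:21 segment names two attention techniques: interleaved local/global attention and grouped query attention. The sketch below shows what those terms mean in general. The layer ratio, window size, and head counts are hypothetical toy values, not Gemma 4's configuration, which this page does not state.

```python
def causal_mask(seq_len):
    """Global attention: query q may attend to every key k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local attention: query q attends only to the last `window` keys, itself included."""
    return [[k <= q and q - k < window for k in range(seq_len)]
            for q in range(seq_len)]


def layer_pattern(num_layers, local_per_global):
    """Interleave layers: `local_per_global` local layers, then one global layer."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Grouped query attention: consecutive query heads share one key/value head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical toy sizes, chosen only to keep the example small.
pattern = layer_pattern(num_layers=6, local_per_global=2)
local = sliding_window_mask(seq_len=5, window=2)
full = causal_mask(seq_len=5)
kv_map = [kv_head_for(h, num_query_heads=8, num_kv_heads=2) for h in range(8)]

print(pattern)
print(kv_map)
```

In this sketch a local layer only ever looks at the most recent `window` keys, so its key/value cache can stay a fixed size as the context grows, and grouped query attention shrinks the cache further by storing fewer key/value heads than query heads.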