Accelerating AI on Edge — Chintan Parikh and Weiyi Wang, Google DeepMind

4.8K views · May 05, 2026 · 23:57 min

Takeaway

Gemma 4 brings function calling, structured outputs, and thinking mode to under-2GB on-device models, making private edge agents practical today.

Summary

The LiteRT team showcases the Gemma 4 E2B (1-2GB RAM) and E4B edge models for on-device deployment, plus Gemma 3 variants as small as 270M parameters, available on Hugging Face.
New Gemma 4 capabilities: built-in function calling, structured JSON output, chain-of-thought thinking mode, and hardware-optimized cross-platform execution, all under the Apache 2.0 license.
The Google AI Edge Gallery app demos on-device skills: Wikipedia-augmented Q&A, journaling and mood-tracking agents, photo-to-music pairing, and animal-call generation, all running on CPU or GPU (NPU support coming).
Four pillars of the case for edge: latency, privacy, offline operation, and cost, with hybrid cloud/edge setups pitched as a way to manage token economics.
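
To make the function-calling and structured-output points concrete, here is a minimal sketch of what an app might do with a model's JSON tool call. It is not taken from the talk. The tool name `get_mood_entries`, the `{"name": ..., "arguments": ...}` shape, and the sample string are all assumptions for illustration. No model or LiteRT API is invoked; the string stands in for model output.

```python
import json
from typing import Any, Callable


# Hypothetical local tool; the name and signature are illustrative only.
def get_mood_entries(days: int) -> list[str]:
    """Pretend journal lookup that never leaves the device."""
    return [f"day-{i}: calm" for i in range(1, days + 1)]


# Registry of functions the model is allowed to call.
TOOLS: dict[str, Callable[..., Any]] = {"get_mood_entries": get_mood_entries}


def dispatch_tool_call(model_output: str) -> Any:
    """Parse a JSON tool call and run the matching local function.

    Assumed shape (not a documented format):
    {"name": "<tool>", "arguments": {...}}
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    return TOOLS[name](**args)


# Stand-in for text a model might emit; no model runs here.
sample_output = '{"name": "get_mood_entries", "arguments": {"days": 2}}'
print(dispatch_tool_call(sample_output))  # ['day-1: calm', 'day-2: calm']
```

Validating the tool name and argument types before dispatch matters on-device as much as in the cloud, since a small model can still emit malformed or unexpected calls.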

edge-ai · gemma · on-device

Original description

As models get smaller and more capable, more AI workloads can move onto the device itself. In this talk, Chintan Parikh from Google DeepMind walks through what that looks like in practice, from Gemma 4 edge models and on-device agent skills to the real tradeoffs around latency, privacy, cost, and cross-platform deployment.

The session covers LiteRT, the Google AI Edge stack for running models across Android, iOS, desktop, web, and IoT, along with demos of local tool calling, structured output, reasoning, benchmarking, and hardware acceleration on CPUs, GPUs, and NPUs. If you're building on-device AI systems, this is a practical overview of the current edge stack and where it is headed.

Speaker info:
- https://www.linkedin.com/in/weiyiwang1993
- https://www.linkedin.com/in/chintansparikh