Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind

5.7K views · Apr 30, 2026 · 107:33 min · Watch on YouTube ↗

Takeaway

Gemini's free-tier API plus AI Studio is the fastest path to a multilingual conversational agent prototype without a credit card.

Summary

Hands-on workshop from Google DeepMind DevRel walking attendees through building conversational agents with the Gemini API and Google AI Studio (ai.dev).
Shows free-tier API-key creation flow at ai.dev / aistudio.com and emphasizes multilingual model behavior with audience members testing in Spanish, Romanian, Dutch, Farsi, and Czech.
Demos building voice/conversational agents directly against Gemini's multimodal API including live audio in/out.
Targets developers new to Gemini, contrasting it with paid-tier-only competitors and showing how AI Studio bridges prototyping into production API calls.

geminivoiceconversational-agents

Original description

Thor Schaeff and Philipp Schmid show how to build conversational agents with Google DeepMind's Gemini APIs, from tool-using coding agents to realtime voice interfaces. The session covers the new Interactions API, agent skills, server-side state, and the Live API workflow for streaming audio, video, and tool calls into multimodal assistants.

Speaker info:
- https://x.com/_philschmid
- https://x.com/thorwebdev

Timestamps
0:14 - Introduction and speaker introductions
6:15 - Audience interaction and project discussions
8:38 - Introduction to building conversational agents
28:17 - Discussion on Gemini Flash for coding and agentic use
36:28 - Coding agent implementation and tool calling demonstration
42:55 - Overview of the Interactions API and state management
49:05 - Introduction to the Gemini Live API
50:02 - Live Jukebox demo with music generation
54:49 - Deep dive into Gemini Flash Live features (multimodality, latency, tools)
1:06:54 - Technical setup and implementation of the Live API using WebSockets
1:25:14 - Session management and context window compression
1:26:57 - Real-world business use cases for conversational agents
1:35:02 - Multimodal grounding and handling audio inputs
1:40:00 - Discussion on personalization and speaker identification