Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind
Takeaway
Gemini's free-tier API plus AI Studio is the fastest path to a multilingual conversational agent prototype without a credit card.
Summary
- Hands-on workshop from Google DeepMind DevRel walking attendees through building conversational agents with the Gemini API and Google AI Studio (ai.dev).
- Shows the free-tier API-key creation flow at ai.dev / aistudio.google.com and emphasizes multilingual model behavior, with audience members testing in Spanish, Romanian, Dutch, Farsi, and Czech.
- Demos building voice and conversational agents directly against Gemini's multimodal API, including live audio input and output.
- Targets developers new to Gemini, contrasting it with paid-tier-only competitors and showing how AI Studio bridges prototyping into production API calls.
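As a rough illustration of what moving from an AI Studio prototype to a direct API call might look like, the sketch below builds a single text request against the Gemini REST endpoint using only the Python standard library. The model name `gemini-2.5-flash`, the helper names `build_request` and `send`, and the Spanish prompt are assumptions chosen for illustration, not details taken from the workshop.

```python
import json
import urllib.request

# Public REST root for the Gemini API (v1beta at the time of writing).
API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_request(prompt: str, api_key: str,
                  model: str = "gemini-2.5-flash") -> urllib.request.Request:
    """Build (but do not send) a generateContent request for one text prompt.

    The default model name is an assumption; use whichever model your
    API key has access to.
    """
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url=f"{API_ROOT}/models/{model}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": api_key,  # key created in AI Studio
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first candidate."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["candidates"][0]["content"]["parts"][0]["text"]
```

With a real key, `send(build_request("Hola, ¿cómo estás?", api_key))` would return the model's reply as a string; separating request construction from sending keeps the payload easy to inspect without a network call.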
gemini, voice, conversational-agents
Original description
Thor Schaeff and Philipp Schmid show how to build conversational agents with Google DeepMind's Gemini APIs, from tool-using coding agents to realtime voice interfaces. The session covers the new Interactions API, agent skills, server-side state, and the Live API workflow for streaming audio, video, and tool calls into multimodal assistants.
Speaker info:
- https://x.com/_philschmid
- https://x.com/thorwebdev
Timestamps
0:14 - Introduction and speaker introductions
6:15 - Audience interaction and project discussions
8:38 - Introduction to building conversational agents
28:17 - Discussion on Gemini Flash for coding and agentic use
36:28 - Coding agent implementation and tool calling demonstration
42:55 - Overview of the Interactions API and state management
49:05 - Introduction to the Gemini Live API
50:02 - Live Jukebox demo with music generation
54:49 - Deep dive into Gemini Flash Live features (multimodality, latency, tools)
1:06:54 - Technical setup and implementation of the Live API using WebSockets
1:25:14 - Session management and context window compression
1:26:57 - Real-world business use cases for conversational agents
1:35:02 - Multimodal grounding and handling audio inputs
1:40:00 - Discussion on personalization and speaker identification
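The session's segment on session management and context window compression can be pictured with a generic sliding-window sketch: keep the system instructions and drop the oldest turns once a token budget is exceeded. This is only an illustration of the general idea, not the Live API's actual mechanism; the `Turn` type, the `compress` function, and the word-count token estimate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "system", "user", or "model"
    text: str


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


def compress(history: list[Turn], max_tokens: int) -> list[Turn]:
    """Keep all system turns plus the newest turns that fit the budget."""
    system = [t for t in history if t.role == "system"]
    rest = [t for t in history if t.role != "system"]
    budget = max_tokens - sum(estimate_tokens(t.text) for t in system)

    kept: list[Turn] = []
    used = 0
    # Walk from newest to oldest, stopping at the first turn that does not fit.
    for turn in reversed(rest):
        cost = estimate_tokens(turn.text)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost

    kept.reverse()  # restore chronological order
    return system + kept
```

Calling `compress(history, max_tokens)` before each new request keeps a long-running conversation within a fixed budget, at the cost of forgetting the earliest exchanges.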