From Text to Vision to Voice: Exploring Multimodality with OpenAI: Romain Huet

10.9K views · Jul 10, 2024 · 23:38 min

Takeaway

GPT-4o collapses audio+vision+text into one omni-model, enabling natural real-time human-computer interaction at half the price of GPT-4 Turbo.

Summary

  • OpenAI keynote tracing the platform from GPT-3 (2020, AI Dungeon) → GPT-4 vision → GPT-4 Turbo unified vision+text → GPT-4o native audio/vision/text in one model.
  • GPT-4o is 2x faster, half the price of GPT-4 Turbo, with 5x higher rate limits; replaces the 3-model Whisper→GPT→TTS voice stack with one model preserving emotion and interruptions.
  • Live demo: voice mode whispers, recognizes a drawn Golden Gate Bridge, translates handwritten French, and reads book covers (Poor Charlie's Almanack).
  • Positions the developer platform (3M devs) as the iterative-deployment surface for OpenAI's AGI mission.
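
The contrast in the second bullet, between a three-model voice stack and a single model, can be sketched with a toy example. Everything below is hypothetical: the `Audio` class and the `transcribe`, `reply_text`, `synthesize`, and `omni` functions are illustrative stubs, not OpenAI API calls, and the "tone" field is a simplified stand-in for emotion and delivery. The point is only that a text-only middle stage drops whatever the transcript cannot carry.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Toy stand-in for an audio clip: the words plus how they were said."""
    words: str
    tone: str  # e.g. "whisper", "excited"


# --- Cascaded stack: speech-to-text -> text model -> text-to-speech ---

def transcribe(clip: Audio) -> str:
    # Hypothetical speech-to-text stage: only the words survive.
    return clip.words


def reply_text(prompt: str) -> str:
    # Hypothetical text-only model stage.
    return f"You said: {prompt}"


def synthesize(text: str) -> Audio:
    # Hypothetical text-to-speech stage: tone falls back to a default
    # because nothing upstream carried it through.
    return Audio(words=text, tone="neutral")


def cascaded(clip: Audio) -> Audio:
    return synthesize(reply_text(transcribe(clip)))


# --- Single model: audio in, audio out ---

def omni(clip: Audio) -> Audio:
    # Hypothetical single-model stage: it receives the whole clip, so the
    # reply can be conditioned on tone as well as words.
    return Audio(words=f"You said: {clip.words}", tone=clip.tone)


if __name__ == "__main__":
    question = Audio(words="tell me a secret", tone="whisper")
    print(cascaded(question))
    print(omni(question))
```

In this sketch the cascaded path always answers in the default tone, while the single-model path can mirror the whisper, which is the behavior the demo bullet describes.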
gpt-4o · multimodal · voice

Original description
The future we are building towards: featuring a demo of GPT4o Omnimodel Voice, ChatGPT Desktop, Sora, and Voice Engine all in one talk. 

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Romain
Hello! I’m a software engineer based in San Francisco, and I run Developer Relations at OpenAI.

Previously, I ran developer relations at Stripe. Prior to that I was a Senior Developer Advocate at Twitter and the first member of Twitter’s Developer Relations team outside the US. In 2014, I helped launch Fabric, our mobile developer platform, and Digits. In 2015, our developer tour led me to meet thousands of developers and entrepreneurs in more than 30 cities around the world.

Prior to Twitter, I was Co-Founder & CTO of Jolicloud, whose free operating system was designed to work on low-cost computers and connect them to the cloud. Joli OS was the first OS based on Linux, Chromium, and HTML5, paving the way towards a new generation of browser-based platforms like Chrome OS. In 2010, the Jolibook was a finalist for “Netbook of the Year” at Engadget Awards, alongside Google’s first Chromebook.
