See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman

2.4K views · Oct 24, 2023 · 18:42 min
Takeaway

Today you can chain GPT-4V and DALL-E 3 through text to mimic unified multimodal reasoning; tomorrow that bridging code disappears.

Summary

  • OpenAI's Logan Kilpatrick and Simón Fishman declare 2024 the year of multimodal models — today's capabilities (DALL-E 3, Whisper, GPT-4V) are islands connected via text as glue.
  • Future: unified multimodal models will reason across image, audio, text, and video without text-bridging boilerplate; in the meantime, the talk demonstrates orchestration patterns.
  • Demo 1: GPT-4V describes a photo -> DALL-E 3 regenerates it -> GPT-4V compares the two images, identifies differences (marble color, spider size), and revises the prompt, closing the perception/iteration loop that would otherwise need a human.
  • Suggested applications: 'find Amazon lamps matching this Instagram vibe', interior-design matching, automated visual QA.
  • All built with raw API outputs and ~50 lines of code, no prompt engineering — models doing the heavy lifting.
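
The describe -> generate -> compare -> revise loop from Demo 1 can be sketched roughly as follows. This is a minimal illustration, not the speakers' code: `refine_image`, `Round`, and the three callables are hypothetical names, and the stub functions stand in for real vision and image-generation calls (for example GPT-4V and DALL-E 3), which are assumed to be wrapped by the caller. The sketch also assumes the comparison step returns an empty string when it finds no differences.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Round:
    """One pass through the loop: the prompt used, the image produced, the critique."""
    prompt: str
    image: str
    feedback: str


def refine_image(
    reference_image: str,
    describe: Callable[[str], str],      # image -> text description
    generate: Callable[[str], str],      # text prompt -> generated image (path or URL)
    compare: Callable[[str, str], str],  # (reference, candidate) -> differences, "" if none
    max_rounds: int = 3,
) -> List[Round]:
    """Use text as the glue between a vision model and an image model."""
    prompt = describe(reference_image)
    history: List[Round] = []
    for _ in range(max_rounds):
        candidate = generate(prompt)
        feedback = compare(reference_image, candidate)
        history.append(Round(prompt, candidate, feedback))
        if not feedback:
            break  # the comparer found nothing left to fix
        # Fold the critique back into the next prompt.
        prompt = f"{prompt}\nRevise to fix these differences: {feedback}"
    return history


# Stub models so the sketch runs offline; real model calls are assumed to replace these.
def fake_describe(image: str) -> str:
    return "a spider on a marble table"


def fake_generate(prompt: str) -> str:
    return f"image<{prompt}>"


def fake_compare(reference: str, candidate: str) -> str:
    return "" if "white" in candidate else "marble should be white"


if __name__ == "__main__":
    for step in refine_image("photo.jpg", fake_describe, fake_generate, fake_compare):
        print(repr(step.prompt), "->", repr(step.feedback))
```

Passing the models in as plain callables keeps the orchestration independent of any particular SDK, which fits the talk's point that the bridging code is thin and should eventually become unnecessary.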
multimodal · gpt-4v · dalle
Original description
We're heading towards a multimodal world. OpenAI is going beyond text models into vision, voice, and image generation, and we've been busy thinking about what kinds of things developers will be able to create with them. Presenting demos and insights into the near future for AI Engineers! In this talk from OpenAI's

Recorded live in San Francisco at the AI Engineer Summit 2023. See the full schedule of talks at https://ai.engineer/summit/schedule & join us at the AI Engineer World's Fair in 2024! Get your tickets today at https://ai.engineer/worlds-fair

About Simón
Simón empowers builders to leverage OpenAI technologies in novel and impactful ways, most recently with the OpenAI Cookbook.

About Logan
Logan currently leads developer relations at OpenAI, supporting developers building with DALL-E, the API, and ChatGPT. Outside of OpenAI, Logan is the Lead Developer Community Advocate for the Julia Programming Language and a Teaching Fellow for Harvard University's Extension School course CSCI E-33A. Logan was previously an Applied Machine Learning Engineer and Software Engineer at Apple, as well as the Community Manager for the Julia Programming Language. Additionally, Logan is on the Board of Directors at NumFOCUS and was formerly on the board at DEFNA.