OpenAI presents the new ChatGPT that listens, looks and speaks

While Apple and Google transform their voice assistants into chatbots, OpenAI is transforming its chatbot into a voice assistant.

On Tuesday, the San Francisco artificial intelligence startup unveiled a new version of its ChatGPT chatbot that can receive and respond to voice commands, images and videos.

The company said the new app, based on an artificial intelligence system called GPT-4o, juggles audio, images and video significantly faster than the previous version of the technology. The application will be available starting Monday, free of charge, for both smartphones and desktop computers.

“We are looking at the future of the interaction between us and machines,” said Mira Murati, the company's chief technology officer.

The new app is part of a broader effort to combine conversational chatbots like ChatGPT with voice assistants like Google Assistant and Apple's Siri. As Google merges its Gemini chatbot with Google Assistant, Apple is preparing a new version of Siri that's more conversational.

OpenAI said it would gradually share the technology with users “over the coming weeks.” This is the first time ChatGPT has been offered as a desktop application.

Previously, the company offered similar technologies within various free and paid products. Now it has integrated them into a single system that is available in all its products.

During a webcast event, Murati and her colleagues showed off the new app as it responded to conversational voice commands, used a live video feed to analyze math problems written on a piece of paper, and read aloud funny stories it had written on the fly.

The new application cannot generate videos. But it can generate still images that represent frames of a video.

With the debut of ChatGPT in late 2022, OpenAI demonstrated that machines could handle requests more like people do. In response to conversational text prompts, it could answer questions, write term papers, and even generate computer code.

ChatGPT is not governed by a set of rules. It learned its skills by analyzing huge amounts of text collected from the Internet, including Wikipedia articles, books, and chat logs. Experts touted the technology as a possible alternative to search engines like Google and voice assistants like Siri.

Newer versions of the technology have also learned from sounds, images and videos. Researchers call this “multimodal AI.” Essentially, companies like OpenAI began combining chatbots with AI image, audio, and video generators.

(The New York Times sued OpenAI and its partner, Microsoft, in December, alleging copyright infringement of news content related to artificial intelligence systems.)

As companies combine chatbots with voice assistants, many obstacles remain. Because chatbots learn their skills from data on the Internet, they are prone to making mistakes. Sometimes they make up information entirely, a phenomenon AI researchers call “hallucination.” Those flaws are carrying over into voice assistants.

While chatbots can generate compelling language, they are less adept at performing actions like scheduling a meeting or booking a plane flight. But companies like OpenAI are working to transform them into “AI agents” that can reliably handle such tasks.

OpenAI previously offered a version of ChatGPT that could accept voice commands and respond with voice. But it was a patchwork of three different AI technologies: one that converted speech to text, another that generated a text response, and a third that converted that text into a synthetic voice.

The new application is based on a single artificial intelligence technology (GPT-4o) that can accept and generate text, sounds and images. This means the technology is more efficient and the company can afford to offer it to users for free, Murati said.

“Before, there was all this latency that was the result of three models working together,” Murati said in an interview with The New York Times. “You want to have the experience that we're having, where we can have this very natural dialogue.”