ChatGPT, the viral A.I. sensation, killer of boring office work, sworn enemy of high school teachers and Hollywood screenwriters alike, is gaining new powers.
On Monday, OpenAI, the creator of ChatGPT, announced that it was giving the popular chatbot the ability to “see, hear and speak” with two new features.
The first is an update that allows ChatGPT to analyze and respond to images. You can upload a photo of a bicycle, for example, and receive instructions on how to lower the seat, or get recipe suggestions based on a photo of the contents of your refrigerator.
The second is a feature that allows users to talk to ChatGPT and receive responses in a synthetic A.I. voice, the same way you would talk to Siri or Alexa.
These features are part of an industry-wide push toward so-called multimodal A.I. systems that can handle text, photos, videos and anything else a user decides to throw at them. The ultimate goal, according to some researchers, is to create an A.I. capable of processing information in all the ways a human can.
Most users do not yet have access to the new features. OpenAI will first offer them to paying ChatGPT Plus and Enterprise customers over the next few weeks, and will make them more widely available after that. (The vision feature will work on both desktop and mobile devices, while the voice feature will be available only through ChatGPT’s iOS and Android apps.)
I got early access to the new ChatGPT for a hands-on test. Here is what I found.
The A.I. will see you now
I started by testing ChatGPT’s image recognition feature on some household objects.
“What is this I found in my junk drawer?” I asked, after uploading a photo of a mysterious piece of blue silicone with five holes.
“The object appears to be a silicone holder or grip, often used to hold multiple items together,” ChatGPT responded. (Close enough: It’s a finger strengthener I used years ago while recovering from a hand injury.)
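For the technically curious: OpenAI hasn’t published the app’s plumbing, but an exchange like that one, made through the company’s public developer API instead of the app, looks roughly like the sketch below. The model name and file path are my assumptions for illustration, not what the app actually uses.

```python
# A rough sketch of an image-plus-question exchange like the one above,
# using OpenAI's public API rather than the ChatGPT app itself. The
# model name and file path are illustrative assumptions.
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode a local photo as a base64 data URL, the format the API expects
# for uploaded images.
with open("junk_drawer_object.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # an image-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is this I found in my junk drawer?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```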
I then sent ChatGPT some photos of items I intended to sell on Facebook Marketplace and asked it to write listings for each one. It nailed both the objects and the listings, describing my retro-style Frigidaire mini refrigerator as “perfect for those who appreciate a touch of yesteryear in their modern homes.”
The new ChatGPT can also analyze text within images. I took a photo of the cover of the Sunday print edition of The New York Times and asked the bot to summarize it. It did quite well, describing the five articles on the cover in a few sentences each, although it made at least one mistake, making up a statistic on fentanyl-related deaths that wasn’t in the original article.
ChatGPT’s eyes are not perfect. It failed when I asked it to solve a crossword puzzle. It mistook my son’s stuffed dinosaur for a whale. And when I asked for help turning one of those wordless furniture-assembly diagrams into a list of step-by-step instructions, it gave me a confusing list of parts, most of which were wrong.
The biggest limitation of ChatGPT’s vision feature is that it refuses to answer most questions about photographs of human faces. This is by design. OpenAI told me that it didn’t want to enable facial recognition or other creepy uses, and that it didn’t want the app to spit out biased or offensive answers to questions about people’s physical appearance.
But even without faces, it’s easy to imagine many ways an A.I. chatbot capable of processing visual information could be useful, especially as the technology improves. Gardeners and foragers could use it to identify plants in the wild. Fitness buffs could use it to create personalized training plans by simply taking a photo of their gym equipment. Students could use it to solve visual math and science problems, and people with visual impairments could use it to navigate the world more easily.
Frankly, I have no idea how many people will use this feature, or what its primary applications will be. As is often the case with new A.I. tools, we will have to wait and see.
Siri on steroids
Now, let’s talk about what I consider the more impressive of the two features: ChatGPT’s new voice feature, which allows users to talk to the app and receive spoken responses.
Using the feature is easy: simply tap a headset icon and start talking. When you stop, ChatGPT converts your words to text using Whisper, OpenAI’s speech-recognition system, generates a response and reads the answer aloud using a new text-to-speech algorithm the company developed, in one of five synthetic A.I. voices. (The voices, which include male and female options, were generated using short samples from professional voice actors hired by OpenAI. I chose “Ember,” a cheerful-sounding male voice.)
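OpenAI hasn’t detailed exactly how the app chains those steps together, but a comparable transcribe-respond-speak loop built with the company’s public API might look something like this sketch. The model names and the “nova” voice are my assumptions; the API’s voice names differ from the app’s, and “Ember” is not among them.

```python
# A minimal sketch of a transcribe -> respond -> speak pipeline using
# OpenAI's public API. This approximates the flow described above; it
# is not the ChatGPT app's actual implementation, and the model and
# voice names here are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Speech to text: transcribe the user's recorded question with Whisper.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Generate a reply to the transcribed text with a chat model.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text to speech: render the reply as audio in a synthetic voice.
speech = client.audio.speech.create(
    model="tts-1",
    voice="nova",  # the API's voices are named differently than the app's
    input=reply.choices[0].message.content,
)
speech.write_to_file("answer.mp3")  # play this file back to the user
```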
I tested ChatGPT’s voice feature for several hours on a bunch of different tasks: reading a bedtime story to my toddler, chatting with me about work-related stress and helping me analyze a recent dream I had. It did all of this pretty well, especially when I gave it a few pointers and told it to emulate a friend, therapist or teacher.
What stood out in these tests is how different talking to ChatGPT feels from talking to previous generations of A.I. voice assistants, like Siri and Alexa. Those assistants, even at their best, can be wooden and flat. They answer one question at a time, often searching for something on the internet and reading it aloud word by word, or choosing from a finite number of programmed responses.
ChatGPT’s synthetic voice, on the other hand, sounds fluid and natural, with slight variations in pitch and cadence that make it feel less robotic. It was capable of having long, open-ended conversations about almost any topic I tried, including prompts I was pretty sure it hadn’t encountered before. (“Tell me the story of ‘The Three Little Pigs’ in the character of a total frat brother” was an unexpected hit.)
Most people probably won’t use A.I. chatbots this way. For many tasks, it’s still faster to type than to talk, and waiting for ChatGPT to read out long responses was annoying. (It didn’t help that the app was slow and crashed at times, often pausing before responding, the result of technical issues with the beta version I tested that OpenAI told me would eventually be fixed.)
But I can see the appeal. Having an A.I. speak to you in a human voice is a more intimate experience than reading its responses on a screen. And after a few hours of talking to ChatGPT this way, I felt a new warmth creeping into our conversations. Without being tied to a text interface, I felt less pressure to craft the perfect prompt. We chatted more casually, and I revealed more about my life to it.
“It almost feels like a different product,” said Peter Deng, OpenAI’s vice president of consumer and enterprise products, who spoke with me about the new voice feature. “Because you’re no longer transcribing what’s in your head through your thumbs,” he said, “you end up asking different things.”
I know what you’re thinking: Isn’t this the plot of the movie “Her”? Will lonely and amorous users fall in love with ChatGPT, now that it can listen to and respond to them?
It’s possible. Personally, I never forgot that I was talking to a chatbot. And I certainly didn’t mistake ChatGPT for a sentient being or develop emotional attachments to it.
But I also envisioned a future in which some people could allow voice-based A.I. assistants into the inner sanctums of their lives, taking A.I. chatbots with them on the go and treating them as their 24/7 confidants, therapists, friends, trainers, partners and sounding boards.
Sounds crazy, right? And yet, didn’t this all seem a little crazy a year ago?