OpenAI is releasing a new model called o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, faster than a human. It will be released alongside o1-mini, a smaller, cheaper version. And yes, if you’re immersed in AI buzz: this is, in fact, the much-publicized Strawberry model.
For OpenAI, o1 represents a step toward its broader goal of human-like AI. In more practical terms, it does a better job of writing code and solving multi-step problems than previous models, but it’s also more expensive and slower to use than GPT-4o. OpenAI calls this version of o1 a “preview” to emphasize how nascent it is.
ChatGPT Plus and Team users will have access to o1-preview and o1-mini starting today, while Enterprise and Edu users will get access early next week. OpenAI says it plans to provide access to o1-mini to all free ChatGPT users but hasn’t set a release date yet. Developer access to o1 is, in fact, expensive: in the API, o1-preview costs $15 per million input tokens, or pieces of text analyzed by the model, and $60 per million output tokens. For comparison, GPT-4o costs $5 per million input tokens and $15 per million output tokens.
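For developers weighing that pricing, the arithmetic is simple enough to sketch. The example below uses OpenAI’s official Python SDK; the prompt is my own, and the rates are the preview prices quoted above. Note that o1 bills its hidden reasoning tokens as output tokens, which is part of why replies cost more than the visible answer length suggests.

```python
from openai import OpenAI

# Preview API pricing quoted above, in dollars per million tokens.
O1_INPUT, O1_OUTPUT = 15.00, 60.00
GPT4O_INPUT, GPT4O_OUTPUT = 5.00, 15.00

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative request; o1-preview takes ordinary chat messages.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)

usage = response.usage  # completion_tokens includes hidden reasoning tokens
o1_cost = (usage.prompt_tokens * O1_INPUT
           + usage.completion_tokens * O1_OUTPUT) / 1_000_000
gpt4o_cost = (usage.prompt_tokens * GPT4O_INPUT
              + usage.completion_tokens * GPT4O_OUTPUT) / 1_000_000

print(response.choices[0].message.content)
print(f"o1-preview cost: ${o1_cost:.4f} (same tokens on GPT-4o: ${gpt4o_cost:.4f})")
```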
The training behind o1 is fundamentally different from its predecessors, OpenAI research lead Jerry Tworek tells me, though the company is vague about the exact details. He says o1 “has been trained using a completely new optimization algorithm and a new training dataset specifically designed for it.”
OpenAI taught previous GPT models to mimic patterns from their training data. With o1, it trained the model to solve problems on its own using a technique known as reinforcement learning, which teaches the system through rewards and penalties. It then uses a “chain of thought” to process queries, similar to how humans process problems by analyzing them step by step.
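OpenAI hasn’t published how that reinforcement learning is set up, but the reward-and-penalty idea itself is easy to see in miniature. The toy below is my own sketch and has nothing to do with o1’s actual training: a “policy” picks between two answers, and each reward or penalty nudges its preference toward whatever worked.

```python
import random

# Toy reinforcement learning, for illustration only: the system picks an
# answer, gets a reward or a penalty, and updates its preference estimate.
preferences = {"right answer": 0.0, "wrong answer": 0.0}
LEARNING_RATE = 0.1

def choose(prefs):
    # Explore a random answer 20% of the time; otherwise exploit the best one.
    if random.random() < 0.2:
        return random.choice(list(prefs))
    return max(prefs, key=prefs.get)

for _ in range(200):
    action = choose(preferences)
    reward = 1.0 if action == "right answer" else -1.0  # reward vs. penalty
    # Nudge this action's preference toward the reward it just received.
    preferences[action] += LEARNING_RATE * (reward - preferences[action])

print(preferences)  # "right answer" ends up with a much higher preference
```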
As a result of this new training methodology, OpenAI says the model should be more accurate. “We have noticed that this model is hallucinating less,” Tworek says. But the problem remains. “We can’t say we’ve solved the hallucinations.”
The main thing that sets this new model apart from GPT-4o is its ability to tackle complex problems, such as coding and math, much better than its predecessors while still explaining its reasoning, according to OpenAI.
“The model is definitely better than I was at solving the AP math exam, and I was a math major in college,” OpenAI research director Bob McGrew tells me. He says OpenAI also tested o1 on a qualifying exam for the International Mathematical Olympiad, and while GPT-4o only solved 13 percent of the problems correctly, o1 scored 83 percent.
In online programming contests hosted on Codeforces, the new model reached the 89th percentile of participants, and OpenAI claims that the next update will perform “similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology.”
At the same time, o1 isn’t as capable as GPT-4o in many areas. Its factual knowledge about the world isn’t as strong. It also can’t browse the web or process files and images. Still, the company believes it represents a new class of capabilities, which is why it dubbed the model o1: a way of “resetting the counter to 1.”
“I’ll be honest: I think we’re traditionally terrible at naming things,” McGrew says. “So I hope this is the first step toward newer, more sensible names that better convey what we’re doing to the rest of the world.”
I wasn’t able to demo o1 myself, but McGrew and Tworek showed it to me over a video call this week. They asked it to solve this puzzle:
“A princess is as old as the prince will be when the princess is twice as old as the prince was when the princess’s age was half the sum of their current ages. What is the age of the prince and princess? Please provide all the solutions to that question.”
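For readers who want to check the model’s work, here is one worked reading of the riddle, with variable names of my own choosing:

```latex
\begin{aligned}
&\text{Let } P \text{ be the princess's age today and } Q \text{ the prince's.} \\
&\text{When the princess was half the sum of their current ages: }
  P - t_1 = \tfrac{P+Q}{2} \;\Rightarrow\; t_1 = \tfrac{P-Q}{2}, \\
&\text{at which point the prince was } Q - t_1 = \tfrac{3Q-P}{2}. \\
&\text{When the princess is twice that age: }
  P + t_2 = 3Q - P \;\Rightarrow\; t_2 = 3Q - 2P. \\
&\text{The princess is as old as the prince will be then: }
  P = Q + t_2 = 4Q - 2P \;\Rightarrow\; 3P = 4Q.
\end{aligned}
```

So every solution shares the ratio 4:3: a princess of 8 and a prince of 6 works, as does 40 and 30, and so on, which is why the puzzle asks for all of them.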
The model buffered for 30 seconds and then delivered a correct answer. OpenAI has designed the interface to show the steps of reasoning as the model thinks. What strikes me is not that it showed its work (GPT-4o can do that if asked), but how o1 seemed to deliberately mimic human thought. Phrases like “I’m curious about,” “I’m thinking about it,” and “Okay, let me see” created an illusion of step-by-step thinking.
But this model doesn't think, and it's certainly not human. Why design it to look like it does?
According to Tworek, OpenAI doesn’t believe in equating AI models’ thinking with human thinking, but the interface is meant to show how the model spends more time processing and goes deeper into solving a problem. “There are ways in which it feels more human than previous models.”
“I think you’ll find that there are a lot of ways in which it feels a little bit strange, but there are also ways in which it feels surprisingly human,” McGrew says. The model has a limited amount of time to process queries, so it might say something like, “Oh, I’m running out of time, let me get to an answer quickly.” Early on during its chain of thought, it might also appear to be brainstorming and say something like, “I could do this or that, what should I do?”
Building toward agents
Large language models aren’t exactly that smart as they exist today; they essentially just predict sequences of words based on patterns learned from large amounts of data. Take ChatGPT, which tends to wrongly claim that the word “strawberry” has only two Rs because it doesn’t break the word down correctly. For what it’s worth, the new o1 model did resolve that query correctly.
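That failure is usually blamed on tokenization: the model sees subword chunks, not letters. OpenAI’s open source tiktoken library makes the chunking visible; the sketch below uses o200k_base, the encoding tiktoken ships for GPT-4o, since o1’s exact tokenizer hasn’t been detailed.

```python
import tiktoken  # OpenAI's open source tokenizer library

# o200k_base is the GPT-4o encoding; treating it as representative here.
enc = tiktoken.get_encoding("o200k_base")

tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])
# The word arrives as a couple of multi-letter chunks (e.g. something like
# ['str', 'awberry']), so the model never "sees" three separate Rs to count.

# Ordinary code, by contrast, works on characters and gets it trivially right:
print("strawberry".count("r"))  # 3
```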
As OpenAI reportedly seeks to raise more funding at a staggering valuation of $150 billion, its momentum depends on further research advances. The company is incorporating reasoning capabilities into LLMs because it sees a future with autonomous systems, or agents, that are able to make decisions and take actions on your behalf.
For AI researchers, cracking reasoning is an important step toward human-level intelligence. The idea is that if a model is capable of more than just pattern recognition, it could lead to breakthroughs in areas like medicine and engineering. For now, however, o1’s reasoning capabilities are relatively slow, not agent-like, and costly for developers to use.
“We’ve spent many months working on the reasoning because we think this is really the breakthrough,” McGrew says. “It’s basically a new kind of modeling to be able to solve the really hard problems that are needed to progress to human-like levels of intelligence.”