In 1977, Andrew Barto, then a researcher at the University of Massachusetts, Amherst, began exploring a new theory that neurons behaved like hedonists. The basic idea was that the human brain was driven by billions of nerve cells, each trying to maximize pleasure and minimize pain.
A year later, he was joined by another young researcher, Richard Sutton. Together, they worked to explain human intelligence using this simple concept and applied it to artificial intelligence. The result was “reinforcement learning,” a way for A.I. systems to learn from the digital equivalent of pleasure and pain.
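As a rough, hedged illustration of the idea (not the researchers' own formulation), the core loop can be written in a few lines of Python: an agent tries actions, receives a numerical reward standing in for pleasure or pain, and shifts its preferences toward whatever paid off. The two-action world, the reward values and the learning rate below are all invented for the sketch.

```python
import random

# Toy world: two actions with unknown average rewards. "Pleasure" is a positive
# reward, "pain" a negative one. All numbers here are invented for illustration.
TRUE_REWARD = {"A": 0.3, "B": 0.8}

value = {"A": 0.0, "B": 0.0}   # the agent's running estimate of each action's worth
alpha = 0.1                    # learning rate
epsilon = 0.1                  # how often the agent explores at random

for step in range(1000):
    # Mostly pick the action that currently looks best, occasionally explore.
    if random.random() < epsilon:
        action = random.choice(["A", "B"])
    else:
        action = max(value, key=value.get)

    # Receive a noisy reward: the digital stand-in for pleasure or pain.
    reward = TRUE_REWARD[action] + random.gauss(0, 0.1)

    # Nudge the estimate toward what was actually experienced.
    value[action] += alpha * (reward - value[action])

print(value)  # the estimate for "B" should end up near 0.8, above "A"
```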
On Wednesday, the Association for Computing Machinery, the world's largest society of computing professionals, announced that Dr. Barto and Dr. Sutton had won this year's Turing Award for their work on reinforcement learning. The Turing Award, which was introduced in 1966, is often called the Nobel Prize of computing. The two scientists will share the $1 million prize that comes with the award.
Over the last decade, reinforcement learning has played a vital role in the rise of artificial intelligence, including breakthrough technologies such as Google's AlphaGo and OpenAI's ChatGPT. The techniques that powered those systems were rooted in the work of Dr. Barto and Dr. Sutton.
“They are the undisputed pioneers of reinforcement learning,” said Oren Etzioni, a professor emeritus of computer science at the University of Washington and the founding chief executive of the Allen Institute for Artificial Intelligence. “They generated the key ideas, and they wrote the book on the subject.”
Their book, “Reinforcement Learning: An Introduction,” which was published in 1998, remains the definitive exploration of an idea that many experts say is only beginning to realize its potential.
Psychologists have long studied the ways in which humans and animals learn from their experiences. In the 1940s, the British scientific pioneer Alan Turing suggested that machines could learn in the same way.
But it was Dr. Barto and Dr. Sutton who began exploring the mathematics of how this could work, building on a theory that A. Harry Klopf, a computer scientist working for the government, had proposed. Dr. Barto went on to build a lab at UMass Amherst dedicated to the idea, while Dr. Sutton founded a similar kind of lab at the University of Alberta in Canada.
“It is an obvious idea when you're talking about humans and animals,” said Dr. Sutton, who is also a research scientist at Keen Technologies, an A.I. start-up, and a fellow at the Alberta Machine Intelligence Institute, one of Canada's three national A.I. labs. “As we revived it, it was about machines.”
This remained an academic pursuit until the arrival of AlphaGo in 2016. Most experts believed it would be another 10 years before someone built an A.I. system that could beat the world's best players at the game of Go.
But during a match in Seoul, South Korea, AlphaGo beat Lee Sedol, the best Go player of the past decade. The trick was that the system had played millions of games against itself, learning by trial and error. It learned which moves brought success (pleasure) and which brought failure (pain).
The Google team that built the system was led by David Silver, a researcher who had studied reinforcement learning with Dr. Sutton at the University of Alberta.
Many experts questioned whether reinforcement learning could work outside of games. Game wins are determined by points, which makes it easy for machines to distinguish between success and failure.
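To make that point concrete, here is a hedged toy sketch of how a final score can be converted into a reward for every move that led to it. The number-picking game is invented, and the learner plays against a fixed random opponent as a crude stand-in for self-play; none of this is AlphaGo's actual method.

```python
import random
from collections import defaultdict

# Invented game: each player picks 3 numbers between 1 and 5; the higher total wins.
move_value = defaultdict(float)   # learned value of choosing number m on turn t
alpha = 0.05

def pick(turn):
    # Prefer moves with higher learned value, with some random exploration.
    if random.random() < 0.2:
        return random.randint(1, 5)
    return max(range(1, 6), key=lambda m: move_value[(turn, m)])

for game in range(5000):
    my_moves = [pick(t) for t in range(3)]
    opponent = [random.randint(1, 5) for _ in range(3)]

    # The final score decides everything: +1 for success (pleasure), -1 for failure (pain).
    reward = 1.0 if sum(my_moves) > sum(opponent) else -1.0

    # Credit every move in the game with the eventual outcome.
    for turn, m in enumerate(my_moves):
        move_value[(turn, m)] += alpha * (reward - move_value[(turn, m)])

# After training, the learner should strongly prefer picking 5 on every turn.
```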
But reinforcement learning has also played an essential role in online chatbots.
Before the launch of ChatGPT in the fall of 2022, OpenAI hired hundreds of people to use an early version and provide precise suggestions that could hone its skills. They showed the chatbot how to answer particular questions, rated its answers and corrected its mistakes. By analyzing those suggestions, ChatGPT learned to be a better chatbot.
Researchers call this “reinforcement learning from human feedback,” or RLHF, and it is one of the key reasons that today's chatbots respond in surprisingly lifelike ways.
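In outline, that human feedback can be turned into a numeric reward by training a small "reward model" on people's preferences and then using its scores to steer the chatbot. The sketch below shows only the preference-learning half, with two made-up numeric features standing in for real text; it is an illustration of the general idea, not OpenAI's implementation.

```python
import math

# Hedged sketch of the preference-learning step behind RLHF.
# Each response is reduced to two invented features (say, helpfulness and brevity);
# a real system would score full text with a language model instead.
weights = [0.0, 0.0]

def score(features):
    return sum(w * f for w, f in zip(weights, features))

# Invented human feedback: pairs of (features of preferred response, features of rejected one).
comparisons = [
    ([0.9, 0.4], [0.2, 0.8]),
    ([0.8, 0.5], [0.3, 0.9]),
    ([0.7, 0.3], [0.1, 0.6]),
]

lr = 0.5
for epoch in range(200):
    for good, bad in comparisons:
        # The preferred response should score higher (a Bradley-Terry style objective).
        p_good = 1 / (1 + math.exp(score(bad) - score(good)))
        for i in range(2):
            weights[i] += lr * (1 - p_good) * (good[i] - bad[i])

# The learned score then serves as the reward signal when fine-tuning the chatbot.
print(score([0.9, 0.4]), score([0.2, 0.8]))  # the preferred style scores higher
```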
(The New York Times has sued OpenAI and its partner, Microsoft, claiming copyright infringement of news content related to A.I. systems. OpenAI and Microsoft have denied those claims.)
More recently, companies like OpenAI and DeepSeek, an up-and-coming Chinese start-up, have developed a form of reinforcement learning that allows chatbots to learn from themselves, much as AlphaGo did. By working through various math problems, for example, a chatbot can learn which methods lead to the right answer and which do not.
If this process is repeated with an enormously large set of problems, the bot can learn to mimic the way humans reason, at least in some ways. The result is so-called reasoning systems such as OpenAI's o1 or DeepSeek's R1.
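A hedged sketch of that idea: generate answers, check them against the known correct result, and reinforce whatever approach produced the right one. The two "strategies" and the arithmetic problems below are invented placeholders, not the actual training method behind o1 or R1.

```python
import random

# Toy illustration of outcome-based reinforcement on math problems.
# Two invented "strategies" stand in for different chains of reasoning:
# one adds carefully, the other sometimes slips by one.
def careful(a, b):
    return a + b

def sloppy(a, b):
    return a + b + random.choice([0, 0, 1, -1])

strategies = {"careful": careful, "sloppy": sloppy}
preference = {"careful": 0.0, "sloppy": 0.0}
alpha = 0.1

for _ in range(2000):
    a, b = random.randint(1, 50), random.randint(1, 50)
    # Mostly use the strategy that has worked best so far, sometimes explore.
    if random.random() < 0.2:
        name = random.choice(list(strategies))
    else:
        name = max(preference, key=preference.get)

    answer = strategies[name](a, b)
    # The reward is simply whether the final answer checks out against the truth.
    reward = 1.0 if answer == a + b else 0.0
    preference[name] += alpha * (reward - preference[name])

print(preference)  # "careful" should come out well ahead of "sloppy"
```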
Dr. Barto and Dr. Sutton say these systems hint at how machines will learn in the future. Eventually, they say, robots imbued with A.I. will learn from trial and error in the real world, as humans and animals do.
“Learning to control a body through reinforcement learning, that is something very natural,” said Dr. Barto.