If you're looking for a new reason to be nervous about A.I., try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can't pass.
For years, A.I. systems have been measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, SAT-caliber problems in areas such as mathematics, science and logic. Comparing models' scores over time served as a rough measure of A.I. progress.
But A.I. systems eventually became too good at those tests, so new, more difficult tests were created, often with the kinds of questions that graduate students might encounter on their exams.
Those tests aren't holding up well, either. New models from companies like OpenAI, Google and Anthropic have scored highly on many doctoral-level challenges, limiting those tests' usefulness and raising a chilling question: Are A.I. systems becoming too smart for us to measure?
This week, researchers at the Center for AI Safety and Scale AI are publishing a possible answer to that question: a new assessment, called “Humanity's Last Exam,” which they claim is the most difficult test ever administered to A.I. systems.
Humanity's Last Exam is the brainchild of Dan Hendrycks, a well-known A.I. safety researcher and director of the Center for AI Safety. (The test's original name, “Mankind's Last Battle,” was scrapped for being too dramatic.)
Hendrycks worked with Scale AI, an artificial intelligence company he advises, to compile the test, which consists of roughly 3,000 multiple-choice and short-answer questions designed to probe the capabilities of A.I. systems in areas ranging from analytical philosophy to rocket engineering.
The questions were submitted by experts in these fields, including university professors and award-winning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.
Here, try to answer a question about hummingbird anatomy from the test:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons does this sesamoid bone support? Respond with a number.
Or, if physics is more your speed, try this one:
A block is placed on a horizontal rail, along which it can slide without friction. It is attached to the end of a rigid, massless rod of length R. A mass is attached to the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass receives an infinitesimal push, parallel to the rail. Assume that the system is designed so that the rod can rotate a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is again vertical, with the mass directly below the block, it carries tension T2. (Both quantities could be negative, indicating that the rod is compressed.) What is the value of (T1−T2)/W?
(I would print the answers here, but that would spoil the test for any A.I. systems being trained on this column. Plus, I'm too dumb to check the answers myself.)
Questions for Humanity's Last Exam went through a two-step filtering process. First, the submitted questions were given to leading A.I. models to solve.
If the models could not answer them (or if, in the case of multiple-choice questions, the models performed worse than random guessing), the questions were handed over to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote the highest-rated questions were paid between $500 and $5,000 per question, in addition to receiving credit for contributing to the exam.
Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted some questions to the test. Three of his questions were chosen and, he told me, they were all “in the upper range of what one might see on a graduate exam.”
Hendrycks, who helped create a widely used A.I. test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder A.I. tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety advisor to Musk's A.I. company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to A.I. models, which he thought were too easy.
“Elon looked at the MMLU questions and said, 'These are college-level. I want things that a world-class expert can do,'” Hendrycks said.
There are other tests that attempt to measure advanced A.I. capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the A.I. researcher François Chollet.
But Humanity's Last Exam aims to determine how good A.I. systems are at answering complex questions across a wide variety of academic subjects, giving us what could be considered a general intelligence score.
“We're trying to estimate the extent to which A.I. can automate a lot of really difficult intellectual work,” Hendrycks said.
With the list of questions compiled, the researchers gave Humanity's Last Exam to six leading A.I. models, including Google's Gemini 1.5 Pro and Anthropic's Claude 3.5 Sonnet. They all failed miserably. OpenAI's o1 system scored the highest of the group, at 8.3 percent.
(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to artificial intelligence systems. OpenAI and Microsoft have denied those claims.)
Hendrycks said he expected those scores to rise quickly, potentially surpassing 50 percent by the end of the year. At that point, he said, A.I. systems could be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we may have to turn to other ways of measuring A.I.'s impacts, such as looking at economic data or judging whether it can make novel discoveries in areas such as mathematics and science.
“You can imagine a better version of this where we can ask questions that we don't know the answers to yet and we can check if the model can help us solve them,” said Summer Yue, Scale AI's research director and an organizer of the exam.
Part of what's so confusing about A.I. progress these days is how patchy it is. We have A.I. models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Mathematical Olympiad and outperforming the best human programmers on competitive coding challenges.
But these same models sometimes have difficulty with basic tasks, such as arithmetic or writing metrical poetry. That has given them a reputation for being astonishingly brilliant at some things and totally useless at others, and has created very different impressions about how quickly A.I. is improving, depending on whether you look at the best or worst results.
This irregularity has also made these models difficult to measure. Last year I wrote that we need better evaluations of A.I. systems. I still believe that. But I also think we need more creative methods of tracking A.I. progress that don't rely on standardized tests, because most of what humans do (and what we fear A.I. will do better than us) can't be captured in a written exam.
Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity's Last Exam, told me that while A.I. models were often impressive at answering complex questions, he did not consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers.
“There is a huge chasm between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an A.I. that can answer these questions might not be ready to assist in research, which is inherently less structured.”