Every Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much prior knowledge, the brainteasers are usually challenging even for skilled contestants.
That is why some experts think they are a promising way to test the limits of AI's problem-solving abilities.
In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says its test uncovered surprising insights, such as that reasoning models, OpenAI's o1 among them, sometimes “give up” and provide answers they know aren't correct.
“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors of the study, told TechCrunch.
The AI industry is in something of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, such as competency on PhD-level math and science questions, that aren't relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased so that models can't draw on “rote memory” to solve them, Guha explained.
“I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it, which is when everything clicks together at once,” Guha said. “That requires a combination of insight and a process of elimination.”
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it's possible that models trained on them can “cheat” in a sense, although Guha says he hasn't seen evidence of this.
“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”
On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly check their work before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
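To make the setup concrete, here is a minimal sketch of how a quiz benchmark like this could be scored against a chat-completions API. The puzzle file, prompt, and exact-match grading are illustrative assumptions, not the researchers' actual evaluation harness.

```python
# Hypothetical sketch of a puzzle-benchmark scoring loop.
# Assumes the `openai` Python package (v1+) and a puzzles.json file of
# {"question": ..., "answer": ...} records; neither reflects the
# study's real harness.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(model: str, path: str = "puzzles.json") -> float:
    """Return the fraction of puzzles the model answers exactly right."""
    with open(path) as f:
        puzzles = json.load(f)
    correct = 0
    for puzzle in puzzles:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Answer the riddle with a single word or phrase."},
                {"role": "user", "content": puzzle["question"]},
            ],
        )
        guess = resp.choices[0].message.content.strip().lower()
        # Exact-match grading is a simplification; real graders
        # typically allow fuzzier matches.
        correct += guess == puzzle["answer"].strip().lower()
    return correct / len(puzzles)

if __name__ == "__main__":
    print(f"accuracy: {grade('o1'):.1%}")
```

A loop like this also makes the latency trade-off visible in practice: reasoning models spend noticeably longer per question before the response comes back.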
At least one model, DeepSeek's R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random, behavior this human can certainly relate to.
The models make other bizarre choices, like giving a wrong answer only to retract it immediately, attempt to tease out a better one, and fail again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.
“On hard problems, R1 literally says that it's getting ‘frustrated,'” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration' in reasoning can affect the quality of model results.”
The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

“You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge,” Guha said. “A benchmark with broader access enables a wider set of researchers to understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and are not, capable of.”