What kinds of everyday situations can Large Language Models (LLMs) reason about? Although LLMs have achieved great success on many tasks, they still struggle with tasks that require reasoning. An area of special interest is so-called “theory of mind” (ToM) reasoning, which involves tracking the mental state of an agent, including its goals and knowledge. Language models have become substantially better at correctly answering common questions, but their theory-of-mind performance still lags behind.
In this study, researchers from Johns Hopkins University test the idea that proper prompting can improve the ToM performance of LLMs.
For several reasons, LLMs must be able to perform ToM reasoning reliably:
- ToM is a crucial component of social knowledge, enabling people to participate in complex social interactions and anticipate the actions or reactions of others.
- ToM is a complex cognitive ability that is highly developed in humans and only partially in some other species, possibly because it relies on structured relational information. The ability to infer the thoughts and beliefs of agents will be useful for models that interact with social data and with people.
- Inferential reasoning is frequently used in ToM tasks.
In-context learning approaches can improve LLMs’ capacity to reason. For example, to succeed at ToM, LLMs must reason with unobservable information (such as actors’ hidden mental states) that has to be inferred from context rather than parsed from the surface text (such as an explicit statement of a situation’s attributes). Therefore, evaluating and improving the performance of these models on ToM tasks can provide insights into their potential for inferential reasoning more broadly. Researchers have shown that for sufficiently large language models (100B+ parameters), performance can be improved by supplying only a small number of task demonstrations through the model’s input alone, i.e., at inference time, without weight updates.
The term “few-shot learning” is commonly used to describe this type of performance improvement. Later studies demonstrated that LLMs’ capacity for complex reasoning was enhanced when the few-shot examples in the prompt included the steps taken to reach the conclusion (“chain-of-thought reasoning”). In addition, instructing language models to think “step by step” has been shown to improve their reasoning skills even without worked demonstrations. The benefits of these various prompting strategies are not yet well understood theoretically, but several recent studies have examined how compositional structure and local dependencies in the training data affect the efficacy of these methods.
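To make these prompting styles concrete, the sketch below assembles a zero-shot prompt, a “step by step” prompt, and a few-shot chain-of-thought prompt for a toy word problem. The question text, the demonstration, and the exact wording are illustrative assumptions, not the prompts used in the study.

```python
# Minimal sketch of the three prompting styles discussed above, using a toy
# word problem. The question text and prompt wording are illustrative
# assumptions, not material from the paper.

QUESTION = (
    "A library has 23 books on a shelf. 7 are checked out and 12 new "
    "books arrive. How many books are on the shelf now?"
)

# Few-shot chain-of-thought: the demonstration spells out each reasoning step.
COT_DEMO = (
    "Q: There are 5 apples in a basket. 2 are eaten and 4 more are added. "
    "How many apples are in the basket?\n"
    "A: Start with 5 apples. 5 - 2 = 3 after some are eaten. "
    "3 + 4 = 7 after more are added. The answer is 7.\n\n"
)

def build_prompt(strategy: str) -> str:
    """Assemble a prompt for one of the three strategies."""
    if strategy == "zero-shot":
        return f"Q: {QUESTION}\nA:"
    if strategy == "step-by-step":
        # Zero-shot, but with an instruction to reason step by step.
        return f"Q: {QUESTION}\nA: Let's think step by step."
    if strategy == "few-shot-cot":
        # Prepend a worked demonstration that includes the reasoning chain.
        return COT_DEMO + f"Q: {QUESTION}\nA:"
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    for name in ("zero-shot", "step-by-step", "few-shot-cot"):
        print(f"--- {name} ---\n{build_prompt(name)}\n")
```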
Some research supports the ability of LLMs to use ToM reasoning, while other work casts doubt on it. Although this previous literature offers many insights into ToM in LLMs, there are two main limitations to its quantitative assessments of ToM performance. First, ToM performance is scored on single-word or multiple-choice completions, whereas LLMs might do better when allowed to freely generate answers that contain multiple parts and weigh multiple possibilities. Second, most of the work criticizing LLMs’ ToM abilities used zero-shot evaluations or provided examples without a step-by-step rationale for the solution.
However, the output that LLMs produce can be highly sensitive to the prompt. The researchers therefore asked whether recent LLMs might perform better on ToM tasks when given appropriate prompts. Here, they assess how well LLMs perform on ToM comprehension tasks and investigate whether prompting techniques such as chain-of-thought reasoning, step-by-step thinking, and few-shot learning can improve performance. Improving inferential reasoning performance through prompting is attractive because it is a flexible method that requires no new training data or sizable new datasets. Furthermore, if effective prompting strategies lead LLMs to produce higher-quality ToM responses, this improves the overall reliability of their reasoning in various everyday contexts. The raw LLM results are publicly available on GitHub.
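For intuition, a ToM item of the kind discussed here typically pairs a short false-belief scenario with a free-response question, optionally followed by a step-by-step cue. The sketch below builds such a prompt; the scenario, the wording, and the helper function are hypothetical illustrations, not the study’s actual test materials.

```python
# Illustrative sketch of a free-response ToM (false-belief) prompt with a
# step-by-step cue. The scenario and wording are hypothetical, not the
# actual test items or prompts used by the researchers.

def tom_prompt(scenario: str, question: str, step_by_step: bool = True) -> str:
    """Build a free-response ToM prompt, optionally adding a step-by-step cue."""
    prompt = (
        "Read the story and answer the question.\n\n"
        f"Story: {scenario}\n\n"
        f"Question: {question}\nAnswer:"
    )
    if step_by_step:
        prompt += " Let's think step by step."
    return prompt

SCENARIO = (
    "Anna puts her keys in the drawer and leaves the room. "
    "While she is away, Ben moves the keys to the shelf."
)
QUESTION = "When Anna returns, where will she look for her keys, and why?"

print(tom_prompt(SCENARIO, QUESTION))
# A free-response answer can then be scored on whether it correctly
# attributes Anna's (now false) belief that the keys are in the drawer,
# rather than on a single-word or multiple-choice completion.
```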
Check out the Paper.