In conversational AI, assessing Theory of Mind (ToM) through question answering has become an essential benchmark. However, passive narratives fall short when evaluating ToM capabilities in interactive settings. To address this limitation, questions have been designed that require the same reasoning skills but are grounded in conversations. These questions have revealed the limited ToM capabilities of LLMs: even with chain-of-thought reasoning or fine-tuning, cutting-edge LLMs struggle with them and perform below human standards.
Researchers from several universities presented FANToM, a benchmark for testing ToM in LLMs through question answering in conversational contexts. It incorporates psychological and empirical insights into LLM evaluation. FANToM is challenging for top LLMs, which perform worse than humans even with advanced reasoning or fine-tuning. The benchmark evaluates LLMs by requiring binary answers to questions about what characters know and by asking models to list the characters who hold a specific piece of information. Human performance was measured with 11 student volunteers.
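To make the two answer formats concrete, here is a minimal Python sketch; the class and field names, the gold labels, and the example conversation are hypothetical illustrations, not FANToM's actual data schema or scoring code.

```python
# Hypothetical sketch of FANToM's two question formats: a binary question
# about one character's knowledge, and a list question asking which
# characters hold a specific piece of information.
from dataclasses import dataclass


@dataclass
class BeliefQuestion:
    """Binary format: does `character` know `fact`? Gold label is True/False."""
    character: str
    fact: str
    answer: bool


@dataclass
class ListQuestion:
    """List format: which characters know `fact`? Gold label is a set of names."""
    fact: str
    answer: set[str]


def score_binary(prediction: bool, gold: BeliefQuestion) -> bool:
    return prediction == gold.answer


def score_list(prediction: set[str], gold: ListQuestion) -> bool:
    # Strict scoring: credit only when the predicted set matches exactly.
    return prediction == gold.answer


# Carol left before the second fact was mentioned, so she should not know it.
q1 = BeliefQuestion(character="Carol", fact="Bob is moving abroad", answer=False)
q2 = ListQuestion(fact="Bob is moving abroad", answer={"Alice", "Bob"})
print(score_binary(False, q1), score_list({"Alice", "Bob"}, q2))  # True True
```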
FANToM is a new English benchmark designed to assess machine ToM in conversational contexts, focusing on social interactions. It includes 10,000 questions grounded in multi-party conversations with information asymmetry: characters hold different mental states because some information is inaccessible to them, for instance when a participant is absent while a fact is shared. The benchmark measures a model's ability to track beliefs across a discussion, tests its understanding of others' mental states, and identifies cases of illusory ToM, where a model answers individual questions correctly without a coherent picture of who knows what.
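As a toy illustration of how such information asymmetry arises, the sketch below tracks which characters hear which facts as participants join and leave a conversation; the event format and names are invented for this example and are not part of FANToM's pipeline.

```python
def track_knowledge(events):
    """events: ("join", name), ("leave", name), or ("say", speaker, fact)."""
    present: set[str] = set()
    knows: dict[str, set[str]] = {}  # fact -> names of those present when it was said
    for event in events:
        if event[0] == "join":
            present.add(event[1])
        elif event[0] == "leave":
            present.discard(event[1])
        else:  # a fact is shared with everyone currently present
            knows.setdefault(event[2], set()).update(present)
    return knows


events = [
    ("join", "Alice"), ("join", "Bob"), ("join", "Carol"),
    ("say", "Alice", "Alice got a new job"),
    ("leave", "Carol"),  # Carol misses everything said after this point
    ("say", "Bob", "Bob is moving abroad"),
]
# Carol knows about the job but not the move, so her belief state diverges.
print(track_knowledge(events))
```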
The FANToM evaluation results reveal that even with chain-of-thought reasoning or fine-tuning, existing LLMs perform significantly worse than humans. Some of the ToM reasoning LLMs exhibit on FANToM is illusory, indicating an inability to track the characters' different perspectives. While zero-shot chain-of-thought prompting or fine-tuning improves LLM scores, substantial gaps remain compared with human performance. The findings highlight the challenges in developing models with coherent Theory of Mind reasoning, emphasizing the difficulty of achieving human-level understanding in LLMs.
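For readers unfamiliar with the technique, zero-shot chain-of-thought prompting simply appends a reasoning trigger to the question before querying the model; the sketch below shows the idea, with `query_llm` as a hypothetical placeholder for whatever chat-completion API is in use.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM API call."""
    raise NotImplementedError


def zero_shot_cot(conversation: str, question: str) -> str:
    prompt = (
        f"{conversation}\n\n"
        f"Question: {question}\n"
        # The trigger phrase elicits step-by-step reasoning before the answer.
        "Answer: Let's think step by step."
    )
    return query_llm(prompt)
```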
In conclusion, FANToM is a valuable benchmark for evaluating ToM in LLMs during conversational interactions, highlighting the need for more interaction-oriented benchmarks that better align with real-world use cases. The benchmark has shown that current LLMs underperform humans even with advanced techniques. The work has also identified the problem of internal consistency in neural models and outlined several approaches to address it. FANToM emphasizes the distinction between accessible and inaccessible information in ToM reasoning.
Future research directions include grounding ToM reasoning in pragmatics, visual information, and belief graphs. Evaluations can cover diverse conversation scenarios beyond small talk on specific topics, and multimodal aspects such as visual information can be integrated. Addressing the internal consistency of neural models remains crucial. FANToM is now publicly available, advancing the understanding of ToM in LLMs, and future studies may incorporate relationship variables for more dynamic social reasoning.
Hello, my name is Adnan Hassan. I'm a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.