Understanding social interactions in complex, real-world environments requires deep mental reasoning to infer the underlying mental states that drive them, an ability known as theory of mind (ToM). Social interactions are often multimodal, involving past actions, conversations, and behaviors. For AI to engage effectively in human environments, it must understand these mental states and how they relate to one another. Despite advances in machine theory of mind, current benchmarks focus primarily on individual mental states and lack multimodal datasets for assessing multi-agent theory of mind. This gap hinders the development of AI systems capable of understanding nuanced social interactions, which is crucial for safe human-AI interaction.
Researchers from Johns Hopkins University and the University of Virginia presented MuMA-ToM, the first benchmark for assessing multimodal, multi-agent ToM reasoning in embodied interactions. MuMA-ToM presents videos and text describing real-life scenarios and asks questions about agents' goals and their beliefs about others' goals. The researchers validated MuMA-ToM through human experiments and introduced LIMP (Language model-based Inverse Multi-agent Planning), a new ToM model. LIMP outperformed existing models, including GPT-4o and BIP-ALM, by integrating two-level reasoning and eliminating the need for symbolic representations. The work highlights the remaining gap between human and machine ToM.
ToM benchmarks have traditionally focused on single-agent reasoning, while multi-agent benchmarks often lack questions about the relationships between agents. Existing ToM benchmarks are typically text- or video-based, with few exceptions such as MMToM-QA, which addresses single-agent activities in a multimodal format. MuMA-ToM, in contrast, targets multi-agent ToM reasoning, using text and video to represent realistic interactions. Unlike previous methods such as BIP-ALM, which requires symbolic representations, the LIMP model performs inverse multi-agent planning over general, domain-invariant representations, improving ToM reasoning in multimodal, multi-agent contexts.
The MuMA-ToM benchmark evaluates models' understanding of multi-agent social interactions using video and text. It includes 225 interactions and 900 questions covering three ToM concepts: belief inference, social goal inference, and belief-of-goal inference. Interactions are procedurally generated with distinct multimodal inputs, challenging models to fuse this information effectively. The benchmark is grounded in the I-POMDP framework, and the accompanying LIMP model integrates vision-language and language models to infer mental states. Human accuracy is high, while even the strongest baseline models, such as Gemini 1.5 Pro and LLaVA 1.6, fall well short.
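To make the task format concrete, here is a minimal sketch of how a multimodal, multiple-choice ToM item of this kind could be represented and scored. The field names, the ToMQuestion structure, and the evaluate helper are illustrative assumptions for this article, not MuMA-ToM's actual data schema or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class ToMQuestion:
    """One hypothetical multimodal benchmark item: a video clip, a text
    description of the interaction, and a multiple-choice ToM question."""
    video_path: str      # clip showing the multi-agent interaction
    text_context: str    # text describing prior actions and utterances
    question: str        # e.g., "What does agent A believe about B's goal?"
    options: list[str]   # candidate answers
    answer_index: int    # index of the correct option
    question_type: str   # "belief", "social_goal", or "belief_of_goal"

def evaluate(model, items: list[ToMQuestion]) -> float:
    """Fraction of items answered correctly. `model` is any callable
    mapping (video, text, question, options) to an option index."""
    correct = 0
    for item in items:
        pred = model(item.video_path, item.text_context,
                     item.question, item.options)
        correct += int(pred == item.answer_index)
    return correct / len(items)
```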
In the experiments, 18 Prolific participants answered 90 randomly selected questions from the MuMA-ToM benchmark, achieving a high accuracy of 93.5%. State-of-the-art models, including Gemini 1.5 Pro and LLaVA 1.6, performed significantly worse, with the best model reaching 56.4% accuracy. The LIMP model outperformed the others with 76.6% accuracy by effectively integrating multimodal inputs and using natural language for action inference. However, LIMP's limitations include susceptibility to visual hallucinations and a lack of explicit multi-level reasoning. The benchmark itself is currently limited to two-agent interactions in synthetic home environments.
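For intuition, the sketch below illustrates the general idea behind this kind of language-based inverse planning: a vision-language model first turns the video into natural-language action descriptions, and a language model then scores how well each candidate mental-state hypothesis explains those actions, assuming the agents act approximately rationally. The function names (vlm_describe_actions, lm_score) and the softmax scoring are placeholders and simplifications, not LIMP's actual implementation.

```python
import math

def infer_mental_state(video_path, text_context, question, options,
                       vlm_describe_actions, lm_score):
    """Score candidate mental-state hypotheses against observed actions.

    vlm_describe_actions(video_path) -> str: natural-language summary of
        what each agent did in the clip (the visual front end).
    lm_score(prompt) -> float: log-probability-style score from a language
        model for how well a hypothesis explains the observations (the
        inverse-planning step: actions are assumed to be roughly rational
        given the hypothesized goal or belief).
    """
    actions = vlm_describe_actions(video_path)
    scores = []
    for hypothesis in options:
        prompt = (
            f"Context: {text_context}\n"
            f"Observed actions: {actions}\n"
            f"Hypothesis about the agent's mental state: {hypothesis}\n"
            f"Question: {question}\n"
            "How well do the observed actions fit this hypothesis, assuming "
            "the agent acts approximately rationally given it?"
        )
        scores.append(lm_score(prompt))
    # Softmax over hypothesis scores (a crude stand-in for a Bayesian
    # posterior with a uniform prior), then pick the most probable option.
    max_s = max(scores)
    probs = [math.exp(s - max_s) for s in scores]
    total = sum(probs)
    probs = [p / total for p in probs]
    return max(range(len(options)), key=probs.__getitem__)
```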
In conclusion, MuMA-ToM is the first multimodal theory-of-mind benchmark for assessing mental reasoning in complex multi-agent interactions. MuMA-ToM uses video and text inputs to assess goal and belief understanding in realistic home environments. The study systematically evaluated human performance, tested state-of-the-art models, and proposed LIMP (Language Model-based Inverse Multi-Agent Planning). LIMP outperformed existing models, including GPT-4o and Gemini 1.5 Pro. Future work will extend the benchmark to more complex real-world scenarios, including interactions involving more agents and real-world videos.
Take a look at the Paper. All credit for this research goes to the researchers of this project.

Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.