Multi-hop queries have always been difficult for LLM agents, since they require multiple steps of reasoning over information drawn from different sources. They are a valuable probe of a model's understanding, reasoning, and function-calling capabilities. At a time when new large models appear almost daily with claims of unparalleled capabilities, multi-hop tool use evaluates them realistically: the model receives a complex query, which it must decompose into atomic sub-questions and solve iteratively, invoking the appropriate tools at each step. Consequently, evaluating multi-hop tool use has become critical to advancing models toward generalized intelligence.
Existing work in this field does not offer a reliable evaluation method. The methods proposed so far rely on tool-driven data construction, in which queries are simulated for a given collection of tools. This approach leaves a gap: it neither guarantees interdependence among the collected tools nor genuinely evaluates multi-hop reasoning. Furthermore, the absence of verifiable answers introduces model bias and evaluation errors. This article reviews recent research that presents a reliable method for honestly evaluating the multi-hop tool-use capabilities of large language models.
Researchers from Fudan University and ByteDance present ToolHop, a dataset designed explicitly for multi-hop tool-use evaluation, with 995 rigorously designed user queries and 3,912 associated tools. ToolHop aims to solve the problems above through diverse queries, locally executable tools, meaningful interdependencies, detailed feedback, and verifiable answers. The authors propose a novel query-driven data construction approach that expands a single multi-hop query into a comprehensive multi-hop tool-use test case.
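To make the query-driven idea concrete, consider a hedged illustration: the query below and its decomposition are invented for exposition, not drawn from ToolHop itself.

```python
# Hypothetical multi-hop query and its atomic decomposition.
# ToolHop builds its cases from MoreHopQA queries; this one is invented.
query = ("In what year was the director of the film that won "
         "Best Picture in 1995 born?")

atomic_steps = [
    "Which film won Best Picture in 1995?",   # hop 1
    "Who directed that film?",                # hop 2 (depends on hop 1)
    "In what year was that director born?",   # hop 3 (depends on hop 2)
]
```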
The proposed novel scheme comprises three key stages: tool creation, document refinement, and code generation.
Tool creation: A preliminary set of tool documents is created from the user-provided multi-hop query. The query is decomposed into atomic sub-queries, and a tool document is designed for each one, keeping the documents interdependent and relevant. In this way, each document captures the essence of its sub-query and is structured so that similar queries can also be handled, ensuring modularity and cohesion.
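As a rough sketch, a tool document for the first atomic sub-query above might resemble the JSON-Schema style used by common function-calling APIs. The field and parameter names below are assumptions for illustration, not ToolHop's actual schema.

```python
# A minimal, hypothetical tool document for the first atomic sub-query.
# The structure follows the familiar function-calling convention;
# ToolHop's real document format may differ.
best_picture_tool = {
    "name": "get_best_picture_winner",
    "description": "Return the film that won the Academy Award for "
                   "Best Picture in a given year.",
    "parameters": {
        "type": "object",
        "properties": {
            "year": {"type": "integer",
                     "description": "Award ceremony year, e.g. 1995"},
        },
        "required": ["year"],
    },
}
```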
Document refinement: The prepared tool documents then undergo extensive refinement so that they can support model evaluation in complex multi-hop scenarios. New features such as result filtering and customizable output formats are introduced to expand functionality while preserving each tool's original purpose. In parallel, the number of parameters is increased and their types are optimized.
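Continuing the sketch, refinement might turn that document into something like the following, with added parameters for result filtering and output formatting. Again, this is a hypothetical shape rather than the paper's exact format.

```python
# The same hypothetical tool after refinement: more parameters,
# optimized types, plus result filtering and output formatting.
refined_tool = {
    "name": "get_best_picture_winner",
    "description": "Return the film that won the Academy Award for "
                   "Best Picture in a given year.",
    "parameters": {
        "type": "object",
        "properties": {
            "year": {"type": "integer",
                     "description": "Award ceremony year, e.g. 1995"},
            "include_nominees": {"type": "boolean", "default": False,
                                 "description": "Also return the losing nominees"},
            "fields": {"type": "array",
                       "items": {"type": "string",
                                 "enum": ["title", "director", "studio"]},
                       "description": "Result filtering: fields to include"},
            "output_format": {"type": "string",
                              "enum": ["json", "plain_text"],
                              "default": "json"},
        },
        "required": ["year"],
    },
}
```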
Code generation: At this stage, locally executable functions are generated from the refined tool documents. Through these functions, tools can be invoked externally, allowing fluid, multi-turn interactions between the model and the tools.
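A minimal sketch of what such a generated function and its invocation harness could look like is shown below. The data, function body, and `invoke` dispatcher are hand-written stand-ins for what ToolHop generates automatically.

```python
import json

# Hypothetical locally executable implementation of the refined tool.
def get_best_picture_winner(year: int,
                            include_nominees: bool = False,
                            fields: list[str] | None = None,
                            output_format: str = "json") -> str:
    # Tiny stand-in knowledge base; generated code would embed the
    # data needed to answer its own sub-query.
    winners = {1995: {"title": "Forrest Gump", "director": "Robert Zemeckis",
                      "studio": "Paramount Pictures"}}
    nominees = {1995: ["Four Weddings and a Funeral", "Pulp Fiction",
                       "Quiz Show", "The Shawshank Redemption"]}
    record = dict(winners.get(year, {}))
    if fields:  # result filtering
        record = {k: v for k, v in record.items() if k in fields}
    if include_nominees:
        record["nominees"] = nominees.get(year, [])
    return json.dumps(record) if output_format == "json" else str(record)

# A minimal dispatcher: the model emits a tool call, the harness
# executes it locally, and the result feeds the next turn.
def invoke(tool_call: dict) -> str:
    registry = {"get_best_picture_winner": get_best_picture_winner}
    return registry[tool_call["name"]](**tool_call["arguments"])

print(invoke({"name": "get_best_picture_winner",
              "arguments": {"year": 1995, "fields": ["title", "director"]}}))
```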
The research team implemented the approach using queries drawn from the MoreHopQA dataset. To ensure ToolHop's quality, a rigorous five-dimensional analysis was carried out. ToolHop was then used to evaluate fourteen LLMs from five families, including both open- and closed-source models. The evaluation protocol was designed to guarantee answer accuracy and minimize invocation errors. The authors observed that tool use increased model performance by up to 12% on average, and by up to 23% for GPT-family models. Even so, the best-performing model reached only 49.04% answer accuracy with tools, and despite having tools available for multi-hop queries, models hallucinated tool calls roughly 10% of the time.
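Because every ToolHop case carries a verifiable answer, scoring can reduce to a direct comparison against the gold answer. Here is a minimal sketch under that assumption; `run_agent` is a placeholder for any model-plus-tool-calling loop, and the case fields are illustrative.

```python
# Hedged sketch of answer-verifiable scoring. Each case is assumed to
# look like {"query": ..., "tools": ..., "answer": ...}; the real
# ToolHop harness and normalization rules may differ.
def answer_accuracy(cases, run_agent) -> float:
    correct = 0
    for case in cases:
        prediction = run_agent(case["query"], case["tools"])
        correct += (str(prediction).strip().lower()
                    == str(case["answer"]).strip().lower())
    return correct / len(cases)
```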
Conclusion:
This paper presents a complete dataset for evaluating multi-hop query solving, built from specially designed queries and tools. The main finding of the experiments is that while LLMs have significantly improved their ability to solve complex multi-hop queries through tool use, their multi-hop tool-use capabilities still leave considerable room for improvement.
Check out the Paper. All credit for this research goes to the researchers of this project.
Adeeba Alam Ansari is currently pursuing her dual degree at the Indian Institute of Technology (IIT) Kharagpur, earning a bachelor's degree in Industrial Engineering and a master's degree in Financial Engineering. With a keen interest in machine learning and artificial intelligence, she is an avid reader and a curious person. Adeeba firmly believes in the power of technology to empower society and promote well-being through innovative solutions driven by empathy and a deep understanding of real-world challenges.