Understanding how LLMs interpret natural language plans, such as instructions and recipes, is crucial for their reliable use in decision-making systems. A critical aspect of plans is their temporal sequence, which reflects the causal relationships between steps. Planning, an integral part of decision-making, has been widely studied in domains such as robotics and embodied environments. Effectively using, revising, or customizing a plan requires the ability to reason about its steps and their causal connections. While evaluation in domains such as Blocksworld and simulated environments is common, real-world natural language plans pose a unique challenge: they cannot be physically executed to test their correctness and reliability.
Researchers at Stony Brook University, the U.S. Naval Academy, and the University of Texas at Austin have developed CAT-BENCH, a benchmark for evaluating how well advanced language models predict the order of steps in cooking recipes. Their study reveals that current state-of-the-art language models struggle with this task, achieving low F1 scores even with techniques such as few-shot learning and explanation-based prompting. While these models can generate coherent plans, the research highlights significant gaps in their understanding of causal and temporal relationships within instructional texts. The evaluations also show that prompting models to explain their predictions after generating them improves performance compared to traditional chain-of-thought prompting, and they expose inconsistencies in the models' reasoning.
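For illustration, the sketch below contrasts the two prompting orders using hypothetical prompt templates (these are not the authors' exact wordings): answer-then-explain versus standard chain-of-thought.

```python
# Illustrative sketch, not the authors' exact prompts: the two prompt
# orderings compared in the study for a step-order question.

def answer_then_explain_prompt(step_i: str, step_j: str) -> str:
    # Model answers first, then justifies; the paper reports this ordering
    # outperforms reasoning first.
    return (
        f"In the recipe, must '{step_i}' happen before '{step_j}'? "
        "Answer 'yes' or 'no' first, then explain your answer."
    )

def chain_of_thought_prompt(step_i: str, step_j: str) -> str:
    # Traditional chain-of-thought: reason step by step, then answer.
    return (
        f"In the recipe, must '{step_i}' happen before '{step_j}'? "
        "Think step by step about the preconditions and effects of each step, "
        "then answer 'yes' or 'no'."
    )
```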
Early research emphasized understanding plans and goals. Plan generation involves temporal reasoning and tracking entity states. NaturalPlan focuses on a few real-world tasks involving natural language interaction, while PlanBench showed that models struggle to produce effective plans under strict syntactic constraints. Goal-oriented script-construction tasks ask models to produce sequences of steps for specific goals. ChattyChef uses conversational settings to refine step order, and CoPlan revises steps to meet constraints. Studies of entity states, action linking, and next-event prediction explore plan understanding, and several datasets address dependencies in instructions and decision branches. However, few datasets focus on predicting and explaining temporal order constraints in instructional plans.
CAT-BENCH assesses the ability of models to recognize temporal dependencies between steps in cooking recipes. Based on causal relationships within the recipe’s directed acyclic graph (DAG), it poses questions about whether one step should occur before or after another. For example, determining whether placing dough on a baking sheet should precede removing a baked cake to cool depends on understanding the preconditions and effects of the steps. CAT-BENCH contains 2,840 questions across 57 recipes, evenly split between questions testing “before” and “after” temporal relationships. Models are evaluated on their precision, recall, and F1 score in predicting these dependencies, along with their ability to provide valid explanations for their judgments.
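As a rough illustration of how such predictions might be scored, here is a minimal Python sketch; the yes/no answer format is an assumption rather than the benchmark's exact protocol, but precision, recall, and F1 are the metrics the study reports.

```python
# Minimal scoring sketch (assumed yes/no label format) for step-dependence questions.

def score_dependence_predictions(gold, predicted):
    """gold / predicted: parallel lists of 'yes'/'no' answers to questions of
    the form 'must step i occur before/after step j?'."""
    tp = sum(g == "yes" and p == "yes" for g, p in zip(gold, predicted))
    fp = sum(g == "no" and p == "yes" for g, p in zip(gold, predicted))
    fn = sum(g == "yes" and p == "no" for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: a model that over-predicts dependence gets high recall but lower precision.
print(score_dependence_predictions(
    gold=["yes", "no", "yes", "no"],
    predicted=["yes", "yes", "yes", "no"],
))
```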
Several models were evaluated on CAT-BENCH for their performance in predicting step dependencies. In the zero-shot configuration, GPT-4-turbo and GPT-3.5-turbo achieved the highest F1 scores, while GPT-4o performed unexpectedly worse. Generating explanations along with the answers generally improved performance, notably raising GPT-4o's F1 score. However, the models were biased toward predicting dependence, which affected their overall accuracy and the balance between precision and recall. Human evaluation of model-generated explanations indicated varied quality, with larger models generally outperforming smaller ones. The models were also inconsistent in predicting the order of steps, especially when explanations were added. Further analysis revealed common errors, such as failing to handle multi-hop dependencies and failing to identify causal relationships between steps.
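The consistency issue can be pictured with a small sketch: the mirrored "before" and "after" questions about the same pair of steps describe a single dependency, so a model's answers should agree. The `ask_model` helper below is a hypothetical stand-in for an LLM call, not part of the benchmark's code.

```python
# Hedged sketch of a pairwise consistency check; `ask_model` is a hypothetical
# function that returns a normalized 'yes'/'no' answer from an LLM.

def is_consistent(ask_model, step_i: str, step_j: str) -> bool:
    # "Must step_i occur before step_j?" and "Must step_j occur after step_i?"
    # describe the same dependency, so the answers should match.
    before_answer = ask_model(f"Must '{step_i}' occur before '{step_j}'?")
    after_answer = ask_model(f"Must '{step_j}' occur after '{step_i}'?")
    return before_answer == after_answer

def consistency_rate(ask_model, step_pairs) -> float:
    # Fraction of step pairs whose mirrored questions get consistent answers.
    consistent = [is_consistent(ask_model, i, j) for i, j in step_pairs]
    return sum(consistent) / len(consistent)
```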
CAT-BENCH introduces a new benchmark for evaluating the causal and temporal reasoning capabilities of language models on procedural texts such as cooking recipes. Despite advances in large language models (LLMs), none reliably determine whether one step in a plan must precede or follow another, particularly when it comes to recognizing non-dependencies. The models also show inconsistency in their predictions. Asking LLMs to provide an answer followed by an explanation significantly improves performance compared to reasoning followed by an answer, but human evaluation of these explanations reveals substantial room for improvement in the models' understanding of step dependencies. These findings highlight the current limitations of LLMs for plan-based reasoning applications.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter.
Sana Hassan, a consulting intern at Marktechpost and a dual degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a strong interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.