Thanks to recent technological advances, Large Language Models (LLMs) have performed remarkably well on complex and sophisticated reasoning tasks. A key technique behind this is generating intermediate reasoning steps that lead to the final answer, known as chain-of-thought (CoT) prompting. However, most current work on CoT focuses solely on the language modality. To elicit CoT reasoning in multimodal settings, researchers frequently turn to the Multimodal-CoT paradigm, which decomposes multi-step problems into intermediate reasoning steps and produces the final answer even when the inputs span multiple modalities, such as vision and language. One of the most common ways to perform Multimodal-CoT is to convert the inputs from all modalities into a single modality (typically text) before prompting an LLM to carry out CoT. However, this approach has several drawbacks, chief among them the significant loss of information that occurs when converting data from one modality to another. Another way to achieve CoT reasoning in multimodal settings is to fine-tune small language models that fuse vision and language features.
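For illustration, here is a minimal Python sketch of that convert-then-prompt strategy. The names `caption_model` and `llm` are hypothetical placeholders standing in for an image-captioning model and a text-only LLM, not components from any specific system; the point is simply that the image is collapsed into a caption before any reasoning happens.

```python
# Minimal sketch of the "convert everything to text, then prompt" approach.
# `caption_model` and `llm` are hypothetical placeholders, not APIs from the paper.

def caption_then_cot(image, question: str) -> str:
    """Collapse the visual modality into text, then ask an LLM to reason step by step."""
    caption = caption_model.describe(image)          # vision -> text (the lossy step)
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Let's think step by step."                  # chain-of-thought trigger
    )
    return llm.generate(prompt)                      # reasoning happens over text only
```

Any visual detail the caption fails to mention is invisible to the LLM from that point on, which is exactly the information loss described above.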
The main problem with this approach, however, is that these small language models tend to produce hallucinated reasoning patterns that significantly mislead answer inference. To lessen the impact of such errors, Amazon researchers proposed Multimodal-CoT, which incorporates visual features in a decoupled training framework. The framework divides the reasoning process into two stages: rationale generation and answer inference. By including vision features at both stages, the model produces more persuasive rationales, which in turn lead to more accurate answer inference. This work is the first of its kind to study CoT reasoning across different modalities. On the ScienceQA benchmark, the technique proposed by the Amazon researchers achieves state-of-the-art performance, surpassing GPT-3.5 accuracy by 16% and exceeding human performance.
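Conceptually, the two-stage pipeline can be sketched as follows. The `rationale_model` and `answer_model` names and their interfaces are illustrative assumptions rather than the authors' actual code; what matters is that both stages consume vision features alongside the language input, and that the rationale from the first stage is fed into the second.

```python
# Simplified sketch of the two-stage Multimodal-CoT pipeline described above.
# `rationale_model` and `answer_model` stand in for the two fine-tuned models;
# their interfaces are illustrative, not the published implementation.

def multimodal_cot(question: str, context: str, options: list[str], vision_feats):
    # Stage 1: rationale generation -- vision and language inputs produce a rationale.
    lang_input = f"Question: {question}\nContext: {context}\nOptions: {options}"
    rationale = rationale_model.generate(text=lang_input, vision=vision_feats)

    # Stage 2: answer inference -- the rationale is appended to the original
    # language input, and a separately trained model of the same architecture
    # predicts the final answer.
    lang_input_with_rationale = f"{lang_input}\nSolution: {rationale}"
    answer = answer_model.generate(text=lang_input_with_rationale, vision=vision_feats)
    return rationale, answer
```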
The rationale generation and answer inference stages of Multimodal-CoT share the same model architecture and differ only in the type of input and output. Taking a vision-language model as an example, the model receives data from both the visual and linguistic domains during the rationale generation stage. Once the rationale has been produced, it is appended to the original language input in the answer inference stage to form the language input for that stage. The model then receives the updated input and is trained to produce the desired answer. The underlying model is a Transformer that performs three main functions: encoding, interaction, and decoding. Simply put, the language text is fed to a Transformer encoder to create a textual representation. This textual representation is then combined with the vision representation and fed to the Transformer decoder.
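One plausible way to realize the interaction step is a gated cross-attention fusion between the encoder's text representation and the vision features, roughly as sketched below in PyTorch. The module name, dimensions, single attention head, and gating details here are assumptions for illustration, not the exact published configuration.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative fusion block: text tokens attend to vision features, and a
    learned gate mixes the attended vision signal back into the text representation.
    Shapes and hyperparameters are assumptions, not the paper's exact setup."""

    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_h: torch.Tensor, vision_h: torch.Tensor) -> torch.Tensor:
        # text_h:   (batch, text_len, d_model) from the Transformer encoder
        # vision_h: (batch, num_patches, d_model) from a vision backbone
        attended, _ = self.attn(query=text_h, key=vision_h, value=vision_h)
        lam = torch.sigmoid(self.gate(torch.cat([text_h, attended], dim=-1)))
        fused = (1 - lam) * text_h + lam * attended   # gated mix of the two modalities
        return fused                                  # fed to the Transformer decoder
```

The gate lets the model decide, per token, how much of the attended vision signal to mix into the textual representation before decoding.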
To assess the effectiveness of their method, the researchers conducted extensive experiments on the ScienceQA benchmark, a large-scale dataset of multimodal science questions containing more than 21k multiple-choice questions with annotated answers. They found that their approach outperforms the previous state-of-the-art GPT-3.5 model by 16% on the benchmark. In a nutshell, Amazon researchers investigated and addressed the challenge of eliciting Multimodal-CoT reasoning by proposing a two-stage framework that fine-tunes language models to fuse vision and language representations and perform Multimodal-CoT. The model thus generates informative rationales that facilitate inferring the final answers. The GitHub repository for the model can be accessed below.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our 13k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing, and web development. She enjoys learning more about the technical field by participating in various challenges.