Creating general-purpose assistants that can efficiently carry out diverse real-world tasks by following users’ (multimodal) instructions has long been a goal of artificial intelligence. Recently, the field has seen growing interest in building foundation models with emergent multimodal understanding and generation capabilities for open-world tasks. Yet while large language models (LLMs) such as ChatGPT have proven effective for building general-purpose assistants for natural language tasks, how to build general-purpose multimodal assistants for computer vision and vision-language tasks remains an open question.
Current efforts to create multimodal agents can generally be divided into two groups:
(i) End-to-end training with LLMs, in which a series of large multimodal models (LMMs) is built by continually training LLMs to interpret visual information using image-text data and multimodal instruction-following data. Both open-source models such as LLaVA and MiniGPT-4 and proprietary models such as Flamingo and the multimodal GPT-4 have demonstrated impressive visual understanding and reasoning abilities. While these end-to-end training approaches are effective at helping LMMs acquire emergent abilities (such as in-context learning), building a single unified architecture that can seamlessly integrate the wide range of skills essential for real-world multimodal applications, such as image segmentation and generation, remains difficult.
(ii) Tool chaining with LLMs, in which prompts are carefully crafted so that LLMs can call multiple tools (such as pretrained vision models) to perform the desired (sub)tasks, without any additional model training. Well-known works include VisProg, ViperGPT, Visual ChatGPT, X-GPT, and MM-REACT. The strength of these approaches is their ability to handle a wide range of visual tasks with (new) tools that can be developed cheaply and plugged into an AI agent. However, prompting is not yet flexible or reliable enough for multimodal agents to consistently select and activate the appropriate tools (from a large and diverse toolset) and compose their results into final answers for real-world multimodal tasks on the fly.
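To make the tool-chaining idea concrete, here is a minimal sketch of a prompt-based dispatch loop. The tool names, prompt wording, and `llm` callable are hypothetical illustrations, not the implementation of any of the systems named above.

```python
# A minimal sketch of the tool-chaining paradigm described above. Tool names,
# prompts, and the `llm` callable are hypothetical, for illustration only.
from typing import Any, Callable, Dict

# Registry of pretrained vision tools the LLM may invoke by name.
TOOLS: Dict[str, Callable[[Any], Any]] = {
    "detect_objects": lambda image: ["dog", "frisbee"],         # stand-in for a detector
    "caption_image": lambda image: "a dog catching a frisbee",  # stand-in for a captioner
}

def run_agent(llm: Callable[[str], str], user_request: str, image: Any) -> str:
    """Prompt the LLM to pick a tool, execute it, then compose the final answer."""
    choice = llm(
        f"User request: {user_request}\n"
        f"Available tools: {list(TOOLS)}\n"
        "Reply with the single tool name to call."
    ).strip()
    tool_output = TOOLS[choice](image)
    return llm(
        f"User request: {user_request}\n"
        f"Tool `{choice}` returned: {tool_output}\n"
        "Write the final answer for the user."
    )
```

The key point is that the LLM itself is frozen: all flexibility comes from how the dispatch and composition prompts are written, which is also why reliability suffers as the toolset grows.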
Figure 1: A graphical overview of the capabilities that LLaVA-Plus gains through skill acquisition.
Researchers from Tsinghua University, Microsoft Research, the University of Wisconsin-Madison, HKUST, and IDEA Research introduce LLaVA-Plus (Large Language and Vision Assistants that Plug and Learn to Use Skills), a general-purpose multimodal assistant that acquires tool-use skills through an end-to-end training methodology that systematically expands the capabilities of LMMs via visual instruction tuning. To the best of their knowledge, this is the first documented attempt to combine the advantages of the end-to-end training and tool-chaining approaches described above. LLaVA-Plus comes with a skills repository containing a large selection of vision and vision-language tools. The design is an example of the “Society of Mind” theory: individual tools are built for specific tasks and are of limited use on their own, but when combined they yield emergent abilities that exhibit greater intelligence.
For example, given users’ multimodal inputs, LLaVA-Plus can plan a new workflow on the fly, select and activate the relevant tools from the skills repository, and compose their execution results to complete a variety of real-world tasks not seen during model training. Through instruction tuning, LLaVA-Plus can also be improved over time by adding new capabilities or tools. Consider a brand-new multimodal tool built for a given use case or skill. To generate instruction-following data for tuning, the authors collect relevant user instructions that require this tool, together with its execution results or outputs derived from them. After instruction tuning, LLaVA-Plus becomes more capable, as it learns to use this new tool to perform tasks that were previously impossible.
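Under the workflow the article describes (plan, call tools from the skills repository, compose results), the inference loop might look roughly like the sketch below. The structured-call format, field names, and tool names are illustrative assumptions, not the paper’s exact schema.

```python
# A rough sketch of an LLaVA-Plus-style inference loop: the LMM itself emits a
# structured tool call, the selected skills are executed, and their outputs are
# fed back so the model can compose the final reply. Field and tool names are
# illustrative assumptions, not the paper's exact format.
import json
from typing import Any, Callable, Dict

def answer(lmm: Callable[..., str],
           skill_repository: Dict[str, Callable[..., Any]],
           image: Any,
           instruction: str) -> str:
    # Step 1: the LMM plans and emits its tool call as structured text, e.g.
    # {"thoughts": "...", "actions": [{"tool": "grounding_dino",
    #                                  "args": {"caption": "the red mug"}}]}
    call = json.loads(lmm(image=image, text=instruction))

    # Step 2: execute each requested skill from the repository.
    results = [skill_repository[a["tool"]](image, **a["args"])
               for a in call.get("actions", [])]

    # Step 3: feed the tool outputs back so the LMM can write the final answer.
    followup = instruction + "\nTool outputs: " + json.dumps(results, default=str)
    return lmm(image=image, text=followup)
```

The contrast with the tool-chaining sketch earlier is that here a single trained LMM decides when and how to call tools, rather than a frozen LLM being steered by hand-crafted dispatch prompts.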
Furthermore, LLaVA-Plus departs from prior studies on tool-use training for LLMs, where visual inputs are used only when a multimodal tool is invoked: LLaVA-Plus instead keeps the raw visual signals in context throughout the entire human-AI interaction session, which improves the LMM’s planning and reasoning ability. In summary, the contributions of the paper are as follows:
• New multimodal instruction-following data for tool use. Using ChatGPT and GPT-4 as labeling tools, they describe a new pipeline for curating vision-language instruction-following data intended for tool use in human-AI interaction sessions (a hypothetical example of what such a sample might look like is sketched after this list).
• A new large multimodal assistant. They have built LLaVA-Plus, a general-purpose multimodal assistant that extends LLaVA by integrating a large and diverse collection of external tools that can be quickly selected, composed, and invoked to complete tasks. Figure 1 illustrates how LLaVA-Plus greatly expands the capabilities of an LMM. Their empirical study verifies the effectiveness of LLaVA-Plus, showing consistently better results on multiple benchmarks, including a new SoTA on VisiT-Bench, which covers a wide range of real-world tasks.
• Open source. The materials they will release to the public include the generated multimodal instruction data, the codebase, the LLaVA-Plus checkpoints, and a visual chat demo.
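As a purely hypothetical illustration of the first contribution, a single tool-use instruction-tuning sample could be assembled with GPT-4 as the labeler along the following lines; the prompts, field names, and schema are assumptions, not the paper’s actual pipeline.

```python
# Hypothetical sketch of assembling one tool-use instruction-tuning sample with
# GPT-4 as the labeler. Prompts and the output schema are assumptions made for
# illustration; the paper's actual data-curation pipeline may differ.
import json
from typing import Any, Callable, Dict

def build_sample(gpt4: Callable[[str], str],
                 image_caption: str,
                 tool_name: str,
                 tool_output: Any) -> Dict[str, Any]:
    # Ask GPT-4 to write a user instruction that genuinely requires the tool.
    instruction = gpt4(
        f"Image description: {image_caption}\n"
        f"Write a user request that can only be answered by calling `{tool_name}`."
    )
    # Ask GPT-4 to write the assistant's reply, grounded in the tool's real output.
    response = gpt4(
        f"User request: {instruction}\n"
        f"Output of `{tool_name}`: {json.dumps(tool_output, default=str)}\n"
        "Write the assistant's final answer, grounded in the tool output."
    )
    # The sample teaches the LMM when to call the tool and how to use its result.
    return {
        "instruction": instruction,
        "tool_call": {"tool": tool_name},
        "tool_output": tool_output,
        "response": response,
    }
```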
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.