Language and vision are the two main channels through which humans interact with the world, and much of the recent progress at their intersection is owed to the remarkable capabilities of the recently popularized Large Language Models (LLMs). LLMs have taken the world by storm with their rapidly improving performance. Models like GPT-3, T5, and PaLM have begun to mimic humans in learning to read, summarize, and generate textual data.
Researchers in artificial intelligence have been working toward a general-purpose assistant that can effectively follow multimodal vision-and-language instructions aligned with human intent to complete real-world tasks. To this end, language-augmented foundation vision models are being developed for open-world visual understanding, covering tasks such as classification, detection, segmentation, captioning, visual generation, and editing. With OpenAI's release of GPT-4, the model behind the popular chatbot ChatGPT, its multimodal capabilities have proven to be a welcome addition to the LLM roster.
In a recent research article, the authors present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. The team introduces LLaVA, a Large Language and Vision Assistant: an end-to-end trained large multimodal model that connects a vision encoder with Vicuna for general-purpose visual and language understanding. Vicuna is an open-source chatbot with 13B parameters, trained by fine-tuning LLaMA on user-shared conversations.
LLaVA is an attempt to extend instruction tuning to the multimodal space. The main goal is to let users complete real-world tasks with the help of a visual assistant that can effectively follow multimodal vision-and-language instructions aligned with human intent. The team's main contributions are as follows:
- Multimodal instruction-following data: The team presents a data-reformation perspective and a pipeline for converting image-text pairs into an instruction-following format with the help of GPT-4 (a sketch of such a pipeline appears after this list).
- Large multimodal models: The team develops a large multimodal model by connecting CLIP's open-set visual encoder with a LLaMA-based language decoder (Vicuna) and fine-tuning them end-to-end on the generated vision-language instruction data (a minimal code sketch of this design follows the results paragraph below).
- The empirical study validates the effectiveness of GPT-4-generated data for LMM instruction tuning and offers practical tips for building a general-purpose instruction-following visual agent.
- SOTA performance: Combined with GPT-4, LLaVA achieves state-of-the-art accuracy on the ScienceQA multimodal reasoning dataset.
- Open-source release: The generated multimodal instruction data, the codebase for data generation and model training, the model checkpoints, and a visual chat demo are publicly available at https://github.com/haotian-liu/LLaVA.
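To make the data-generation pipeline described in the first bullet more concrete, below is a minimal sketch of how an image-text pair (a caption plus object bounding boxes) could be turned into instruction-following conversation data by prompting GPT-4. It assumes the OpenAI Python SDK is installed and an API key is configured; the prompt wording, the `caption`/`boxes` inputs, and the helper name `image_text_to_instruction_data` are illustrative assumptions, not the authors' exact prompts or code.

```python
"""Sketch: turn an image-text pair into instruction-following dialogue with GPT-4.

Illustrative only; not the authors' released pipeline.
"""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an AI visual assistant. You will receive a caption and object "
    "bounding boxes describing an image you cannot see. Generate a multi-turn "
    "conversation between a user asking about the image and an assistant "
    "answering as if it were looking at the image."
)


def image_text_to_instruction_data(caption: str, boxes: list[str]) -> str:
    """Convert one image-text pair into instruction-following dialogue text."""
    # GPT-4 never sees pixels, only the symbolic description below.
    context = f"Caption: {caption}\nBounding boxes:\n" + "\n".join(boxes)
    response = client.chat.completions.create(
        model="gpt-4",  # model name is an assumption for this sketch
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    demo_caption = "A group of people standing outside a black vehicle with luggage."
    demo_boxes = [
        "person: [0.68, 0.24, 0.77, 0.69]",
        "suitcase: [0.32, 0.51, 0.43, 0.69]",
    ]
    print(image_text_to_instruction_data(demo_caption, demo_boxes))
```

The key point is that GPT-4 only receives symbolic descriptions of the image (captions and box coordinates), yet it can author conversations, detailed descriptions, and reasoning questions as if it were looking at the picture.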
LLaVA demonstrates impressive multimodal chat abilities and achieves a relative score of 85.1% compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 reaches a new state-of-the-art accuracy of 92.53%. These results make LLaVA a promising approach and a valuable addition to the growing family of open multimodal models.
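To make the architecture in the second bullet more concrete, here is a minimal PyTorch-style sketch of the core design: patch features from a CLIP vision encoder are mapped by a trainable projection into the language model's token-embedding space and prepended to the text embeddings before the decoder. The class name, dimensions, and tensor shapes are illustrative assumptions, not the released LLaVA code.

```python
import torch
import torch.nn as nn


class LLaVAStyleConnector(nn.Module):
    """Sketch of the LLaVA idea: a frozen vision encoder (e.g. CLIP ViT) yields
    patch features, a trainable projection maps them into the language model's
    token-embedding space, and the projected "visual tokens" are concatenated
    with the text embeddings before the LLM decoder. Dimensions are illustrative
    (CLIP ViT-L/14 width 1024, Vicuna-13B hidden size 5120)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # The original LLaVA recipe uses a simple linear projection layer.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_features: (B, num_patches, vision_dim) from the vision encoder
        # text_embeddings: (B, num_text_tokens, llm_dim) from the LLM embedding table
        visual_tokens = self.projection(image_features)            # (B, P, llm_dim)
        return torch.cat([visual_tokens, text_embeddings], dim=1)  # (B, P + T, llm_dim)


if __name__ == "__main__":
    connector = LLaVAStyleConnector()
    img = torch.randn(1, 256, 1024)   # fake CLIP patch features
    txt = torch.randn(1, 32, 5120)    # fake Vicuna token embeddings
    print(connector(img, txt).shape)  # torch.Size([1, 288, 5120])
```

In the paper's recipe, the vision encoder stays frozen: a first stage trains only the projection for feature alignment on image-text pairs, and a second stage fine-tunes the projection together with the language model on the generated instruction data.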
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.