A key goal in AI development is the creation of general-purpose assistants built on large multimodal models (LMMs). Central to this concept are AI systems that can work alongside people across diverse settings and a wide variety of tasks. Such assistants are not limited to a single area of expertise: they can handle customer service, creative projects, personal task management, and even demanding analytical work. With LMMs, these assistants can process and respond to a far wider variety of inputs, increasing their versatility and practicality.
A collaborative effort by ByteDance, NTU, CUHK, and HKUST has produced LLaVA-OneVision, a significant advancement in large language and vision assistant (LLaVA) research. The system demonstrates how to build a single model that can understand and perform a wide range of computer vision tasks in real-world scenarios. Its recipe, which links vision encoders to large language models (LLMs) through a simple connection module, is cost-effective and can benefit the entire AI community.
The first LLaVA model displayed remarkable multimodal conversational abilities, occasionally mimicking GPT-4V's behavior on novel images and instructions. LLaVA-1.5 achieved state-of-the-art (SoTA) performance across a broad set of benchmarks with a data-efficient recipe, vastly expanding and improving capabilities by incorporating more academic-oriented instruction data. LLaVA-NeXT builds on this foundation and significantly improves performance through three main ingredients: the AnyRes technique for handling high-resolution images, an expanded pool of high-quality instruction data, and the best open-source LLM available at the time. The minimalist design of the LLaVA series carries over to the model architecture, with the main goals of exploiting the pre-trained capabilities of the LLM and the visual model and enabling robust data and model scaling behavior.
LLaVA-OneVision modeling
The key to successful visual encoding is the representation of visual signals. Two factors determine this representation: the raw pixel resolution of the input and the number of tokens it occupies in the feature space. Scaling either factor improves performance, particularly on tasks that require fine visual detail. The researchers find that scaling the resolution is more effective than scaling the number of tokens for balancing performance against cost, and they propose an AnyRes strategy with pooling.
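To make this concrete, here is a minimal, hedged sketch of an AnyRes-style pipeline: a high-resolution image is split into a grid of crops, each crop is encoded separately, and the resulting token grids are pooled to cap the total token count. Function and parameter names (`anyres_encode`, `grid`, `pool_stride`) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def anyres_encode(image: torch.Tensor, vision_encoder, grid=(2, 2), pool_stride=2):
    """Sketch of AnyRes with pooling (illustrative, not the official code).

    image: (3, H, W) high-resolution input
    vision_encoder: maps a (3, h, w) crop to an (n_tokens, d) token sequence
    """
    _, H, W = image.shape
    rows, cols = grid
    crop_h, crop_w = H // rows, W // cols

    crop_tokens = []
    for r in range(rows):
        for c in range(cols):
            crop = image[:, r * crop_h:(r + 1) * crop_h, c * crop_w:(c + 1) * crop_w]
            tokens = vision_encoder(crop)               # (n_tokens, d)
            n = int(tokens.shape[0] ** 0.5)             # assume a square token grid
            grid_feats = tokens.T.reshape(1, -1, n, n)  # (1, d, n, n)
            # Bilinear pooling reduces tokens per crop, trading token count for resolution.
            pooled = F.interpolate(
                grid_feats, size=(n // pool_stride, n // pool_stride), mode="bilinear"
            )
            crop_tokens.append(pooled.flatten(2).squeeze(0).T)  # (n'*n', d)

    # Concatenate crops; a base low-resolution view would normally be prepended as well.
    return torch.cat(crop_tokens, dim=0)
```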
The proposed approach to data scaling in multimodal pretraining emphasizes efficiency, particularly given the often poor quality of web-scale public image-text data. By concentrating a limited compute budget on learning high-quality knowledge, the researchers aim to refine and extend the information already encoded in pretrained LLMs and ViTs. To this end, they carefully curate data from three main areas (a sketch of assembling such a mixture follows the list):
- Re-captioned detailed description data: Among open-source LMMs, LLaVA-NeXT-34B stands out for its detailed captioning capability, so the team used it to generate new captions for the COCO118K, BLIP558K, and CC3M datasets, yielding 3.5 million samples of re-captioned detailed description data. Because an earlier version of the model produces its own training data, this can be viewed as a basic form of self-improving AI.
- Document and OCR data: The team used the text-reading subset of the UReader dataset, roughly 100K samples obtained through PDF rendering. Combining this text-reading data with SynDOG EN/CN produced 1.1 million document/OCR samples.
- Chinese and language data: To strengthen the model's Chinese capability, the researchers used the original ShareGPT4V images and GPT-4V (served via the Azure API) to generate 92K fine-grained Chinese caption samples. To keep the model's language understanding ability balanced against the large volume of fine-grained caption data, they also extracted 143K samples from the Evol-Instruct dataset.
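Below is a hedged sketch of how such a high-quality knowledge mixture might be assembled for this stage; the corpus names are placeholders and proportional sampling is an illustrative choice, not the authors' configuration.

```python
import random

# Corpus sizes taken from the description above (in samples); names are placeholders.
corpora = {
    "recaptioned_descriptions": 3_500_000,  # COCO118K + BLIP558K + CC3M, re-captioned
    "document_ocr": 1_100_000,              # UReader text reading + SynDOG EN/CN
    "chinese_captions": 92_000,             # ShareGPT4V images re-captioned via GPT-4V
    "language_instructions": 143_000,       # Evol-Instruct subset
}

def build_sampler(corpora):
    """Return a function that picks a corpus proportionally to its size."""
    names = list(corpora)
    weights = [corpora[n] for n in names]
    def sample():
        return random.choices(names, weights=weights, k=1)[0]
    return sample

sample_corpus = build_sampler(corpora)
print([sample_corpus() for _ in range(5)])  # mostly re-captioned description data
```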
Tuning an LMM to interpret and respond to visual instructions is called visual instruction tuning. The large multimodal model processes instructions that mix text with images or videos, and producing the required responses means combining visual understanding with natural language processing. Previous research has shown that an LMM's ability depends heavily on its visual instruction tuning data, so maintaining a repository of high-quality datasets is both essential and beneficial for the community. The researchers collected data in uneven proportions across categories from a wide variety of original sources, adding several newly released subsets of the Cauldron and Cambrian datasets. The data is classified along a three-level hierarchy of vision, instruction, and response.
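As a concrete illustration, a single training sample could be represented along this vision/instruction/response hierarchy roughly as follows; the field names and categories are hypothetical, not the paper's schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VisualInstructionSample:
    # Vision level: what visual input the sample carries
    # (none, single image, multiple images, or video frames).
    images: Optional[List[str]]      # paths or URLs; None for text-only samples
    vision_type: str                 # "single-image" | "multi-image" | "video" | "text"

    # Instruction level: the task category the question belongs to,
    # e.g. general QA, OCR, chart/document understanding, math reasoning.
    instruction: str
    category: str

    # Response level: free-form answers from strong annotators (e.g. GPT-4V/o)
    # or fixed-format answers from academic datasets.
    response: str
    response_format: str             # "free-form" | "multiple-choice" | "short-answer"

sample = VisualInstructionSample(
    images=["chart_001.png"],
    vision_type="single-image",
    instruction="What was the highest revenue year shown in the chart?",
    category="chart-understanding",
    response="2021",
    response_format="short-answer",
)
```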
Academic datasets such as VQAv2, GQA, and Visual Genome provide fixed-format data, while stronger models such as Gemini and GPT-4V/o annotate free-format data. Original answers are kept for free-format data; for fixed-format data, the team reviews the material by hand and corrects any errors it finds in the question and answer formats. For data types such as multiple choice, short answers, and specialized tasks (e.g., OCR), the LLaVA-1.5 prompting strategy is followed. This guidance keeps the model's behavior consistent, avoids conflicts caused by diverse data sources, and balances QA performance, conversational ability, and reasoning on more complex tasks.
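The sketch below shows, under assumptions, how format hints might be appended to fixed-format questions in the spirit of LLaVA-1.5-style prompting; the hint strings and the `format_fixed_prompt` helper are illustrative, and the exact wording in the authors' data pipeline may differ.

```python
# Hypothetical format hints paraphrasing the style of LLaVA-1.5-era prompting;
# the exact strings used in the official pipeline may differ.
FORMAT_HINTS = {
    "short-answer": "Answer the question using a single word or phrase.",
    "multiple-choice": "Answer with the option's letter from the given choices directly.",
    "ocr": "Report the text visible in the image exactly as written.",
}

def format_fixed_prompt(question: str, response_format: str,
                        choices: list[str] | None = None) -> str:
    """Append a format hint so fixed-format answers stay short and gradable."""
    parts = [question]
    if choices:
        parts += [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices)]
    parts.append(FORMAT_HINTS[response_format])
    return "\n".join(parts)

print(format_fixed_prompt("Which animal is in the picture?",
                          "multiple-choice", ["Cat", "Dog", "Horse"]))
```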
One instruction set is used for single-image scenarios, and a second covers all vision scenarios (single images, multiple images, and video). The team's previous research motivates this separation by demonstrating the interdependence of image and video models; specifically, a stronger image model generalizes better to tasks involving multiple images or videos. The training data for single-image tasks is also much larger in quantity and higher in quality than the video and multi-image data.
To give the LLM multimodal capabilities while keeping ablation experiments tractable, the team rigorously separates training into three distinct learning stages. The model is trained following a curriculum learning principle, in which training objectives and progressively more challenging task examples are introduced in order (a configuration sketch follows the list below).
- The first stage is language-image alignment, whose goal is to align visual features with the word embedding space of the LLM.
- The next stage is high-quality knowledge learning. To balance computational efficiency against the injection of new information into the LMM, this stage is restricted to the carefully curated high-quality data described above.
- The researchers then perform visual instruction tuning, categorizing the instruction data into multiple sets to train the LMM to respond appropriately to various visual tasks. The procedure has two distinct steps: (i) Single-image training: after training on 3.2 million single-image samples, the model develops a strong ability to follow diverse instructions to complete visual tasks involving a single image. (ii) OneVision training: using a mixture of video, single-image, and multi-image data, the model learns to handle scenarios beyond a single image. Emergent capabilities appear as it learns to follow instructions across diverse settings and transfers that knowledge to new scenarios.
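The following is a minimal sketch of how this staged curriculum might be expressed as a training configuration; the stage names, data mixtures, and which modules are trainable at each stage are assumptions based on the description above, not the authors' exact settings.

```python
# Hypothetical staged-training configuration mirroring the curriculum described above.
STAGES = [
    {
        "name": "stage1_language_image_alignment",
        "data": ["alignment_captions"],
        "trainable": ["projector"],  # connect vision features to the LLM embedding space
    },
    {
        "name": "stage1.5_high_quality_knowledge",
        "data": ["recaptioned_descriptions", "document_ocr",
                 "chinese_captions", "language_instructions"],
        "trainable": ["projector", "vision_encoder", "llm"],
    },
    {
        "name": "stage2_single_image_instruction_tuning",
        "data": ["single_image_instructions_3_2M"],
        "trainable": ["projector", "vision_encoder", "llm"],
    },
    {
        "name": "stage2_onevision_instruction_tuning",
        "data": ["single_image_instructions", "multi_image_instructions",
                 "video_instructions"],
        "trainable": ["projector", "vision_encoder", "llm"],
    },
]

def run_curriculum(stages, train_fn):
    """Run each stage in order, initializing from the previous stage's checkpoint."""
    checkpoint = None
    for stage in stages:
        checkpoint = train_fn(stage, init_from=checkpoint)
    return checkpoint
```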
Using LMMs-Eval, the researchers perform consistent and reproducible testing across all benchmarks to evaluate the LLaVA-OneVision models. Where available, they report numbers from the original papers so the comparison with other prominent LMMs is fair; otherwise, they load the models into LMMs-Eval and test them with consistent parameters. Unless otherwise stated, all results use greedy decoding and 0-shot settings. To assess the efficacy and generalizability of the proposed paradigm, they evaluate the LLaVA-OneVision models across single-image, multi-image, and video modalities. The checkpoints obtained after the single-image and OneVision stages of training are referred to as LLaVA-OV (SI) and LLaVA-OV, respectively. Three model sizes (0.5B, 7B, and 72B) accommodate different trade-offs between performance and compute, from edge devices to cloud services.
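For readers who want to try the greedy, 0-shot setting outside LMMs-Eval, the snippet below is a hedged sketch based on the Hugging Face transformers integration of LLaVA-OneVision; the checkpoint ID is an assumed community release and this is not the authors' evaluation code.

```python
# Hedged sketch: class names follow the transformers integration of LLaVA-OneVision;
# the checkpoint ID is an assumption and the official evaluation uses LMMs-Eval.
import torch
from PIL import Image
from transformers import LlavaOnevisionProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed community checkpoint
processor = LlavaOnevisionProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image in one sentence."}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = Image.open("example.jpg")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
# Greedy decoding (do_sample=False) mirrors the paper's 0-shot evaluation setting.
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```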
GPT-4V and GPT-4o serve as reference points for these findings. The largest model, LLaVA-OneVision-72B, produces results superior to GPT-4V on most benchmarks, showing that the recipe is effective and bodes well for future scaling efforts. However, a significant gap to GPT-4o remains on more complicated tasks such as visual chat scenarios; the team leaves closing it to future work on stronger LLMs, larger training datasets, and better preference learning.
Dhanshree Shenwai is a Computer Science Engineer with extensive experience in FinTech companies spanning the Finance, Cards & Payments and Banking space and is keenly interested in the applications of artificial intelligence. She is excited to explore new technologies and advancements in today’s ever-changing world, making life easier for everyone.