For the image encoder, the authors compared CLIP and AIM models, varying the image resolution and the dataset each encoder was trained on. The following table shows the results of each ablation.
Let's go over the main pieces above and explain what they are.
CLIP stands for Contrastive Language-Image Pre-training and is intended to help the model learn visual concepts by pairing images with natural-language descriptions of what they show. As the image below illustrates, images and text are encoded jointly so that the model eventually connects the vision tokens (represented in the image below as I) with the text tokens (represented as T). This method is called contrastive training.
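To make this concrete, here is a minimal sketch of a CLIP-style contrastive loss in PyTorch. The encoder outputs and the temperature value are placeholders of mine, not the paper's exact setup.

```python
# Minimal sketch of a CLIP-style contrastive loss (illustrative, not the
# authors' implementation). Assumes `image_emb` and `text_emb` are batches
# of embeddings produced by any image/text encoder pair.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that similarity is the cosine between embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal, so row i's "correct class" is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # images -> captions
    loss_texts = F.cross_entropy(logits.t(), targets)   # captions -> images
    return (loss_images + loss_texts) / 2
```

Pulling the diagonal entries up while pushing the rest down is what eventually links the I tokens to the T tokens in the diagram.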
AIM stands for Autoregressive Image Model and is trained with a reconstructive loss. The goal here is to see whether the transformer can recreate (reconstruct) the image it is given.
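For contrast, a rough sketch of an autoregressive reconstruction objective might look like the following. The raster ordering of patches, the plain MSE loss, and the `model` interface are simplifications of mine rather than AIM's exact recipe.

```python
# Rough sketch of an autoregressive patch-reconstruction loss (simplified).
# `model` is assumed to be a causal transformer that maps a sequence of
# patch vectors to predictions of the next patch's pixel values.
import torch
import torch.nn.functional as F

def autoregressive_reconstruction_loss(model, patches):
    # patches: (batch, num_patches, patch_dim) of flattened pixel values,
    # arranged in a fixed raster order.
    inputs = patches[:, :-1]     # the model sees patches 1..N-1
    targets = patches[:, 1:]     # and must reconstruct patches 2..N
    predictions = model(inputs)  # causal attention keeps this autoregressive
    return F.mse_loss(predictions, targets)
```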
Image resolution here refers to the number of pixels fed to the transformer. For example, an image resolution of 378 x 378 means we pass in a pixel matrix of that size, which is then converted into embeddings that the model is trained on. The training data was drawn from DFN-2B, DFN-5B, DFN-5B + VeCap, and ImageText-400M.
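As a quick illustration of how an image of that resolution becomes embeddings, the snippet below splits a 378 x 378 image into patches with a strided convolution; the patch size of 14 and the embedding width of 1024 are assumptions for illustration only.

```python
# Turning a 378 x 378 image into a sequence of patch embeddings.
# Patch size and embedding width are illustrative assumptions.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 378, 378)           # (batch, channels, height, width)
patch_size, embed_dim = 14, 1024

# A strided convolution cuts the image into patches and linearly projects
# each one in a single step.
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = patchify(image)                      # (1, 1024, 27, 27)
tokens = tokens.flatten(2).transpose(1, 2)    # (1, 729, 1024): 729 patch embeddings
print(tokens.shape)
```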
The authors found that image resolution mattered most, followed by model size, and then the composition of the training data. Specifically, they saw that the higher the image resolution, the better the model tended to perform on both zero-shot and few-shot metrics. Since more compute is needed to train and run models at higher image resolutions, this suggests that for Vision Transformers, compute will remain of utmost importance.
For the VL Connector, they tested using either 64 or 144 tokens for the image, tested image resolutions of 224, 336, and 378, and chose between a few architectures. I'll briefly go over those architectures below.
Average Pooling is exactly what it sounds like: take the average of all the image tokens, then apply a linear projection of this average so that the grid of tokens is 8×8 or 12×12.
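A minimal sketch of what such a connector could look like, assuming the image encoder emits a square grid of patch tokens; the dimensions here are illustrative, not the paper's exact module.

```python
# Sketch of an average-pooling connector (illustrative dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AveragePoolConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, out_grid=12):
        super().__init__()
        self.out_grid = out_grid                    # 8 -> 64 tokens, 12 -> 144 tokens
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_tokens):                # (batch, num_patches, vision_dim)
        b, n, d = image_tokens.shape
        side = int(n ** 0.5)                        # e.g. 27 for 729 patch tokens
        grid = image_tokens.transpose(1, 2).reshape(b, d, side, side)
        pooled = F.adaptive_avg_pool2d(grid, self.out_grid)   # (b, d, 12, 12)
        pooled = pooled.flatten(2).transpose(1, 2)             # (b, 144, d)
        return self.proj(pooled)                    # project into the LLM's embedding space
```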
Attention Pooling works on the assumption that image tokens should be treated as samples from a fundamentally different population than text tokens. Here the number of tokens produced for each image is controlled by what the paper calls k learnable queries. The researchers only considered k values of 64 and 144.
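A comparable sketch of attention pooling with k learnable queries; the head count and dimensions are my own assumptions.

```python
# Sketch of attention pooling: k learnable queries attend over the image
# tokens, and whatever they read out becomes the k tokens handed to the LLM.
import torch
import torch.nn as nn

class AttentionPoolConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, k=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_tokens):                 # (batch, num_patches, vision_dim)
        b = image_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(queries, image_tokens, image_tokens)  # (batch, k, vision_dim)
        return self.proj(pooled)                     # exactly k tokens per image for the LLM
```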
Convolutional Mapping is a method from the Honeybee paper that uses a ResNet to dynamically decide how many tokens to pass from the image to the LLM. This is realized in the C-Abstractor module.
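And a loose sketch in the same spirit as a convolutional connector; this is a simplification of mine rather than Honeybee's actual C-Abstractor, which uses full ResNet blocks.

```python
# Loose sketch of a convolutional connector (simplified stand-in for the
# C-Abstractor): conv layers mix information across the token grid, then
# adaptive pooling decides how many tokens survive.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, out_grid=12):
        super().__init__()
        self.out_grid = out_grid
        self.conv = nn.Sequential(
            nn.Conv2d(vision_dim, vision_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(vision_dim, vision_dim, kernel_size=3, padding=1),
        )
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_tokens):                 # (batch, num_patches, vision_dim)
        b, n, d = image_tokens.shape
        side = int(n ** 0.5)
        grid = image_tokens.transpose(1, 2).reshape(b, d, side, side)
        grid = grid + self.conv(grid)                # residual conv block over the grid
        pooled = F.adaptive_avg_pool2d(grid, self.out_grid)
        return self.proj(pooled.flatten(2).transpose(1, 2))
```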
As can be seen from the results above, the different architectures actually had very little impact. As one might expect, higher-resolution images and a larger number of tokens led to higher performance across all connectors, but not dramatically so.
This finding suggests that we have not found a significantly better way to connect the image encoder to the LLM, or that this area is simply not where the great models will differentiate themselves.
Here, the authors experimented with four types of data: images with captions, images with synthetic captions, interleaved image-and-text data, and text-only data. They found four lessons, each with a graph summarizing the changes in performance.
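Before going through the lessons, here is a toy illustration of the difference between a captioned example and an interleaved one; the field names and file names are made up.

```python
# Toy illustration of the two image data formats (made-up examples).
captioned_example = {
    "image": "photo_001.jpg",
    "text": "A golden retriever catching a frisbee in a park.",
}

interleaved_example = {
    "sequence": [
        {"type": "text", "value": "The hike started at the trailhead shown here:"},
        {"type": "image", "value": "photo_014.jpg"},
        {"type": "text", "value": "Two hours later the fog rolled in,"},
        {"type": "image", "value": "photo_015.jpg"},
        {"type": "text", "value": "and we turned back before the summit."},
    ],
}
```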
First, interleaved data helps with few-shot and text-only performance, while captioned data helps with zero-shot performance. The researchers varied how much interleaving they did, and the graph below shows the results. As you can see, few-shot prompts performed noticeably better on models trained with interleaved data than on models trained without it.
Second, text-only data helps with few-shot reasoning. Text-only in this context means the training data includes both image examples and text-only examples; this was done to ensure the model understands human language as well as images. Comparing caption-only with caption-plus-text data shows a marked improvement on everything except 0-shot reasoning; however, interleaved-only performs better than interleaved-plus-text on everything except the TextCore test.
Third, if you get the right mix of image and text data, you can get really solid performance. The graph above shows different ratios of interleaved-plus-captioned data to text-only data. Since the goal is a multimodal model, they never tested performance with no image data at all. The authors note that the 91/9 ratio most consistently produced the best results.
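A minimal sketch of what sampling at that ratio could look like, assuming the captioned, interleaved, and text-only datasets are plain lists of already-tokenized examples.

```python
# Sketch of sampling a training batch at a fixed image-to-text ratio.
import random

def sample_batch(image_datasets, text_dataset, batch_size=32, image_ratio=0.91):
    batch = []
    for _ in range(batch_size):
        if random.random() < image_ratio:
            source = random.choice(image_datasets)   # captioned or interleaved data
        else:
            source = text_dataset                    # text-only data
        batch.append(random.choice(source))
    return batch
```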
Fourth, synthetic data helps with few-shot learning. VeCap stands for Visual-enriched Captions, a way of creating captions that describe the key visual elements of an image. By contrast, imagine a caption that might explain the meaning behind a photo but doesn't describe any of the elements in it. You would typically use synthetic captions when your data scraper finds images with poor alt-text.
The authors concluded that VeCap provides a "non-trivial" boost to few-shot reasoning, but yields a relatively small increase in quality overall. This raises questions about its cost-effectiveness.
Using the results of their ablations, the authors built their final model in two forms: a Mixture-of-Experts (MoE) version and a standard dense version. Both use an image encoder at 378 x 378 resolution, pre-trained solely on the DFN-5B dataset. The training mix is 45% captioned data, 45% interleaved data, and 10% text-only data (approximating the 91:9 image-to-text ratio). The VL Connector uses 144 tokens, and they chose the C-Abstractor, although they note this was a somewhat arbitrary choice. For the LLM itself, they trained 3B-, 7B-, and 30B-parameter models (the MoE version only goes up to 7B). The following graph shows the performance of these models.
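Spelled out as a plain configuration, the final recipe looks roughly like this; the field names are my own shorthand, not the authors' code.

```python
# The final recipe from the ablations, written as a plain config dict
# (shorthand field names of my own).
final_recipe = {
    "image_encoder": {"resolution": 378, "pretraining_data": "DFN-5B"},
    "vl_connector": {"type": "C-Abstractor", "num_image_tokens": 144},
    "data_mix": {"captioned": 0.45, "interleaved": 0.45, "text_only": 0.10},
    "llm_sizes": ["3B", "7B", "30B"],   # the MoE variants only go up to 7B
}
```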
Interestingly, the 30B-parameter model performs on par with other models that have billions more parameters (LLaVA-NeXT-34B, etc.), suggesting that the relationship between parameter count and performance may not be a simple linear one.
Multimodal LLMs are an incredibly exciting part of the field. As we find better ways to convert different types of data into tokens, we can unlock even greater applications for these transformers. Looking to the future, it is not unreasonable to consider how other senses could be fed in directly rather than through a textual description, such as sound, smell, or even touch. Data quality is likely to become increasingly valuable.
Since the authors concluded that different VL Connectors don't make a big difference, it will be interesting to see whether this means research should focus on the image encoder, or whether we simply haven't found a truly innovative way to use the VL Connector.
Beyond this specific paper, one of the big questions that arises is how these MLLMs will perform outside of benchmarks. As LLMs have proliferated, a common criticism has centered on the use of benchmarks to compare them. These benchmarks often use a consistent dataset for comparison, which lets a model score better simply by overfitting to it, even unintentionally. Using methodologies such as Elo, the chess rating algorithm, as lmsys does in its Chatbot Arena, can give a better real-world comparison of model performance.
Finally, as more inputs can be connected to LLMs, the number of applications to which they can be applied can be expected to increase. Only time will tell how useful we can make this technology.