Representation learning extracts and organizes structure from raw, often unlabeled data. How good a representation a model can learn depends on the quantity, quality, and diversity of that data: the model reflects the collective knowledge embedded in it, and what comes out is only as good as what goes in. Unsurprisingly, today's most effective visual representation learning algorithms rely on massive real-world datasets. Collecting real data, however, has its own challenges. Gathering large amounts of uncurated data is cheap and feasible, but adding uncurated data yields diminishing returns at large scale, indicating poor scaling behavior for self-supervised representation learning with this approach. Alternatively, one can collect curated data at a smaller scale, but models trained this way tend to handle only very specific tasks.
To reduce this burden, new research from Google Research and MIT CSAIL investigates whether synthetic data produced by commercially available generative models can stand in for the large-scale curated datasets needed to train next-generation visual representations. The authors call this approach learning from models, in contrast to learning directly from data. Using models as a data source for building large-scale training sets brings several advantages: the latent variables, conditioning variables, and hyperparameters of generative models provide new levers for curating the data. Models are also more compact than the datasets they were trained on, making them easier to store and share, and they can generate effectively unlimited data samples, albeit with limited variability.
In this study, the researchers rethink the granularity of visual classes using generative models. Consider, for example, four images generated from the two prompts “A cute golden retriever sitting in a house made of sushi” and “A golden retriever, wearing sunglasses and a beach hat, rides a bicycle.” Traditional self-supervised methods such as SimCLR treat each image as its own class, pushing apart the embeddings of different images without explicitly accounting for their shared semantics. At the other extreme, supervised learning (e.g., SupCE) treats all of these images as belonging to a single class such as “golden retriever.”
Since collecting multiple images described by a given caption is non-trivial, especially as the number of captions grows, this level of granularity is difficult to achieve with real data. Text-to-image diffusion models, on the other hand, have this capacity built in: conditioned on the same caption but different noise inputs, they can generate many distinct images that all match that caption.
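As a concrete illustration of this property, here is a minimal sketch using the Hugging Face diffusers library, with Stable Diffusion v1.5 as a stand-in generator (the paper's exact model and sampling settings may differ). Only the random seed changes between samples, so each image is a different rendering of the same caption.

```python
# Minimal sketch: generate several images for one caption by varying only
# the initial noise seed. Model choice and settings are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "A golden retriever, wearing sunglasses and a beach hat, rides a bicycle"

# Different seeds -> different noise -> distinct images that all match the caption.
images = [
    pipe(caption, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
    for seed in range(4)
]
```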
The findings show that this caption-level granularity outperforms both SimCLR-style and supervised training. A further advantage is that this definition of visual classes is easily extensible: new classes (and data) can be synthesized on the fly, in principle allowing scaling to an unlimited number of classes, unlike ImageNet-1k/21k, where the number of classes is fixed. The proposed system consists of three stages:
- The first stage synthesizes a large collection of image captions. Using in-context examples that map words to captions, the team developed a scalable method that leverages the in-context learning capabilities of large language models (LLMs); a sketch of this step appears after this list.
- The next stage generates many synthetic images from these captions using a text-to-image diffusion model, producing a dataset of 600 million images.
- Finally, they train visual representation models with masked image modeling and a multi-positive contrastive objective (sketched below).
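For the caption-synthesis stage, the sketch below shows the general idea of in-context word-to-caption generation. The prompt template, in-context examples, and backend LLM (here the OpenAI chat API, used purely as an example) are illustrative placeholders, not the authors' exact setup.

```python
# Minimal sketch of in-context caption synthesis: a few hand-written
# (concept -> caption) examples steer an LLM to produce new captions for
# arbitrary visual concepts. Prompt wording, examples, and the backend
# model are placeholders.
from openai import OpenAI

client = OpenAI()

IN_CONTEXT_EXAMPLES = [
    ("golden retriever", "A golden retriever, wearing sunglasses and a beach hat, rides a bicycle."),
    ("sushi", "A cute golden retriever sitting in a house made of sushi."),
]

def synthesize_caption(concept: str) -> str:
    # Build a few-shot prompt: concept -> caption pairs, then the new concept.
    shots = "\n".join(f"Concept: {c}\nCaption: {cap}" for c, cap in IN_CONTEXT_EXAMPLES)
    prompt = f"{shots}\nConcept: {concept}\nCaption:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,      # high temperature encourages diverse captions
    )
    return response.choices[0].message.content.strip()
```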
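For the training stage, the multi-positive contrastive objective generalizes SimCLR's single-positive loss: all images generated from the same caption are treated as mutual positives. The sketch below is a simplified version of such a loss, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(z: torch.Tensor,
                                    caption_ids: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """Treat all images generated from the same caption as mutual positives.

    z:           (N, D) image embeddings
    caption_ids: (N,)   integer id of the caption each image came from
    """
    z = F.normalize(z, dim=1)
    logits = z @ z.t() / temperature               # (N, N) pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)   # exclude self-similarity

    # Positives: same caption id, excluding the anchor itself.
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask

    log_prob = F.log_softmax(logits, dim=1)
    pos_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))

    # Average over each anchor's positives, then over anchors with >= 1 positive.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -pos_log_prob.sum(dim=1) / pos_counts
    return per_anchor[pos_mask.any(dim=1)].mean()
```

SimCLR corresponds to the special case where each anchor has exactly one positive (its augmented view); here, the caption id effectively plays the role of a class label.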
With SynCLR pre-training, ViT-B reaches 80.7% and ViT-L 83.0% linear probing accuracy on ImageNet-1K, comparing favorably with OpenAI's CLIP. On fine-grained classification tasks, SynCLR achieves results comparable to DINO v2 models distilled from a pre-trained ViT-g, and it outperforms CLIP by 3.3% for ViT-B and 1.5% for ViT-L. On ADE20k semantic segmentation, under the same configuration, SynCLR outperforms MAE pretrained on ImageNet by 6.2 and 4.1 mIoU for ViT-B and ViT-L, respectively. This demonstrates that SynCLR transfers well to dense prediction tasks, much like DINO v2, which however requires an additional high-resolution (518 × 518) training stage that SynCLR does without.
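For context, linear probing freezes the pretrained backbone and trains only a linear classifier on its features. A minimal sketch, assuming a generic PyTorch backbone and data loader rather than the paper's exact evaluation protocol:

```python
import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int,
                 train_loader, epochs: int = 10, lr: float = 0.1) -> nn.Linear:
    """Train only a linear head on top of frozen backbone features."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)           # freeze the representation

    head = nn.Linear(feat_dim, num_classes).cuda()
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.cuda(), labels.cuda()
            with torch.no_grad():
                feats = backbone(images)  # frozen features
            loss = criterion(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```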
The team notes several ways the caption set could be improved, for example by using more sophisticated LLMs, refining the sampling ratios among different concepts, and expanding the library of in-context examples. The learning process could likewise be improved by adding a high-resolution training phase or an intermediate ImageNet-21k fine-tuning stage after distilling knowledge from a larger model. They also suggest that better model initialization, along with the integration of SwiGLU and LayerScale, could yield architectural benefits. However, they leave these directions to future work, given limited resources and the fact that this paper did not aim to achieve the highest possible metrics.
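For reference, SwiGLU replaces the standard MLP in a transformer block with a gated feed-forward, and LayerScale adds a learnable per-channel scale on the residual branch. The sketch below illustrates both in a generic form; it is not a configuration the authors evaluated, and normalization and placement would follow whatever ViT variant is used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Transformer feed-forward block with a SwiGLU gate and LayerScale."""
    def __init__(self, dim: int, hidden_dim: int, layerscale_init: float = 1e-5):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim)
        self.w_up = nn.Linear(dim, hidden_dim)
        self.w_down = nn.Linear(hidden_dim, dim)
        # LayerScale: learnable per-channel scale on the residual branch,
        # initialized near zero so the block starts close to identity.
        self.gamma = nn.Parameter(layerscale_init * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.w_gate(x)) * self.w_up(x)  # SwiGLU gating
        return x + self.gamma * self.w_down(h)     # scaled residual connection
```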
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.