Data is the new soil, and in this new fertile ground, MIT researchers are planting more than just pixels. By using synthetic images to train machine learning models, a team of scientists recently surpassed the results obtained with traditional “real image” training methods.
At the core of the approach is a system called StableRep, which doesn’t simply use any synthetic images; it generates them through today’s ultra-popular text-to-image models like Stable Diffusion. It’s like creating worlds with words.
So what’s in StableRep’s secret sauce? A strategy called “multi-positive contrastive learning.”
“We’re teaching the model to learn more about high-level concepts through context and variation, not just feeding it data,” says Lijie Fan, an MIT PhD student in electrical engineering, affiliated with the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead researcher on the work. “When multiple images, all generated from the same text, are treated as representations of the same underlying thing, the model delves deeper into the concepts behind the images, say the object, and not just its pixels.”
This approach treats multiple images generated from identical text prompts as positive pairs, providing additional information during training, not only adding more diversity but also telling the vision system which images are similar and which are different. Remarkably, StableRep eclipsed the prowess of top-level models trained on real images, such as SimCLR and CLIP, on extensive data sets.
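To make the mechanism concrete, here is a minimal PyTorch sketch of what a multi-positive contrastive objective can look like: every image generated from the same caption counts as a positive for every other one, and the loss is a cross-entropy against that soft target distribution. The function name, temperature, and batching scheme are illustrative assumptions, not StableRep’s exact implementation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Sketch of a multi-positive contrastive loss: images generated from the
    same caption are treated as positives of one another."""
    z = F.normalize(embeddings, dim=1)                 # (N, D) unit-norm features
    logits = z @ z.t() / temperature                   # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)       # exclude self-comparisons

    # Target distribution per anchor: uniform over the *other* images that
    # were generated from the same caption (the "multi-positives").
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask
    targets = pos_mask.float() / pos_mask.sum(dim=1, keepdim=True).clamp(min=1)

    log_prob = F.log_softmax(logits, dim=1)
    return -(targets * log_prob).sum(dim=1).mean()     # cross-entropy vs. soft target
```

With only two images per caption in a batch, the soft target collapses to a one-hot vector and the objective reduces to the familiar single-positive, InfoNCE-style contrastive loss used by models such as SimCLR.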
“While StableRep helps mitigate data acquisition challenges in machine learning, it also ushers in a new era of AI training techniques. The ability to produce a variety of high-caliber synthetic images could help reduce burdensome resources and expenses,” says Fan.
The data collection process has never been easy. In the 1990s, researchers had to manually capture photographs to assemble data sets of objects and faces. In the 2000s, they scoured the internet for data. However, this raw, uncurated data often contained discrepancies compared to real-world scenarios and reflected societal biases, presenting a distorted view of reality. The task of cleansing data sets through human intervention is not only expensive but also extremely challenging. Imagine, though, if this arduous data collection could be distilled into something as simple as issuing a command in natural language.
A key aspect of StableRep’s success is the tuning of the “guidance scale” in the generative model, which ensures a delicate balance between the diversity and fidelity of the synthetic images. When properly tuned, the synthetic images used to train these self-supervised models were found to be as effective, if not more so, than real images.
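As an illustration of that knob, the sketch below generates several synthetic images for a single caption with Stable Diffusion through the Hugging Face diffusers library. The model checkpoint, guidance_scale value, and number of images per caption are placeholder assumptions, not the settings used in the study.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly available Stable Diffusion checkpoint (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a red fox standing in tall grass at sunrise"
out = pipe(
    caption,
    num_images_per_prompt=4,  # several positives for the same caption
    guidance_scale=2.0,       # lower values favor diversity, higher values fidelity
)
images = out.images           # list of PIL images, all depicting the same concept
```

Lowering the guidance scale trades prompt fidelity for sample diversity, which is exactly the balance the researchers describe having to strike before the synthetic images became useful for self-supervised training.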
Going a step further, language supervision was added to the mix, creating an enhanced variant: StableRep+. When trained with 20 million synthetic images, StableRep+ not only achieved superior accuracy but also displayed remarkable efficiency compared to CLIP models trained with a staggering 50 million real images.
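The added language supervision can be pictured as a CLIP-style image-text contrastive term layered on top of the image-image objective. The sketch below assumes, for simplicity, one image per caption in the batch for this term; the function name and the way the two losses are combined are illustrative assumptions, not the paper’s recipe.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric loss; row i of image_emb is assumed to come from
    the caption encoded in row i of text_emb."""
    img = F.normalize(image_emb, dim=1)
    txt = F.normalize(text_emb, dim=1)
    logits = img @ txt.t() / temperature               # image-to-text similarities
    labels = torch.arange(img.size(0), device=img.device)
    # Cross-entropy in both directions: match images to captions and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Illustrative combined objective for a StableRep+-like setup (weighting assumed):
# loss = multi_positive_contrastive_loss(img_emb, caption_ids) \
#        + image_text_contrastive_loss(img_emb_one_per_caption, txt_emb)
```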
However, the road ahead is not without its bumps. The researchers candidly address several limitations, including the current slowness of image generation, semantic mismatches between text prompts and the resulting images, potential amplification of biases, and complexities in image attribution, all of which are imperative to address for future advancements. Another issue is that StableRep first requires training the generative model on large-scale real data. The team acknowledges that starting with real data remains a necessity; however, once a good generative model is in hand, it can be repurposed for new tasks, such as training recognition models and visual representations.
While StableRep offers a good solution by decreasing the reliance on large collections of real images, it highlights concerns about hidden biases within the uncurated data used for these text-to-image models. The choice of text prompts, an integral part of the image synthesis process, is not completely free of bias, “indicating the essential role of meticulous text selection or possible human curation,” says Fan.
“By using the latest text-to-image models, we have gained unprecedented control over image generation, enabling a wide range of visual elements from a single text input. This surpasses real-world image collection in efficiency and versatility. It is especially useful in specialized tasks, such as balancing image variety in long-tail recognition, which makes it a practical complement to using real images for training,” says Fan. “Our work represents a step forward in visual learning, towards the goal of offering cost-effective training alternatives while highlighting the need for continuous improvements in data quality and synthesis.”
“One of the dreams of generative model learning has long been to be able to generate useful data for training discriminative models,” says David Fleet, a Google DeepMind researcher and professor of computer science at the University of Toronto, who was not involved in the paper. “While we have seen some signs of life, the dream has been elusive, especially in complex large-scale domains like high-resolution imaging. This paper provides compelling evidence, for the first time to my knowledge, that the dream is becoming a reality. They show that contrastive learning from massive amounts of synthetic image data can produce representations that outperform those learned from real data at scale, with the potential to improve countless downstream vision tasks.”
Fan is joined by Yonglong Tian PhD ’22 as lead authors of the paper, along with MIT associate professor of electrical engineering and computer science and CSAIL principal investigator Phillip Isola; Google researcher and OpenAI technical staff member Huiwen Chang; and Google staff research scientist Dilip Krishnan. The team will present StableRep at the 2023 Conference on Neural Information Processing Systems (NeurIPS) in New Orleans.