In a recent study, a team of researchers examined CLIP (Contrastive Language-Image Pre-training), a well-known neural network that learns visual concepts effectively from natural language supervision. CLIP, which predicts the most relevant snippet of text for a given image, has helped advance vision-and-language modeling. Although CLIP has established itself as a foundation model for a wide range of applications, CLIP models exhibit biases related to visual text, color, gender, and more.
A team of researchers from Shanghai AI Laboratory, Show Lab at the National University of Singapore, and Sun Yat-Sen University has examined CLIP's visual text bias, particularly its tendency to spot text in images. The team profiled the LAION-2B dataset in detail and found that estimating the bias accurately is difficult given the enormous volume of image-text data.
To work around this, they clustered the full dataset on image embeddings and characterized each cluster by its CLIP scores, aiming to determine which kinds of image-text pairs are most favored under the CLIP score measure. Many of the examples with the highest CLIP scores contain dense text that appears concurrently in both the caption and the image pixels.
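The sketch below illustrates the general idea of this kind of analysis; it is not the authors' code, and the model checkpoint, cluster count, and helper names are assumptions chosen for the example.

```python
# Minimal sketch: cluster images by their CLIP embeddings, then rank clusters
# by mean image-text CLIP score to see which pair types the score favors.
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())

def cluster_and_rank(images, captions, n_clusters=10):
    """Cluster images by embedding, then sort clusters by mean CLIP score."""
    embs = []
    for img in images:
        inputs = processor(images=[img], return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        embs.append((emb / emb.norm(dim=-1, keepdim=True)).squeeze(0).numpy())
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(np.stack(embs))
    scores = np.array([clip_score(img, cap) for img, cap in zip(images, captions)])
    # Clusters whose pairs sit at the top of this ranking are the ones CLIP "prefers".
    return sorted(range(n_clusters), key=lambda c: -scores[labels == c].mean())
```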
The captions matching these samples have been dubbed 'parrot captions', because they appear to give CLIP a shortcut to its training objective: the model learns to spot and transcribe text without necessarily grasping the underlying visual concepts. The team studied the impact of parrot captions from three angles: the dataset itself, widely used released models, and the model training procedure.
The team discovered a notable bias in how visual text embedded in images is described by LAION-2B captions. By carefully profiling the LAION-2B dataset with commercial text detection tools, they found that more than 50% of the images contain visual text content. Their analysis of the paired image and text data showed that in more than 90% of such pairs the caption shares at least one word with the text detected in the image, and that captions and embedded text overlap by roughly 30% of their words. This suggests that, when trained on LAION-style data, CLIP deviates significantly from the fundamental assumption of semantic congruence between image and text.
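A toy sketch of the overlap statistic is shown below; the tokenization and the exact overlap definition are illustrative assumptions, not the paper's profiling pipeline, and the OCR text is assumed to come from a commercial text detector.

```python
# Measure word overlap between a caption and the OCR-detected text of its image.
import re

def words(text: str) -> set:
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def caption_text_overlap(caption: str, ocr_text: str) -> float:
    """Fraction of caption words that also appear in the image's detected text."""
    cap, ocr = words(caption), words(ocr_text)
    return len(cap & ocr) / len(cap) if cap else 0.0

# Example of a "parrot caption" that simply spells out the text rendered in the image.
print(caption_text_overlap("Keep Calm and Carry On poster", "KEEP CALM AND CARRY ON"))
```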
The study then analyzed biases in released CLIP models, specifically a strong bias toward spotting text in various kinds of web images. The team compared alignment scores before and after removing the visual text to examine how OpenAI's publicly released CLIP model behaves on the LAION-2B dataset. The findings show a strong association between the visual text embedded in the images, the corresponding parrot captions, and the CLIP model's predictions.
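The before/after comparison could look roughly like the sketch below. It is a simplification: the removal step here just covers detected text boxes with a flat patch rather than properly inpainting them, and it reuses a generic `clip_score(image, caption)` helper such as the one sketched earlier.

```python
# Approximate "text removal" by greying out OCR-detected text boxes, then compare
# the image-text alignment score before and after.
from PIL import Image, ImageDraw

def remove_text_regions(image: Image.Image, boxes):
    """Cover detected text boxes (x0, y0, x1, y1) with a flat grey patch."""
    edited = image.copy()
    draw = ImageDraw.Draw(edited)
    for box in boxes:
        draw.rectangle(box, fill=(127, 127, 127))
    return edited

def alignment_drop(image, caption, boxes, clip_score):
    """How much the CLIP score falls once the visual text is hidden.
    A large drop suggests the score was driven by text spotting rather than semantics."""
    return clip_score(image, caption) - clip_score(remove_text_regions(image, boxes), caption)
```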
The team also benchmarked the text-spotting capabilities of the CLIP and OpenCLIP models and found that OpenCLIP, trained on LAION-2B, exhibits a stronger text-spotting bias than CLIP, trained on WIT-400M. Finally, the research examined how CLIP models quickly acquire text-spotting skills from parrot captions while struggling to connect vision and language semantics.
Several subsets of LAION-2B were sampled based on text-oriented criteria such as the proportion of embedded text, the proportion of concurrent words, and the relative CLIP score after text removal. The findings show that CLIP models trained on parrot-caption data acquire strong text-spotting capabilities but lose most of their zero-shot generalization ability on downstream image-text tasks.
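As a rough illustration of such criteria-based sampling, the filter below keeps only pairs with little embedded text, no parroted words, and a CLIP score that barely changes when the visual text is removed; the field names and thresholds are hypothetical, not the paper's exact settings.

```python
# Hypothetical subset selection over per-pair text-oriented statistics.
from dataclasses import dataclass

@dataclass
class PairStats:
    text_area_ratio: float          # fraction of image area covered by detected text
    concurrent_word_ratio: float    # share of caption words also found in the OCR text
    rel_score_after_removal: float  # CLIP score after text removal / original score

def select_text_free_subset(stats, max_text_area=0.05, max_concurrent=0.0, min_rel_score=0.95):
    """Return indices of pairs that show little or no parrot-caption behavior."""
    return [i for i, s in enumerate(stats)
            if s.text_area_ratio <= max_text_area
            and s.concurrent_word_ratio <= max_concurrent
            and s.rel_score_after_removal >= min_rel_score]
```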
In conclusion, this study has examined the effect of parrot captions on CLIP training. It sheds light on the visual-text bias in LAION-2B captions and highlights the text-spotting bias in released CLIP models.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.