Article Summary: Large-scale web-crawled datasets are critical to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges to achieving accurate image-text alignment. Existing methods that employ large language models (LLMs) for caption rewriting have shown promise on small, curated datasets such as CC3M and CC12M. This study presents a scalable pipeline for rewriting noisy captions. Unlike recent LLM rewriting techniques, we emphasize incorporating visual concepts into the captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside the newly generated VeCap. We show how this method can be adapted to train CLIP on large-scale web-crawled datasets, yielding VeCLIP. Using this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves gains of up to +25.2% on the COCO and Flickr30k retrieval tasks under the 12M setting. In terms of data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data used by vanilla CLIP and 11% of that used by ALIGN. We also observe that the VeCap data is complementary to other well-curated datasets tailored to zero-shot classification tasks. By combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification, e.g., 83.1% zero-shot accuracy@1 on ImageNet for an H/14 model.
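
The mixed training scheme mentioned above draws each training caption from either the original AltText or its VeCap rewrite. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the function name `sample_caption` and the mixing ratio `p_vecap` are assumptions introduced here for clarity.

```python
import random

def sample_caption(alt_text: str, vecap: str, p_vecap: float = 0.5) -> str:
    """Pick either the raw AltText or the LLM-rewritten VeCap caption
    for a given image, so both caption sources are seen during training.
    The 50/50 mixing ratio is an assumed default, not a value from the paper."""
    return vecap if random.random() < p_vecap else alt_text

# Hypothetical usage inside a data-loading loop: each image record keeps its
# original AltText and its visually-enriched rewrite.
example = {
    "alt_text": "photo from our summer trip",
    "vecap": "A golden retriever running along a sandy beach at sunset",
}
caption = sample_caption(example["alt_text"], example["vecap"], p_vecap=0.5)
```

Sampling per example (rather than training on only one caption source) is what preserves the diversity of the raw AltTexts while still benefiting from the enriched visual concepts in VeCap.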