Article Summary: Large-scale web-crawled datasets are critical to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges to achieving accurate image-text alignment. Existing methods that employ large language models (LLMs) for caption rewriting have shown promise on small, curated datasets such as CC3M and CC12M. This study presents a scalable pipeline for rewriting noisy captions. Unlike recent LLM rewriting techniques, we emphasize incorporating visual concepts into the captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside the newly generated VeCap. We show how this method can be adapted to train CLIP on large-scale web-crawled datasets, yielding VeCLIP. Using this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves gains of up to +25.2% on the COCO and Flickr30k retrieval tasks under the 12M setting. In terms of data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data used by vanilla CLIP and 11% of that used by ALIGN. We also observe that the VeCap data is complementary to other well-curated datasets tailored to zero-shot classification tasks. By combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification, e.g., 83.1% zero-shot accuracy@1 on ImageNet for an H/14 model.
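
The mixed training scheme mentioned above draws each training caption from either the original AltText or its VeCap rewrite. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the function name `sample_caption` and the mixing ratio `p_vecap` are assumptions introduced here for clarity.

```python
import random

def sample_caption(alt_text: str, vecap: str, p_vecap: float = 0.5) -> str:
    """Pick either the raw AltText or the LLM-rewritten VeCap caption
    for a given image, so both caption sources are seen during training.
    The 50/50 mixing ratio is an assumed default, not a value from the paper."""
    return vecap if random.random() < p_vecap else alt_text

# Hypothetical usage inside a data-loading loop: each image record keeps its
# original AltText and its visually-enriched rewrite.
example = {
    "alt_text": "photo from our summer trip",
    "vecap": "A golden retriever running along a sandy beach at sunset",
}
caption = sample_caption(example["alt_text"], example["vecap"], p_vecap=0.5)
```

Sampling per example (rather than training on only one caption source) is what preserves the diversity of the raw AltTexts while still benefiting from the enriched visual concepts in VeCap.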