Contrastive Language-Image Pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, fast, and robust to distribution shifts in image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? To this end, we leverage open-source task-specific vision models to generate pseudo-labels for a noisy, uncurated image-text dataset. We then train CLIP models on these pseudo-labels in addition to the contrastive training on image-text pairs. This simple setup yields substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these improvements are achieved without compromising CLIP's existing capabilities, including its proficiency in zero-shot classification.
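
To make the training setup concrete, the sketch below combines the standard CLIP contrastive objective with auxiliary losses on pseudo-labels produced offline by frozen task-specific experts. It is a minimal illustration under stated assumptions, not the paper's exact architecture: the head designs, the task mix (here segmentation and depth only), and names such as `CLIPWithPseudoTasks`, `seg_head`, and `lambda_aux` are hypothetical.

```python
# A minimal sketch (PyTorch), assuming an image encoder that returns both a
# global embedding and a dense feature map, and a text encoder that returns
# an embedding. Head shapes and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over an image-text batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class CLIPWithPseudoTasks(nn.Module):
    """CLIP-style dual encoder plus auxiliary heads supervised by pseudo-labels."""

    def __init__(self, image_encoder, text_encoder, embed_dim, num_seg_classes):
        super().__init__()
        self.image_encoder = image_encoder   # returns (embedding, feature map)
        self.text_encoder = text_encoder     # returns embedding
        # Hypothetical lightweight heads that decode image features into
        # expert-style predictions (segmentation logits, depth maps, ...).
        self.seg_head = nn.Conv2d(embed_dim, num_seg_classes, kernel_size=1)
        self.depth_head = nn.Conv2d(embed_dim, 1, kernel_size=1)

    def forward(self, images, texts, pseudo_seg, pseudo_depth, lambda_aux=1.0):
        img_emb, feat = self.image_encoder(images)  # feat: (B, C, H, W)
        txt_emb = self.text_encoder(texts)

        # 1) Standard image-text contrastive objective.
        loss_clip = clip_contrastive_loss(img_emb, txt_emb)

        # 2) Auxiliary objectives on expert pseudo-labels; the frozen experts
        #    are run offline, so only their outputs appear here as targets.
        loss_seg = F.cross_entropy(self.seg_head(feat), pseudo_seg)
        loss_depth = F.l1_loss(self.depth_head(feat), pseudo_depth)

        return loss_clip + lambda_aux * (loss_seg + loss_depth)
```

In this reading, the contrastive term preserves CLIP's zero-shot classification ability while the pseudo-label terms inject localization-sensitive supervision; how the auxiliary losses are weighted against the contrastive loss is an assumption of the sketch, not a reported detail.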