This article has been accepted to the UniReps Workshop at NeurIPS 2023.
Contrastive language-image pretraining (CLIP) has become the standard approach for training vision-language models. Despite the usefulness of CLIP visual features as global image representations, they have limitations on tasks involving object localization, pixel-level image understanding, or 3D perception. Multitask training is a popular solution to this drawback, but collecting a large-scale annotated multitask dataset incurs significant cost. Moreover, training on separate task-specific datasets is also challenging from an optimization standpoint, because gradients and knowledge derived from different input distributions and tasks must be aligned. To overcome these shortcomings, we study pseudo-labeling with task-specific experts to improve CLIP features for more challenging downstream tasks. In our approach, we leverage multiple existing open-source pre-trained models as experts and use them to pseudo-label an uncurated, web-scale image-caption dataset. We then train CLIP with the contrastive loss together with task-specific losses on the pseudo-labels, computed via lightweight heads attached to the vision backbone.
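To make the training objective concrete, below is a minimal PyTorch-style sketch of combining a CLIP contrastive loss with task losses computed on expert pseudo-labels through lightweight heads attached to the vision backbone. This is not the paper's implementation: the module names, head shapes, the choice of example tasks (depth and segmentation), and the loss weights are all illustrative assumptions.

```python
# Hypothetical sketch: CLIP contrastive loss plus task-specific losses on
# expert-generated pseudo-labels, via lightweight heads on the vision features.
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class CLIPWithExpertHeads(nn.Module):
    """CLIP-style image/text encoders plus lightweight task heads.

    For brevity the heads here make per-image predictions from the pooled
    vision feature; real heads for dense tasks would consume spatial features.
    """

    def __init__(self, vision_encoder, text_encoder, feat_dim, num_seg_classes=21):
        super().__init__()
        self.vision_encoder = vision_encoder  # images  -> (B, feat_dim)
        self.text_encoder = text_encoder      # captions -> (B, feat_dim)
        self.depth_head = nn.Linear(feat_dim, 1)              # placeholder head
        self.seg_head = nn.Linear(feat_dim, num_seg_classes)  # placeholder head

    def forward(self, images, texts):
        v = self.vision_encoder(images)
        t = self.text_encoder(texts)
        return v, t, self.depth_head(v), self.seg_head(v)


def training_step(model, batch, w_depth=0.1, w_seg=0.1):
    """One step: contrastive loss on (image, caption) pairs plus task losses
    against pseudo-labels produced offline by task-specific expert models."""
    v, t, depth_pred, seg_logits = model(batch["images"], batch["texts"])
    loss = clip_contrastive_loss(v, t)
    loss = loss + w_depth * F.l1_loss(depth_pred, batch["pseudo_depth"])
    loss = loss + w_seg * F.cross_entropy(seg_logits, batch["pseudo_seg"])
    return loss
```

In this sketch the pseudo-labels (`pseudo_depth`, `pseudo_seg`) are assumed to have been generated ahead of time by the expert models on the web-scale captioned images, so each batch carries both the caption supervision and the task supervision.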