This article was accepted at the UniReps Workshop at NeurIPS 2023.
The landscape of publicly available vision foundation models (VFMs), such as CLIP and the Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For example, CLIP excels at semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we present a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates multi-task learning, continual learning, and distillation techniques. It also requires significantly less computational cost than traditional multi-task training from scratch, and only a small fraction of the pre-training datasets originally used to train the individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP in a single vision transformer. Compared to deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs at inference, making it well suited for edge-device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, particularly in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models specifically designed for this task by a large margin, including an average IoU improvement of +6.8% and +5.9% on the Pascal-VOC and COCO-Stuff datasets, respectively.
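To make the merging recipe concrete, the following is a minimal sketch of how two frozen teacher VFMs could be distilled into a single student backbone with task-specific heads, trained on small replay subsets of each teacher's pre-training data (a continual-learning-style rehearsal). All module names, loss choices, and weights here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch: merging two frozen teacher encoders (SAM-like and
# CLIP-like) into one student via multi-task distillation. Module names,
# losses, and loss weights are assumptions, not the authors' exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MergedVFM(nn.Module):
    def __init__(self, backbone: nn.Module, sam_head: nn.Module, clip_head: nn.Module):
        super().__init__()
        self.backbone = backbone    # shared ViT, e.g. initialized from one teacher's image encoder
        self.sam_head = sam_head    # lightweight head reproducing SAM-style spatial features
        self.clip_head = clip_head  # lightweight head reproducing CLIP-style image embeddings

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return self.sam_head(feats), self.clip_head(feats)


def distillation_step(student: MergedVFM,
                      sam_teacher: nn.Module,
                      clip_teacher: nn.Module,
                      sam_batch: torch.Tensor,
                      clip_batch: torch.Tensor,
                      w_sam: float = 1.0,
                      w_clip: float = 1.0) -> torch.Tensor:
    """One multi-task distillation step against two frozen teachers."""
    sam_pred, _ = student(sam_batch)
    _, clip_pred = student(clip_batch)

    with torch.no_grad():
        sam_target = sam_teacher(sam_batch)     # frozen SAM-like encoder output
        clip_target = clip_teacher(clip_batch)  # frozen CLIP-like image embedding

    # Regress spatial features; align image embeddings by cosine similarity.
    loss_sam = F.mse_loss(sam_pred, sam_target)
    loss_clip = 1.0 - F.cosine_similarity(clip_pred, clip_target, dim=-1).mean()
    return w_sam * loss_sam + w_clip * loss_clip
```

The design intent is that the SAM-side distillation term preserves the original segmentation capability (the continual-learning aspect), while the CLIP-side term transfers semantic understanding onto the shared backbone (the multi-task distillation aspect).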