As the collection of publicly available pretrained vision foundation models (VFMs) such as CLIP, DINOv2, and SAM grows, users face challenges in storage, memory, and computational cost when deploying multiple models simultaneously. To address these concerns, we present an approach that merges the capabilities of multiple VFMs into a single efficient multi-task model. Our method, called “co-distillation,” integrates teacher-student learning with self-distillation, operates only on unlabeled image data, and substantially reduces computational requirements compared to traditional multi-task training. In a practical demonstration on the fusion of CLIP and SAM, we show that the resulting merged model, SAM-CLIP, not only retains the core strengths of both foundation models but also exhibits synergistic capabilities such as text-driven zero-shot segmentation. Given the increasing availability of VFMs, our methodology offers significant value for efficient model deployment and operation.
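To make the co-distillation idea more concrete, below is a minimal PyTorch-style sketch of one plausible formulation: a shared student with per-teacher heads is pulled toward features produced by frozen CLIP and SAM encoders on the same unlabeled images. The function name, loss choices, and weights here are illustrative assumptions for exposition, not the exact recipe used to train SAM-CLIP.

```python
import torch
import torch.nn.functional as F

def co_distillation_loss(student_clip_feats: torch.Tensor,
                         student_sam_feats: torch.Tensor,
                         clip_teacher_feats: torch.Tensor,
                         sam_teacher_feats: torch.Tensor,
                         w_clip: float = 1.0,
                         w_sam: float = 1.0) -> torch.Tensor:
    """Combine per-teacher distillation losses for a single shared student.

    All inputs are batches of feature embeddings computed on the same
    unlabeled images; the teacher features come from frozen CLIP and SAM
    encoders, while the student features come from two lightweight heads
    on top of one shared backbone.
    """
    # Pull the CLIP head toward the CLIP teacher (cosine-style distillation).
    loss_clip = 1.0 - F.cosine_similarity(
        student_clip_feats, clip_teacher_feats, dim=-1).mean()
    # Pull the SAM head toward the SAM teacher (regression-style distillation).
    loss_sam = F.mse_loss(student_sam_feats, sam_teacher_feats)
    # Weighted sum over teachers; no labels are needed anywhere.
    return w_clip * loss_clip + w_sam * loss_sam
```

Because both loss terms depend only on teacher outputs computed from raw images, the merged model can be trained on unlabeled data, which is what allows the approach to avoid the labeling and compute cost of conventional multi-task training.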