*Equal contribution
Current machine learning models for vision are typically highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, suggesting the possibility of similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective on a wide range of input/output modalities, including text, images, geometric and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities, mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of those tokens.
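To make the training recipe concrete, the sketch below illustrates one such masked-modeling step under stated assumptions: each modality is assumed to already be tokenized into a shared discrete vocabulary, and the vocabulary size, sequence length, token budgets, and the position-embedding stand-in for position/modality information are illustrative choices, not the authors' implementation. An encoder sees a small random subset of tokens, and a decoder predicts a second, disjoint random subset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of one 4M-style multimodal masked-modeling step.
# All hyperparameters and helper tensors here are illustrative assumptions.
VOCAB, D_MODEL, SEQ_LEN = 1024, 256, 128   # shared discrete vocab, model width
N_INPUT, N_TARGET = 32, 16                 # small random input/target budgets
B = 4                                      # batch size

embed = nn.Embedding(VOCAB, D_MODEL)       # token embedding (shared token space)
pos = nn.Embedding(SEQ_LEN, D_MODEL)       # stand-in for position/modality info
model = nn.Transformer(d_model=D_MODEL, nhead=8, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
to_logits = nn.Linear(D_MODEL, VOCAB)

# Pretend these came from modality-specific tokenizers (image, text, depth, ...)
# already mapped into one discrete token space with known positions.
all_tokens = torch.randint(0, VOCAB, (B, SEQ_LEN))

# Sample small, disjoint random subsets of positions as inputs and targets.
perm = torch.rand(B, SEQ_LEN).argsort(dim=1)
in_pos, tgt_pos = perm[:, :N_INPUT], perm[:, N_INPUT:N_INPUT + N_TARGET]
in_tok = all_tokens.gather(1, in_pos)
tgt_tok = all_tokens.gather(1, tgt_pos)

# The encoder sees only the visible tokens; the decoder receives position
# queries for the masked-out targets and must predict their token ids.
memory = model.encoder(embed(in_tok) + pos(in_pos))
decoded = model.decoder(pos(tgt_pos), memory)
loss = F.cross_entropy(to_logits(decoded).flatten(0, 1), tgt_tok.flatten())
loss.backward()
```

Because only small input and target subsets are sampled per step, the per-step cost stays roughly constant as more modalities are added, which is what makes the scheme scalable.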
4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box, (2) they excel when fine-tuned on unseen downstream tasks or new input modalities, and (3) they can function as generative models that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility.
Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.