A dominant paradigm in large multimodal models is to pair a large language decoder with a vision encoder. While it is well understood how to pre-train and tune language decoders for multimodal tasks, it is less clear how the vision encoder should be pre-trained. The de facto standard is to pre-train the vision encoder with a discriminative objective, such as a contrastive loss, which creates a mismatch between the pre-training and the downstream autoregressive generative task. At the same time, following their success in the language domain, autoregressive image models have been shown to pre-train powerful and scalable vision encoders. This paper presents AIMv2, a family of large, powerful vision encoders pre-trained with a multimodal autoregressive objective, in which a multimodal decoder generates raw image patches and text tokens. Our models excel not only in multimodal tasks but also in visual recognition tasks such as localization, grounding, and classification. Furthermore, we show that AIMv2 models are efficient to train, outperforming the current state of the art while seeing significantly fewer samples during pre-training.
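To make the objective concrete, the sketch below (not the authors' code) illustrates one way a multimodal autoregressive pre-training loss can be set up: a vision encoder produces patch features, and a causal multimodal decoder regresses the next raw image patch and predicts the next text token. All module names, sizes, and the use of a plain L2 patch loss are illustrative assumptions.

```python
# Illustrative sketch of multimodal autoregressive pre-training:
# vision encoder -> causal multimodal decoder with a pixel-regression head
# (raw patches) and a token-classification head (text).

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionEncoder(nn.Module):
    """Stand-in ViT-style encoder: project flattened patches, apply Transformer blocks."""
    def __init__(self, patch_dim=768, dim=512, depth=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                    # (B, N_img, patch_dim)
        return self.blocks(self.proj(patches))     # (B, N_img, dim)

class MultimodalDecoder(nn.Module):
    """Causal decoder over [image features ; text embeddings] with two heads."""
    def __init__(self, dim=512, vocab=32000, patch_dim=768, depth=4, heads=8):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.pixel_head = nn.Linear(dim, patch_dim)  # predicts the next raw patch
        self.text_head = nn.Linear(dim, vocab)       # predicts the next text token

    def forward(self, img_feats, text_ids):
        txt = self.txt_emb(text_ids)                 # (B, N_txt, dim)
        seq = torch.cat([img_feats, txt], dim=1)     # image tokens first, then text
        L = seq.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(seq, mask=causal)
        n_img = img_feats.size(1)
        return self.pixel_head(h[:, :n_img]), self.text_head(h[:, n_img:])

def pretraining_loss(encoder, decoder, patches, text_ids):
    img_feats = encoder(patches)
    patch_pred, text_logits = decoder(img_feats, text_ids)
    # Next-patch regression: position i predicts raw patch i+1 (L2 as a stand-in).
    img_loss = F.mse_loss(patch_pred[:, :-1], patches[:, 1:])
    # Next-token prediction on the text portion of the sequence.
    txt_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
        text_ids[:, 1:].reshape(-1),
    )
    return img_loss + txt_loss

if __name__ == "__main__":
    enc, dec = VisionEncoder(), MultimodalDecoder()
    patches = torch.randn(2, 16, 768)                # 2 images, 16 flattened patches each
    text = torch.randint(0, 32000, (2, 12))          # 2 captions, 12 tokens each
    print(pretraining_loss(enc, dec, patches, text).item())
```

After pre-training, the decoder can be discarded and the vision encoder reused as a standalone backbone for recognition or multimodal tasks.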
Model weights are available on Hugging Face.
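A minimal loading example is sketched below, assuming an AIMv2 checkpoint published on the Hugging Face Hub that is loadable through the standard `transformers` Auto classes; the repository identifier and the `last_hidden_state` output field are assumptions to verify against the actual model card.

```python
# Hypothetical usage: extract patch-level features from a pre-trained AIMv2 encoder.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

repo_id = "apple/aimv2-large-patch14-224"  # assumed repo id; check the Hub for exact names
processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state  # assumed field: per-patch image features
print(features.shape)
```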