A Vision Language Model (VLM) is an advanced artificial intelligence system that combines natural language understanding with image recognition capabilities. Models such as OpenAI's CLIP and Google's PaLI can understand textual descriptions and interpret images, enabling diverse applications in fields such as computer vision, content generation, and human-computer interaction. They have demonstrated impressive capabilities in understanding and generating text in the context of visual content, making them a critical technology in the AI landscape.
Researchers from Google Research, Google DeepMind, and Google Cloud contrast Vision Transformer (ViT) models pretrained with classification targets against those pretrained with contrastive targets, finding that the contrastively pretrained models, in particular the SigLIP-based PaLI, perform better on multimodal tasks, especially localization and visually situated text understanding. The researchers scaled the SigLIP image encoder up to 2 billion parameters, achieving a new state of the art in multilingual multimodal retrieval. Their study argues for pre-training visual encoders on web-scale image-text data instead of classification-style data, in contrast to the benefits of scaling up classification-pretrained image encoders that PaLI-X demonstrated for large vision-language models.
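For context, SigLIP replaces the softmax-based contrastive objective used by CLIP with a pairwise sigmoid loss: each image-text pair is scored independently as matching or non-matching. The snippet below is a minimal, illustrative PyTorch sketch of that idea; the tensor names, the temperature and bias initialization, and the random embeddings standing in for real encoders are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, log_t, b):
    """Pairwise sigmoid contrastive loss in the spirit of SigLIP (sketch).

    img_emb, txt_emb: (N, D) L2-normalised embeddings from separate image
    and text towers; log_t (log-temperature) and b (bias) are learnable scalars.
    """
    logits = img_emb @ txt_emb.T * log_t.exp() + b                  # (N, N) pairwise logits
    labels = 2 * torch.eye(len(logits), device=logits.device) - 1   # +1 on matching pairs, -1 elsewhere
    # Each image-text pair is treated as an independent binary decision
    # (match / no match), unlike the softmax over all pairs used by CLIP.
    return -F.logsigmoid(labels * logits).sum() / len(logits)

# Illustrative usage with random embeddings standing in for real encoders.
img_emb = F.normalize(torch.randn(8, 512), dim=-1)
txt_emb = F.normalize(torch.randn(8, 512), dim=-1)
log_t = torch.tensor(2.3, requires_grad=True)    # log-temperature; learnable in practice
b = torch.tensor(-10.0, requires_grad=True)      # bias; learnable in practice
print(siglip_loss(img_emb, txt_emb, log_t, b))
```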
Their study delves into scaling up VLMs while underscoring the importance of smaller-scale models for practicality and efficient research. They present PaLI-3, a 5-billion-parameter VLM with competitive results. The PaLI-3 training recipe involves contrastive pre-training of the image encoder on web-scale data, an improved mixture of datasets, and training at higher resolutions. A 2-billion-parameter multilingual contrastive vision model is also presented. Ablation studies confirm the superiority of contrastively pretrained models, especially on tasks involving localization and comprehension of visually situated text.
Their approach employs a pre-trained ViT model as the image encoder, specifically ViT-G/14 trained with the SigLIP recipe. ViT-G/14 has about 2 billion parameters and serves as the vision backbone of PaLI-3. Contrastive pretraining embeds images and text separately and learns to classify whether a given image and text correspond. Visual tokens from the ViT output are projected and combined with text tokens. These inputs are then processed by a 3-billion-parameter UL2 encoder-decoder language model that generates text, typically driven by task-specific prompts such as VQA questions.
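To make that data flow concrete, here is a minimal sketch of how such a pipeline wires together: the ViT produces visual tokens, a linear projection maps them into the language model's embedding space, and the projected tokens are concatenated with the prompt tokens before the encoder-decoder generates text. The class, dimensions, and toy stand-ins below are assumptions for illustration, not the released PaLI-3 implementation.

```python
import torch
import torch.nn as nn

class PaLIStyleVLM(nn.Module):
    """Sketch of a PaLI-style pipeline: image encoder -> projection -> enc-dec LM."""

    def __init__(self, image_encoder, language_model, vit_dim=1536, lm_dim=2048):
        super().__init__()
        self.image_encoder = image_encoder      # e.g. a SigLIP-pretrained ViT
        self.proj = nn.Linear(vit_dim, lm_dim)  # map visual tokens into the LM embedding space
        self.language_model = language_model    # encoder-decoder LM (UL2 in PaLI-3)

    def forward(self, pixels, prompt_embeds, decoder_input_ids):
        vis_tokens = self.image_encoder(pixels)  # (B, num_patches, vit_dim)
        vis_embeds = self.proj(vis_tokens)       # (B, num_patches, lm_dim)
        # Projected visual tokens are concatenated with the embedded task prompt
        # (e.g. a VQA question) and handed to the encoder-decoder for generation.
        encoder_inputs = torch.cat([vis_embeds, prompt_embeds], dim=1)
        return self.language_model(encoder_inputs, decoder_input_ids)

# Toy stand-ins so the sketch runs end to end; real encoders would replace these.
toy_vit = lambda px: torch.randn(px.shape[0], 256, 1536)                  # 256 visual tokens per image
toy_lm = lambda enc, dec: torch.randn(enc.shape[0], dec.shape[1], 32000)  # decoder vocabulary logits
model = PaLIStyleVLM(toy_vit, toy_lm)
logits = model(torch.randn(2, 3, 224, 224),           # images
               torch.randn(2, 16, 2048),              # embedded prompt tokens
               torch.zeros(2, 8, dtype=torch.long))   # decoder input ids
print(logits.shape)  # torch.Size([2, 8, 32000])
```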
PaLI-3 excels compared to its larger counterparts, particularly in localization and comprehension of visually situated text. The SigLIP-based PaLI model, with contrastive image encoder pre-training, establishes a new state of the art in multilingual multimodal retrieval. The full PaLI-3 model outperforms the state of the art on referring expression segmentation and maintains low error rates across all subgroups in detection tasks. Contrastive pretraining proves more effective for localization tasks. PaLI-3's ViT-G image encoder also excels across multiple cross-modal classification and retrieval tasks.
In conclusion, their research emphasizes the benefits of contrastive pretraining, exemplified by the SigLIP approach, for building better and more efficient VLMs. The smaller, 5-billion-parameter SigLIP-based PaLI-3 excels at localization and text understanding, outperforming larger counterparts on various multimodal benchmarks. The contrastive pre-training of the image encoder in PaLI-3 also yields a new state of the art in multilingual multimodal retrieval. Their study highlights the need for further research into aspects of VLM training beyond image encoder pre-training to improve model performance further.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to join our 31k+ ML SubReddit, Facebook community of more than 40,000 people, Discord channel, and email newsletter, where we share the latest AI research news, interesting AI projects, and more.
If you like our work, you’ll love our newsletter.
We are also on WhatsApp. Join our AI channel on WhatsApp.
Sana Hassan, a consulting intern at Marktechpost and a dual degree student at IIT Madras, is passionate about applying technology and artificial intelligence to address real-world challenges. With a strong interest in solving practical problems, she brings a new perspective to the intersection of AI and real-life solutions.