Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs), connecting image inputs to language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled, noisy text annotations at the image level. However, such a criterion may be insufficient for downstream tasks that need fine-grained vision representations, especially when region-level understanding is demanded of MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and accompanying modules. We formulate a new concept, promptable embeddings, whereby the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually enriched and spatially localized captioning framework to effectively generate region-text pseudo-labels at scale. Scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image-region recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to improve MLLMs, especially on referring and grounding tasks.
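For intuition only, the sketch below gives one possible reading of the abstract in PyTorch-style code: patch-level image embeddings are pooled under a box prompt into a region embedding, and a symmetric region-text contrastive loss aligns region embeddings with embeddings of region captions. All function names, the box-pooling step, and the loss form here are illustrative assumptions, not the paper's actual prompting module or training objective.

import torch
import torch.nn.functional as F


def region_embed(image_tokens, boxes, grid_size):
    # image_tokens: (B, H*W, D) patch embeddings from the vision encoder
    # boxes: (B, 4) normalized [x1, y1, x2, y2] box prompts, one region per image
    # grid_size: (H, W) layout of the patch grid
    B, N, D = image_tokens.shape
    H, W = grid_size
    device = image_tokens.device
    ys = (torch.arange(H, device=device) + 0.5) / H  # patch-center y coordinates
    xs = (torch.arange(W, device=device) + 0.5) / W  # patch-center x coordinates
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    centers = torch.stack([xx.flatten(), yy.flatten()], dim=-1)  # (N, 2) as (x, y)

    pooled = []
    for b in range(B):
        x1, y1, x2, y2 = boxes[b]
        inside = (centers[:, 0] >= x1) & (centers[:, 0] <= x2) \
               & (centers[:, 1] >= y1) & (centers[:, 1] <= y2)
        feats = image_tokens[b][inside]
        # fall back to the global average if the box covers no patch centers
        pooled.append(feats.mean(0) if feats.shape[0] > 0 else image_tokens[b].mean(0))
    return torch.stack(pooled)  # (B, D) region embeddings


def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    # symmetric InfoNCE between region embeddings and region-caption embeddings
    region_embs = F.normalize(region_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = region_embs @ text_embs.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

The masked average pooling above is only a stand-in for the lightweight prompting module described in the method section; the key point is that region representations are derived from the same image embeddings given spatial hints, rather than re-encoding each crop.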