Aided by large-scale datasets, convolutional neural networks and transformers have achieved remarkable success in various vision tasks. In contrast, few-shot learning, where a network must learn from only a handful of annotated images, has become a research hotspot for scenarios with scarce data and limited resources. Numerous previous publications have proposed meta-learning, metric learning, and data augmentation to improve the generalizability of a model. Recent results demonstrate strong zero-shot transferability for open-vocabulary visual recognition using CLIP, which is pre-trained on large-scale image-language pairs.
It has been further extended to few-shot classification by CoOp, CLIP-Adapter, and Tip-Adapter, which achieve improved performance on several downstream datasets. This shows that the network has strong representation capabilities even when the few-shot training data is limited, which greatly benefits few-shot learning in downstream domains. With the emergence of other self-supervised models besides CLIP, a natural question arises: can these models collaborate and adaptively integrate their prior knowledge to become better few-shot learners? Chinese researchers propose CaFo, a Cascade of Foundation models, to address this problem by combining knowledge from various pre-training paradigms in a "Prompt, Generate, then Cache" pipeline.
They combine CLIP, DINO, DALL-E, and GPT-3 to equip CaFo with four forms of prior knowledge, as shown in Figure 1. CLIP is pre-trained to produce paired features for each image and its corresponding descriptive text in the embedding space. Armed with language-contrastive knowledge of texts carrying different category semantics, CLIP can successfully classify images. DINO uses contrastive self-supervised learning to match representations between two transformations of the same image, making it an expert at distinguishing between different images with vision-contrastive knowledge. DALL-E is pre-trained on image-text pairs, just like CLIP, except that it learns to predict encoded image tokens conditioned on the given text tokens. From a supplied text, DALL-E can use its vision-generative knowledge to synthesize high-quality images in a zero-shot manner.
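To make CLIP's language-contrastive classification concrete, here is a minimal zero-shot classification sketch using OpenAI's open-source `clip` package; the class names, prompt template, and image path are illustrative, not taken from the paper.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load one of the released CLIP checkpoints.
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical category names; in practice they come from the target dataset.
class_names = ["dog", "cat", "airplane"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```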
Given a few handwritten templates as input, GPT-3, trained on a large-scale language corpus, automatically produces human-like sentences rich in language-generative knowledge. The four models therefore have different pre-training objectives and can offer complementary knowledge to aid few-shot visual recognition. They are cascaded in three steps, specifically:
1) Prompt: Based on a few handwritten templates, they use GPT-3 to generate textual prompts for CLIP. These prompts, carrying richer language knowledge, are fed into CLIP's textual encoder (a minimal sketch of this step follows the list).
2) Generate: They use DALL-E to produce additional training images for different categories based on domain-specific texts, which expands the few-shot training data without any extra cost for collection and annotation.
3) Cache: They use a cache model to adaptively incorporate the predictions from CLIP and DINO. Following Tip-Adapter, they build the cache model with two kinds of keys derived from the two pre-trained models. Using zero-shot CLIP as the distribution baseline, they adaptively ensemble the predictions of the two cached keys as the output (sketched in the second code block below). By tuning the lightweight cache model on the expanded training data, CaFo learns to fuse prior knowledge and exploit its complementary characteristics for better few-shot visual recognition.
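As a rough illustration of the Prompt step, the following sketch queries GPT-3 through the legacy OpenAI completion API to expand a handwritten template into a richer sentence; the seed template and model name are assumptions, not taken from the paper.

```python
import openai  # assumes openai<1.0 and an API key set via OPENAI_API_KEY

# Hypothetical handwritten seed template for one category.
seed = "Describe what a photo of a golden retriever looks like:"

resp = openai.Completion.create(
    model="text-davinci-003",  # a GPT-3-family model
    prompt=seed,
    max_tokens=50,
    temperature=0.7,
)
# The generated sentence becomes a richer textual prompt for CLIP's encoder.
clip_prompt = resp.choices[0].text.strip()
print(clip_prompt)
```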
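And here is a simplified sketch of the Cache step, assuming features have already been extracted and L2-normalized: a Tip-Adapter-style cache lookup for each backbone, plus an adaptive ensemble that weights each cached prediction by its agreement with the zero-shot CLIP distribution. The weighting scheme is a plausible simplification; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def cache_logits(query, keys, values, beta=5.5):
    """Tip-Adapter-style cache lookup.

    query:  (B, D) L2-normalized test features
    keys:   (N, D) L2-normalized training features (cache keys)
    values: (N, C) one-hot labels of the cached training images
    """
    affinity = query @ keys.T                      # (B, N) cosine similarities
    weights = torch.exp(-beta * (1.0 - affinity))  # sharpen the affinities
    return weights @ values                        # (B, C) class logits

def adaptive_ensemble(clip_logits, dino_logits, zs_clip_logits):
    """Fuse the two cached predictions, using zero-shot CLIP as the
    distribution baseline (a simplified stand-in for the paper's scheme)."""
    sim_clip = F.cosine_similarity(clip_logits, zs_clip_logits, dim=-1)
    sim_dino = F.cosine_similarity(dino_logits, zs_clip_logits, dim=-1)
    w = torch.softmax(torch.stack([sim_clip, sim_dino], dim=-1), dim=-1)
    fused = w[..., :1] * clip_logits + w[..., 1:] * dino_logits
    return fused + zs_clip_logits  # residual blend with the zero-shot baseline
```

In a full pipeline, `keys` would hold the CLIP or DINO features of both the few-shot images and the DALL-E-generated images, and `values` their one-hot labels, so tuning the cache amounts to training only this lightweight lookup.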
Their main contributions are summarized below:
• To enhance few-shot learning, they propose CaFo to incorporate prior knowledge from diverse pre-training paradigms.
• They conduct thorough experiments on 11 datasets for few-shot classification, where CaFo achieves state-of-the-art performance without using extra annotated data.
• They collaborate CLIP, DINO, GPT-3, and DALL-E to leverage richer semantic cues, enrich the limited few-shot training data, and adaptively ensemble diverse predictions via the cache model.