Large pre-trained vision-language models, such as CLIP, have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or in fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pre-training. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks even when limited annotated data are available. In this paper, we improve prompt learning by distilling the textual knowledge from natural language prompts (either human- or LLM-generated) to provide rich prior knowledge for those under-represented concepts. We first obtain a prompt "summary" aligned to each input image via a learned prompt aggregator. We then jointly train a prompt generator, optimized to produce a prompt embedding that stays close to the aggregated summary while minimizing the task loss at the same time. We dub this prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to generalize to different downstream data distributions and tasks, including vision-language understanding tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning), where AAPE achieves competitive performance. We also show that AAPE is particularly helpful for handling non-canonical and OOD examples. Furthermore, AAPE learning eliminates the LLM-based inference cost required by baselines, and scales better with data and LLM model size.
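To make the two-step training procedure described above concrete, the sketch below outlines one plausible realization in PyTorch: an attention-based aggregator that summarizes natural-language prompt embeddings conditioned on the image, and a generator trained with a distillation term toward that summary plus a downstream task loss. This is not the authors' code; all class names, the attention-based aggregation, and the cosine-distance distillation loss are illustrative assumptions, with CLIP encoders assumed frozen.

```python
# Minimal sketch of the aggregate-and-adapt idea; names and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptAggregator(nn.Module):
    """Attend over CLIP text embeddings of K natural-language prompts,
    conditioned on the image, to form an image-aligned prompt 'summary'."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, image_emb: torch.Tensor, prompt_embs: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, D) frozen CLIP image features
        # prompt_embs: (B, K, D) frozen CLIP text features of K prompts (human- or LLM-written)
        query = image_emb.unsqueeze(1)                       # (B, 1, D)
        summary, _ = self.attn(query, prompt_embs, prompt_embs)
        return summary.squeeze(1)                            # (B, D) aggregated summary


class PromptGenerator(nn.Module):
    """Map image features to a prompt embedding (the AAPE), trained to stay
    close to the aggregated summary while minimizing the task loss."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_emb)                           # (B, D) generated prompt embedding


def joint_loss(generated, summary, logits, labels, distill_weight: float = 1.0):
    # Distillation term pulls the generated embedding toward the aggregated summary;
    # the cosine form and the weighting are assumptions for illustration only.
    distill = 1.0 - F.cosine_similarity(generated, summary.detach(), dim=-1).mean()
    task = F.cross_entropy(logits, labels)                   # e.g., few-shot classification
    return task + distill_weight * distill
```

At inference time only the generator (and frozen CLIP) would be needed, which is consistent with the claim that AAPE avoids the per-query LLM inference cost incurred by prompt-ensembling baselines.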