Large pre-trained vision-language models, such as CLIP, have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or in fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pre-training. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks even when limited annotated data are available. In this paper, we improve prompt learning by distilling the textual knowledge from natural language prompts (either human- or LLM-generated) to provide rich prior knowledge for those under-represented concepts. We first obtain a prompt "summary" aligned to each input image via a learned prompt aggregator. We then jointly train a prompt generator, optimized to produce a prompt embedding that stays close to the aggregated summary while minimizing the task loss at the same time. We dub this prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to generalize to different downstream data distributions and tasks, including vision-language understanding tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning), where AAPE achieves competitive performance. We also show that AAPE is particularly helpful for handling non-canonical and OOD examples. Furthermore, AAPE learning eliminates the LLM-based inference cost required by baselines, and scales better with data and LLM model size.
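To make the two-step training procedure described above concrete, the sketch below outlines one plausible realization in PyTorch: an attention-based aggregator that summarizes natural-language prompt embeddings conditioned on the image, and a generator trained with a distillation term toward that summary plus a downstream task loss. This is not the authors' code; all class names, the attention-based aggregation, and the cosine-distance distillation loss are illustrative assumptions, with CLIP encoders assumed frozen.

```python
# Minimal sketch of the aggregate-and-adapt idea; names and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptAggregator(nn.Module):
    """Attend over CLIP text embeddings of K natural-language prompts,
    conditioned on the image, to form an image-aligned prompt 'summary'."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, image_emb: torch.Tensor, prompt_embs: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, D) frozen CLIP image features
        # prompt_embs: (B, K, D) frozen CLIP text features of K prompts (human- or LLM-written)
        query = image_emb.unsqueeze(1)                       # (B, 1, D)
        summary, _ = self.attn(query, prompt_embs, prompt_embs)
        return summary.squeeze(1)                            # (B, D) aggregated summary


class PromptGenerator(nn.Module):
    """Map image features to a prompt embedding (the AAPE), trained to stay
    close to the aggregated summary while minimizing the task loss."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_emb)                           # (B, D) generated prompt embedding


def joint_loss(generated, summary, logits, labels, distill_weight: float = 1.0):
    # Distillation term pulls the generated embedding toward the aggregated summary;
    # the cosine form and the weighting are assumptions for illustration only.
    distill = 1.0 - F.cosine_similarity(generated, summary.detach(), dim=-1).mean()
    task = F.cross_entropy(logits, labels)                   # e.g., few-shot classification
    return task + distill_weight * distill
```

At inference time only the generator (and frozen CLIP) would be needed, which is consistent with the claim that AAPE avoids the per-query LLM inference cost incurred by prompt-ensembling baselines.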