Large multimodal foundation models such as CLIP, Flamingo, and Stable Diffusion have driven a paradigm shift in multimodal learning, enabling striking gains in image generation and zero-shot generalization. These reference models are typically trained once on large, static, web-scale datasets. It remains unclear whether legacy models, such as OpenAI's CLIP models trained on internet-scale data collected up to 2020, continue to perform well on data gathered after their training cutoff.
To start, researchers from Apple and Carnegie Mellon University examine how OpenAI's CLIP models compare, in terms of robustness, with models from the OpenCLIP repository that were trained on more recently curated web datasets containing data through 2022. Because there is no standard benchmark for measuring CLIP models over time, they compile a set of time-dependent classification and retrieval tasks covering 2014-2022. The team finds that while the OpenCLIP models maintain their performance, the OpenAI models show a substantial drop in retrieval performance on 2021-2022 data relative to 2014-2016 data. This gap is not fully captured by standard evaluations such as accuracy under ImageNet distribution shifts, on which OpenAI's CLIP models are in fact slightly more robust than the OpenCLIP models.
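To make that kind of comparison concrete, here is a minimal sketch of a year-bucketed retrieval evaluation, using the open_clip library to load an OpenAI-pretrained CLIP model. The `load_pairs` loader and the recall@1 metric are simplified placeholders, not the authors' actual benchmark protocol.

```python
import torch
import open_clip
from PIL import Image

# Load an OpenAI-pretrained CLIP model via open_clip.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def load_pairs(year):
    """Hypothetical placeholder: return (image, caption) pairs crawled in `year`.
    Replace with a real year-bucketed loader; here we fabricate two dummy pairs."""
    return [(Image.new("RGB", (224, 224), color=c), f"a {name} photo from {year}")
            for c, name in [((255, 0, 0), "red"), ((0, 0, 255), "blue")]]

@torch.no_grad()
def text_to_image_recall_at_1(pairs):
    """Fraction of captions whose top-1 retrieved image is the matching one."""
    images = torch.stack([preprocess(img) for img, _ in pairs])
    texts = tokenizer([cap for _, cap in pairs])
    img_feat = torch.nn.functional.normalize(model.encode_image(images), dim=-1)
    txt_feat = torch.nn.functional.normalize(model.encode_text(texts), dim=-1)
    sims = txt_feat @ img_feat.T                      # (num_texts, num_images)
    correct = sims.argmax(dim=1) == torch.arange(len(pairs))
    return correct.float().mean().item()

for year in range(2014, 2023):                        # one retrieval score per year bucket
    print(year, text_to_image_recall_at_1(load_pairs(year)))
```

Plotting the per-year scores for an OpenAI model against an OpenCLIP model trained on newer data would surface exactly the kind of temporal performance gap the study reports.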
Their work reveals the limitations of static benchmarks such as ImageNet and shows that models must adapt and evolve as data distributions change. A simplistic but common way to handle changing data is to start over each time a new set of image-text data arrives and train a new CLIP model from scratch. The rationale is that continuing training from an existing model can make it harder to adapt the model's behavior to new data. However, repeatedly investing the time and compute required to train new foundation models from scratch is not practical.
Recent efforts on continual learning for CLIP models have primarily aimed at improving performance on a single downstream task or a small set of tasks. Although some recent research has begun to address these issues, existing benchmarks are either too small in scale or lack paired image and text data to be truly useful.
As a first step toward Time-Continual (TiC) training of CLIP models, the researchers leverage the natural shift in data distribution over time. By adding crawl-time metadata to the existing CommonPool dataset, they establish TiC-DataComp as a new benchmark for time-continual training of CLIP models. They also repurpose other large-scale web datasets gathered from sources such as Reddit and Flickr: using the time information provided by YFCC and RedCaps, they construct TiC-YFCC and TiC-RedCaps, respectively. The goal of this work is to develop continual learning methods that can train on each newly arriving batch of data within a limited computational budget. These strategies are compared against an Oracle baseline, which restarts training from scratch whenever new data arrives and spends its entire accumulated compute budget on a brand-new model.
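The schematic sketch below (not the authors' code) illustrates this protocol: a continual learner updates its latest checkpoint under a fixed per-step compute budget as each timestamped chunk arrives, while the Oracle retrains from scratch with the full accumulated budget. `init_clip` and `train_clip` are hypothetical stand-ins for CLIP initialization and contrastive training.

```python
def init_clip():
    """Hypothetical placeholder for building a randomly initialized CLIP model."""
    return {"steps_trained": 0}

def train_clip(model, data, compute_budget):
    """Hypothetical placeholder for CLIP contrastive training under a step budget."""
    model = dict(model)
    model["steps_trained"] += compute_budget
    return model

def continual_run(data_chunks, budget_per_step):
    """Warm-start from the previous checkpoint at each step, fixed per-step budget."""
    model, seen = init_clip(), []
    for chunk in data_chunks:                 # chunks arrive in temporal order
        seen.append(chunk)
        model = train_clip(model, data=seen, compute_budget=budget_per_step)
        yield model

def oracle_run(data_chunks, budget_per_step):
    """Retrain from scratch at each step, spending the full accumulated budget."""
    seen = []
    for step, chunk in enumerate(data_chunks, start=1):
        seen.append(chunk)
        yield train_clip(init_clip(), data=seen,
                         compute_budget=step * budget_per_step)
```

The point of the comparison is that the Oracle's cost grows with every step, whereas the continual learner keeps a constant per-step budget.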
The researchers first evaluate models trained in the TiC-CLIP framework on a battery of 28 well-established classification and retrieval tasks, including ImageNet, ImageNet distribution shifts, and Flickr. Using these benchmarks, they then design and test a variety of continual learning approaches, including replay buffers, learning rate schedules, and methods that reuse earlier checkpoints (such as warm starting, patching, and distillation).
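As an illustration of the replay-buffer idea mentioned above (again a simplified sketch rather than the paper's implementation), one common choice is to keep a size-capped reservoir sample of past image-text pairs and mix it with each incoming chunk:

```python
import random

class ReplayBuffer:
    """Size-capped buffer of past samples, filled by reservoir sampling."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []
        self.seen = 0

    def add(self, pair):
        """Keep a uniform random sample of everything seen so far (Algorithm R)."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(pair)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = pair

# Placeholder stand-in for timestamped chunks of (image, caption) pairs.
data_chunks = [[(f"img_{t}_{i}", f"caption {t}-{i}") for i in range(1000)] for t in range(3)]

buffer = ReplayBuffer(capacity=500)
for chunk in data_chunks:
    training_set = list(chunk) + list(buffer.samples)   # mix new data with replayed data
    # ... train CLIP for this step on `training_set` (omitted) ...
    for pair in chunk:
        buffer.add(pair)
```

The buffer capacity is the knob behind the static-versus-dynamic performance tradeoff discussed next.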
The team draws an important lesson: the cumulative approach, which warm-starts training from the most recent checkpoint and replays all historical data, delivers performance on par with the Oracle while being roughly 2.7 times more compute-efficient. They also gain insights into learning rate schedules for sequential training and show interesting tradeoffs between replay-buffer size and static versus dynamic performance. Their findings are consistent across scales and methods, highlighting trends on datasets ranging from 11 million to 3 billion samples. Code and the time metadata collected for existing datasets will soon be released so that the broader community can use the proposed benchmarks. The team hopes that by shedding light on this underexplored topic, their work can pave the way for the continual training of foundation models.
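To illustrate one of the learning-rate questions raised here (the paper's exact schedule is not reproduced), the sketch below restarts a warmup-plus-cosine schedule at each sequential training step using PyTorch's LambdaLR; the peak learning rate and step counts are arbitrary placeholders.

```python
import math
import torch

def warmup_cosine(optimizer, total_steps, warmup_steps):
    """Linear warmup followed by cosine decay to zero over `total_steps`."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

params = [torch.nn.Parameter(torch.zeros(1))]        # stand-in for CLIP parameters
optimizer = torch.optim.AdamW(params, lr=1e-3)

for chunk_steps in [1000, 1000, 1000]:               # one entry per new data chunk
    # Restart the schedule at the arrival of each chunk.
    scheduler = warmup_cosine(optimizer, total_steps=chunk_steps, warmup_steps=100)
    for _ in range(chunk_steps):
        optimizer.step()                              # (actual training step omitted)
        scheduler.step()
```

Whether to restart the schedule per chunk or anneal over the whole sequence is exactly the kind of design choice the paper's benchmarks are meant to adjudicate.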
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning finance, cards & payments, and banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.