The most popular paradigm for solving modern vision tasks such as image classification and object detection on small datasets involves fine-tuning the latest pre-trained deep network, previously pre-trained on ImageNet and nowadays more likely on CLIP. This pipeline has been highly successful, but it still has some limitations.
Probably the main concern is the enormous effort required to collect and label these large image sets. Notably, the size of the most popular pretraining dataset has grown from 1.2M images (ImageNet) to 400M (CLIP), and the trend shows no sign of stopping. As a direct consequence, training such generalist networks requires computational resources that today only a few industrial or academic laboratories can afford. Another critical problem with these curated databases is their static nature: despite being huge, they go out of date, so their expressive power with respect to known concepts is limited in time.
Recent work by researchers at Carnegie Mellon University and the University of California, Berkeley proposes treating the Internet itself as a special dataset to overcome the aforementioned problems of the current pretraining-and-tuning paradigm.
In particular, the paper proposes a reinforcement learning-inspired, disembodied online agent called Internet Explorer that actively searches the Internet via standard search engines to find relevant visual data that improves the quality of features for a target dataset.
The agent's actions are text queries submitted to search engines, and its observations are the images returned by those searches.
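The loop below is a minimal, runnable sketch of this action-observation cycle. Everything in it is an illustrative assumption rather than the paper's implementation: `search_images` and `relevance_reward` are hypothetical stubs standing in for the real search-engine call and relevance scoring, and the moving-average update over query scores is just one simple way an agent could learn which queries pay off.

```python
import random

# Hypothetical stand-ins for the real components: an image-search call and a
# relevance score derived from the learner's representations (stubs only).
def search_images(query, limit=100):
    """Stub: would call an image search engine and return downloaded images."""
    return [f"{query}_{i}.jpg" for i in range(limit)]

def relevance_reward(images):
    """Stub: would embed the images and score their similarity to the target data."""
    return random.random()

def explore(concepts, num_iterations=10):
    """Action = text query; observation = images returned for that query."""
    scores = {c: 0.0 for c in concepts}  # running relevance estimate per concept
    for _ in range(num_iterations):
        # Action: pick a query, trading off estimated reward against exploration noise
        query = max(concepts, key=lambda c: scores[c] + random.random())
        images = search_images(query)        # observation: downloaded images
        reward = relevance_reward(images)    # how useful was this query?
        scores[query] = 0.9 * scores[query] + 0.1 * reward  # moving-average update
    return scores

print(explore(["monarch butterfly", "pasta carbonara", "sedan"]))
```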
The proposed approach differs from active learning and related work in that it performs a directed search over a continually expanding dataset, improving itself in a fully self-supervised manner without requiring labels, even from the target dataset. Notably, the approach is not tied to a single fixed dataset and does not require the intervention of expert labellers, as standard active learning does.
In practice, Internet Explorer uses WordNet concepts to query a search engine (for example, Google Images) and embeds those concepts in a representation space to learn, over time, which queries are relevant. The model leverages self-supervised learning to learn useful representations from the unlabeled images downloaded from the Internet; the initial vision encoder is a self-supervised pretrained MoCo-v3 model. Downloaded images are ranked according to a self-supervised loss that captures their similarity to the target dataset, used as an indicator of their relevance for training.
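As a concrete illustration of that ranking step, the sketch below scores each downloaded image by how close its embedding lies to the target dataset in representation space. This is a hedged reading of the relevance criterion described above, not the authors' code: `rank_by_target_similarity` is a hypothetical helper, and the random feature tensors merely stand in for MoCo-v3 embeddings.

```python
import torch
import torch.nn.functional as F

def rank_by_target_similarity(web_feats, target_feats, top_k=5):
    """Rank downloaded images by embedding proximity to the target dataset.

    web_feats:    (N, D) embeddings of images downloaded from the web
    target_feats: (M, D) embeddings of the (unlabeled) target dataset
    """
    web = F.normalize(web_feats, dim=1)     # unit-normalize for cosine similarity
    target = F.normalize(target_feats, dim=1)
    sim = web @ target.T                    # (N, M) cosine similarities
    # Score = mean similarity to the k nearest target images
    scores = sim.topk(top_k, dim=1).values.mean(dim=1)
    order = scores.argsort(descending=True) # most target-relevant images first
    return order, scores

# Toy usage with random features standing in for MoCo-v3 embeddings
web = torch.randn(1000, 256)
tgt = torch.randn(100, 256)
order, scores = rank_by_target_similarity(web, tgt)
```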
On five detailed and challenging popular benchmarks, i.e. Birdsnap, Flowers, Food101, Pets, and VOC2007, Internet Explorer (with the additional use of GPT-generated descriptors for concepts) manages to rival an oracle ResNet-50 trained with CLIP while reducing the amount of compute and the number of training images by one and two orders of magnitude, respectively.
In summary, this paper presents a novel, intelligent agent that queries the web to download and learn useful information for solving a given image classification task at a fraction of the training cost of previous approaches, and it opens up new research directions on the topic.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Lorenzo Brigato is a Postdoctoral Researcher at the ARTORG Center, a research institution affiliated with the University of Bern, where he currently works on applications of AI to health and nutrition. He received his PhD in Computer Science from the Sapienza University of Rome, Italy; his thesis focused on image classification problems with poor data distributions across samples and labels.