Large language models (LLMs) trained on massive web-scale datasets perform exceptionally well on natural language processing (NLP) tasks. In computer vision (CV), the Segment Anything Model (SAM) has demonstrated excellent zero-shot localization capabilities by scaling up data.
Unfortunately, SAM cannot produce semantic tags, even though recognition is a fundamental task on par with localization. The goal of multi-label image recognition, also known as image tagging, is to recognize multiple tags for a single image. Because images inherently contain many kinds of tags, including objects, scenes, attributes, and actions, image tagging is an important and practical computer vision problem.
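To make the multi-label setting concrete, here is a minimal sketch of how a tagger can emit several tags for one image: each tag gets its own independent probability and every tag above a threshold is kept. The function names, the scores, and the 0.5 threshold are illustrative assumptions, not details taken from RAM.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_tags(logits: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return every tag whose independent probability clears the threshold.

    Unlike single-label classification, tags do not compete with each
    other, so any number of them (including zero) can be returned.
    """
    return sorted(tag for tag, z in logits.items() if sigmoid(z) >= threshold)


# Hypothetical raw scores for one image (made up for illustration).
logits = {"dog": 2.1, "beach": 1.3, "running": 0.4, "car": -3.0}
print(predict_tags(logits))  # ['beach', 'dog', 'running']
```

An object ("dog"), a scene ("beach"), and an action ("running") can all be returned for the same image, which is what separates tagging from picking a single class.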
Two main factors make image tagging difficult:
- Difficulty collecting high-quality data at scale. The field still lacks a comprehensive, standardized tagging system, as well as an efficient data annotation engine that can semi-automatically or automatically annotate massive numbers of images across many categories.
- A lack of powerful open-vocabulary models whose efficient, flexible design can take advantage of large-scale, weakly supervised data.
The Recognize Anything Model (RAM), a strong foundation model for image tagging, has just been unveiled by researchers from the OPPO Research Institute, the International Digital Economy Academy (IDEA), and AI2 Robotics. RAM is designed to address inadequate tagging systems, insufficient datasets, inefficient data engines, and architectural constraints.
The researchers start by creating a universal, unified tagging system. They draw on academic datasets (classification, detection, and segmentation) and commercial taggers (Google, Microsoft, and Apple) to enrich it. By merging all available public tags with common tags extracted from text, the tagging system arrives at 6,449 tags that collectively cover the vast majority of use cases. The researchers claim that the remaining open-vocabulary tags can be recognized through open-set recognition.
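The merging step can be pictured with a small sketch. It assumes each source is just a list of tag strings and that a hand-written synonym table maps variants to one canonical name; the source names, tags, and synonym table below are hypothetical, and the article does not describe how the researchers actually merged their sources.

```python
def normalize(tag: str) -> str:
    """Lowercase a tag, turn underscores into spaces, and collapse whitespace."""
    return " ".join(tag.lower().replace("_", " ").split())


def build_tag_system(sources: dict[str, list[str]], synonyms: dict[str, str]) -> list[str]:
    """Union the tags from all sources, mapping known synonyms to one name."""
    canonical = set()
    for tags in sources.values():
        for tag in tags:
            name = normalize(tag)
            canonical.add(synonyms.get(name, name))
    return sorted(canonical)


# Hypothetical inputs: two academic datasets and one commercial tagger.
sources = {
    "classification_dataset": ["Dog", "sports_car", "Beach"],
    "detection_dataset": ["dog", "person", "car"],
    "commercial_tagger": ["Puppy", "Seaside", "person"],
}
synonyms = {"puppy": "dog", "seaside": "beach"}

print(build_tag_system(sources, synonyms))
# ['beach', 'car', 'dog', 'person', 'sports car']
```

Eight raw tags collapse to five canonical ones here; the same deduplication idea, applied at a much larger scale, is how overlapping sources can end up as a single fixed vocabulary.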
Automatically annotating images at scale with this tagging system is challenging. The proposed approach is inspired by previous work that trains robust visual models on large-scale public image-text pairs. To make use of this image-text data for tagging, the team applied automatic semantic parsing to the text to extract image tags. This gave them a large set of image tags derived from image-text pairs without relying on manual annotation.
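As a rough illustration of turning captions into tags, the sketch below scans a caption for words and short phrases that appear in a fixed vocabulary. This exact phrase matching is a deliberately simple stand-in for the semantic parsing the team used, and the vocabulary, caption, and function name are assumptions made for the example.

```python
import re


def tags_from_caption(caption: str, vocabulary: set[str], max_words: int = 3) -> list[str]:
    """Return vocabulary tags that occur in the caption as 1..max_words-word phrases."""
    words = re.findall(r"[a-z]+", caption.lower())
    found = set()
    for n in range(1, max_words + 1):
        # Slide a window of n words across the caption.
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in vocabulary:
                found.add(phrase)
    return sorted(found)


# Hypothetical vocabulary and caption.
vocabulary = {"dog", "beach", "frisbee", "golden retriever", "catch", "jump"}
caption = "A golden retriever jumps to catch a frisbee on the beach."

print(tags_from_caption(caption, vocabulary))
# ['beach', 'catch', 'frisbee', 'golden retriever']
```

Note what the stand-in misses: "jumps" does not match "jump", and nothing tells it that a golden retriever is a dog. Gaps like these are one reason tags extracted from text need a later cleaning and completion step.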
Image-text pairs from the Internet tend to be noisy, so the extracted tags are often inaccurate. The team therefore builds a data engine to improve annotation accuracy. To address missing tags, they use existing models to generate additional tags. To address incorrect tags, they localize the regions within the image that correspond to each tag, then use region clustering to find and eliminate outliers within the same category. Tags whose predictions are inconsistent are also removed for more accurate annotation.
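The outlier-removal idea can be sketched as follows. For one tag, every region carrying that tag is represented by an embedding vector, and the regions farthest from the group's centroid are discarded. Using a single centroid instead of a full clustering algorithm, the 10% drop fraction, and the toy 2-D embeddings are all simplifying assumptions for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def drop_outliers(regions, drop_fraction=0.1):
    """Keep the region ids closest to the centroid of their tag.

    regions: list of (region_id, embedding) pairs that all carry the SAME tag.
    Returns the kept ids, ordered from most to least typical.
    """
    center = centroid([emb for _, emb in regions])
    ranked = sorted(regions, key=lambda r: math.dist(r[1], center))
    n_drop = round(len(regions) * drop_fraction)
    return [rid for rid, _ in ranked[:len(regions) - n_drop]]


# Toy data: nine similar "dog" regions plus one that looks nothing like them.
regions = [(f"img{i}", (1.0 + 0.01 * i, 0.0)) for i in range(9)]
regions.append(("img_bad", (-5.0, 4.0)))

kept = drop_outliers(regions)
print("img_bad" in kept, len(kept))  # False 9
```

The region that sits far from the rest is the one most likely to be mislabeled, so removing it trades a small amount of data for cleaner supervision.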
RAM generalizes to new categories by incorporating semantic information into its tag queries. With this architecture, RAM's recognition capabilities can be applied to any visual dataset, demonstrating its versatility. By showing that a general model trained on noisy, annotation-free data can beat fully supervised models, RAM introduces a new paradigm for image tagging. RAM needs only free, publicly available data without manual annotations, and the most powerful version of RAM only needs to be trained for three days on eight A100 GPUs.
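One way to see why semantic tag queries help with unseen categories: if every tag is represented by an embedding of its name rather than by a fixed output slot, a new tag can be added at inference time simply by embedding its name. The sketch below scores an image embedding against tag embeddings with cosine similarity. The vectors are made up, and plain cosine matching is a simplification chosen for illustration, not a description of RAM's actual architecture.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def recognize(image_embedding, tag_embeddings, threshold=0.5):
    """Return tags whose name embedding is close enough to the image embedding."""
    return sorted(
        tag for tag, emb in tag_embeddings.items()
        if cosine(image_embedding, emb) >= threshold
    )


# Made-up 3-D embeddings; a real system would get these from trained encoders.
tag_embeddings = {
    "dog": (0.9, 0.1, 0.0),
    "cat": (0.1, 0.9, 0.0),
}
image_embedding = (0.85, 0.15, 0.05)
print(recognize(image_embedding, tag_embeddings))  # ['dog']

# Adding a tag never seen in training only requires embedding its name.
tag_embeddings["corgi"] = (0.8, 0.2, 0.1)
print(recognize(image_embedding, tag_embeddings))  # ['corgi', 'dog']
```

Because the new tag lives in the same space as the known ones, no retraining or new output layer is needed to start scoring it.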
According to the team, RAM can still be improved. Possible steps include running more iterations of the data engine, increasing the backbone's parameter count to raise model capacity, and expanding the training dataset beyond 14 million images to better cover diverse domains.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.