Instance segmentation is the computer vision task of identifying and separating multiple objects of the same class within an image, treating each as a distinct entity. In recent years, the number of instance segmentation techniques has grown significantly thanks to rapid advances in deep learning. Architectures built on Convolutional Neural Networks (CNNs), such as Mask R-CNN, are commonly used for instance segmentation. The defining feature of these techniques is that they combine object detection with pixel-level segmentation to identify objects and generate an accurate mask for each instance, leading to a better understanding of the overall image.
However, existing detection models are limited in the number of base categories they can identify. For example, a detection model trained on the COCO dataset can detect about 80 categories. Adding any further categories requires human annotation, which is laborious and time-consuming. To counteract this, Open Vocabulary (OV) methods take advantage of image-caption pairs and vision-language models to learn new categories. However, the supervision available for base categories differs greatly from that available for novel ones. This often leads to overfitting on base categories and poor generalization to novel ones. As a result, there is a strong need for a method that lets detection models recognize new categories without much human intervention, which would make them more practical and scalable for real-world applications.
To address this issue, Salesforce AI researchers have devised a method that generates instance mask and bounding box annotations from image-caption pairs. Their proposed method, the Mask-free OVIS pipeline, takes advantage of weak supervision by using pseudo-mask annotations derived from a vision-language model to learn both base and novel categories. This approach eliminates the need for laborious human annotation and addresses the problem of overfitting. Experimental evaluations show that the method outperforms state-of-the-art open-vocabulary instance segmentation models. Additionally, their research has been accepted at the Conference on Computer Vision and Pattern Recognition (CVPR) 2023.
The Salesforce researchers designed a pipeline that consists of two main stages: pseudo-mask generation and open-vocabulary instance segmentation. In the first stage, a pseudo-mask annotation for the object of interest is created from the image-caption pair. Using a pre-trained vision-language model, the object name serves as a text prompt to localize the object. An iterative masking process is then performed with GradCAM to refine the pseudo-mask and ensure that it covers the entire object accurately. In the second stage, a weakly supervised segmentation (WSS) network is trained using the previously generated bounding boxes, each obtained by selecting the proposal with the highest overlap with the GradCAM activation map. Finally, a Mask R-CNN model is trained using the generated pseudo-annotations, completing the pipeline.
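The proposal-selection step described above can be illustrated with a small NumPy sketch. This is not the researchers' implementation: the function names, the 0.5 activation threshold, and the use of intersection-over-union as the overlap measure are all assumptions made for illustration, and the activation map here is a toy array rather than real GradCAM output.

```python
import numpy as np


def box_activation_iou(box, active):
    """IoU between a box and a boolean activation mask.

    `box` is (x1, y1, x2, y2) in pixel coordinates, with x2 and y2 exclusive.
    Using IoU as the overlap score is an assumption, not the paper's formula.
    """
    x1, y1, x2, y2 = box
    box_area = (x2 - x1) * (y2 - y1)
    inter = int(active[y1:y2, x1:x2].sum())
    union = box_area + int(active.sum()) - inter
    return inter / union if union > 0 else 0.0


def select_best_proposal(activation_map, proposals, threshold=0.5):
    """Return the proposal that overlaps most with the activated region.

    The threshold used to binarize the activation map is hypothetical.
    """
    active = activation_map >= threshold
    scores = [box_activation_iou(box, active) for box in proposals]
    best = int(np.argmax(scores))
    return proposals[best], scores[best]


# Toy stand-in for a GradCAM activation map: one bright 4x4 region.
activation = np.zeros((8, 8))
activation[2:6, 3:7] = 1.0

# Candidate boxes as (x1, y1, x2, y2).
proposals = [(0, 0, 4, 4), (3, 2, 7, 6), (0, 0, 8, 8)]

best_box, best_score = select_best_proposal(activation, proposals)
print(best_box, best_score)
```

In this toy case the second box coincides exactly with the activated region, so it is the one selected. With real activation maps the choice of threshold and overlap measure would affect which proposal wins.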
The pipeline therefore eliminates the need for human involvement by harnessing pre-trained vision-language models and weakly supervised models to automatically generate pseudo-mask annotations, which can be used as input data for additional training. To test their pipeline, the researchers ran several experiments on popular datasets such as MS-COCO and OpenImages. The results showed that using pseudo-annotations leads to strong performance in instance detection and segmentation tasks, outperforming other methods that rely on human annotations. The vision- and language-driven approach to pseudo-annotation generation devised by the Salesforce researchers paves the way for more advanced and accurate instance segmentation models that do not require human annotators.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing, and web development. She likes to learn more about the technical field by participating in various challenges.