Detecting objects from an open vocabulary is a critical capability in many real-world computer vision tasks. However, the limited availability of detection training data and the brittleness of pretrained models often lead to poor performance and scalability issues.
To address this challenge, the DeepMind research team introduces the OWLv2 model in their paper, “Scaling Open-Vocabulary Object Detection.” This optimized architecture improves training efficiency and, combined with the OWL-ST self-training recipe, substantially improves detection performance, achieving state-of-the-art results on open-vocabulary detection tasks.
The main goal of this work is to optimize the label space, annotation filtering, and training efficiency of the self-training approach for open-vocabulary detection, ultimately achieving robust and scalable open-vocabulary performance with limited labeled data.
The proposed self-training approach consists of three key steps:
- The team uses an existing open-vocabulary detector to perform open-box detection on WebLI, a large-scale dataset of web image-text pairs.
- They use OWL-ViT CLIP L/14 to annotate all WebLI images with pseudo-bounding-box annotations.
- They fine-tune the resulting model on human-annotated detection data, further refining its performance.
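The annotation-and-filtering portion of the pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_detector` is a hypothetical stand-in for the OWL-ViT annotator, and the confidence threshold is an assumed filtering criterion.

```python
def toy_detector(image, queries):
    # Hypothetical stand-in for OWL-ViT CLIP L/14: returns
    # (box, label, score) triples, one per query.
    return [((0, 0, 10, 10), q, image.get(q, 0.0)) for q in queries]

def pseudo_annotate(images, detector, queries, score_thresh=0.3):
    """Steps 1-2 of the recipe: run an existing open-vocabulary detector
    over web images and keep only confident pseudo-box annotations."""
    annotations = []
    for img in images:
        detections = detector(img, queries)
        kept = [(box, label, score) for box, label, score in detections
                if score >= score_thresh]
        if kept:
            annotations.append((img, kept))
    return annotations

# Example: one toy "image" where only "cat" is detected confidently.
images = [{"cat": 0.9, "dog": 0.1}]
pseudo_labels = pseudo_annotate(images, toy_detector, ["cat", "dog"])
```

The low-confidence “dog” detection is filtered out, so the pseudo-labeled set retains only the confident “cat” box.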
Specifically, the researchers employ a variant of the OWL-ViT architecture to train more effective detectors. This architecture uses contrastively pretrained image and text models to initialize the image and text encoders, while the detection heads are initialized randomly.
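This initialization scheme can be sketched as follows. The class and parameter names here are illustrative inventions, and the lists stand in for real weight tensors; the point is simply that encoders inherit pretrained weights while heads start from scratch.

```python
import random

class OpenVocabDetectorInit:
    """Toy sketch of the initialization: encoders copy contrastively
    pretrained (CLIP-style) weights; detection heads start random."""
    def __init__(self, image_weights, text_weights, head_dim=4, seed=0):
        # Encoders: copied from the contrastively pretrained model.
        self.image_encoder = list(image_weights)
        self.text_encoder = list(text_weights)
        # Detection heads: small random initialization.
        rng = random.Random(seed)
        self.box_head = [rng.gauss(0.0, 0.02) for _ in range(head_dim)]
        self.class_head = [rng.gauss(0.0, 0.02) for _ in range(head_dim)]

model = OpenVocabDetectorInit([0.1, 0.2], [0.3, 0.4])
```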
During training, the team uses the same losses as OWL-ViT and augments queries with “pseudo-negatives,” optimizing training efficiency to maximize utilization of the available labeled images.
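The pseudo-negative augmentation can be illustrated with a short sketch: query sets are padded with labels sampled from the vocabulary that do not occur in the image. The function name and sampling scheme below are assumptions chosen to convey the idea, not the paper's exact procedure.

```python
import random

VOCAB = ["cat", "dog", "car", "tree", "boat", "lamp"]

def augment_queries(positive_labels, vocabulary, num_negatives=2, seed=0):
    """Pad an image's query set with 'pseudo-negatives': labels drawn
    from the vocabulary that were not detected in the image."""
    rng = random.Random(seed)
    candidates = [w for w in vocabulary if w not in positive_labels]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    return list(positive_labels) + negatives

queries = augment_queries(["cat"], VOCAB)
```

Each training example thus sees extra hard negatives, which sharpens the classifier without requiring any additional labeled images.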
They also incorporate previously proposed practices for large-scale Transformer training to further improve efficiency. As a result, the OWLv2 model reduces training FLOPs by approximately 50% and roughly doubles training throughput compared to the original OWL-ViT model.
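One efficiency measure of this kind is removing image tokens that carry no content (e.g. padded regions) before they reach the Transformer encoder, which directly shrinks sequence length and hence compute. The sketch below is a minimal illustration of that idea, with invented names and toy string tokens in place of real feature vectors.

```python
def drop_padding_tokens(tokens, is_padding):
    """Remove tokens covering padded (contentless) image regions before
    the encoder, shrinking sequence length and hence training FLOPs."""
    return [tok for tok, pad in zip(tokens, is_padding) if not pad]

# Toy example: a 4-token sequence where positions 1 and 3 are padding.
kept = drop_padding_tokens(["t0", "t1", "t2", "t3"],
                           [False, True, False, True])
```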
The team compares their proposed approach with state-of-the-art open-vocabulary detectors in their empirical study. The OWL-ST recipe improves Average Precision (AP) on LVIS rare classes from 31.2% to 44.6%. In addition, combining the OWL-ST recipe with the OWLv2 architecture achieves new state-of-the-art performance.
Overall, the OWL-ST recipe presented in this paper significantly improves detection performance by taking advantage of weak supervision from large-scale web data, enabling web-scale training for open-world localization. This approach addresses the limitations posed by the scarcity of labeled detection data and demonstrates the potential to achieve robust open-vocabulary object detection in a scalable manner.
Check out the Paper.
Niharika is a technical consulting intern at Marktechpost. She is a third year student, currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a very enthusiastic individual with a strong interest in machine learning, data science, and artificial intelligence and an avid reader of the latest developments in these fields.