Introduction
As 2023 comes to a close, the exciting news for the computer vision community is that Google has recently made strides into the world of zero-shot object detection with the release of OWLv2. This cutting-edge model is now available in Transformers and represents one of the most robust zero-shot object detection systems to date. It builds on the foundation laid by OWL-ViT v1, which was introduced last year.
In this article, we will look at the architecture and behavior of this model and walk through a practical approach to running inference. Let us begin.
Learning objectives
- Understand the concept of zero-shot object detection in computer vision.
- Learn the technology and self-training approach behind Google’s OWLv2 model.
- Apply OWLv2 in practice through hands-on inference examples.
This article was published as part of the Data Science Blogathon.
The technology behind OWLv2
OWLv2’s impressive capabilities can be attributed to its novel self-training approach. The model was trained on a web-scale dataset comprising over a billion examples. To achieve this, the authors harnessed the power of OWL-ViT v1, using it to generate pseudo-box annotations (pseudo-labels), which were in turn used to train OWLv2.
Additionally, the model was fine-tuned on human-annotated detection data, resulting in performance improvements over its predecessor, OWL-ViT v1. Self-training opens up the possibility of web-scale training for open-world localization, mirroring trends already seen in object classification and language modeling.
OWLv2 architecture
While the architecture of OWLv2 is similar to that of OWL-ViT, there is one notable addition to its object detection head: it now includes an objectness classifier that predicts the probability that a predicted box contains an object. The objectness score can be used to rank or filter predictions independently of the text queries.
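To make this concrete, recent versions of Transformers expose this score as objectness_logits on the detection output, alongside pred_boxes. Below is a minimal sketch (not an official recipe) of ranking predicted boxes purely by objectness; it assumes that model, processor, and a PIL image have been loaded as shown in the inference steps later in this article.
# A minimal sketch: rank predicted boxes by objectness, independent of text queries.
# Assumes `processor`, `model`, and a PIL `image` as set up in the steps below.
import torch
# A dummy text query is passed because the forward pass expects text inputs;
# the objectness score itself does not depend on the query.
inputs = processor(text=[["object"]], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# One objectness logit per predicted box; higher means "more likely an object".
objectness = torch.sigmoid(outputs.objectness_logits[0])
top_scores, top_indices = objectness.topk(5)    # the five most object-like boxes
top_boxes = outputs.pred_boxes[0][top_indices]  # normalized box coordinates
print(top_scores)
print(top_boxes)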
Zero-shot object detection
Zero-shot learning is a term that has become widely known with the rise of generative AI. It is commonly encountered in large language model (LLM) work, where a base model is adapted so that it generalizes to categories it never saw during training. Zero-shot object detection is a game-changer in computer vision: a model detects objects in images from free-text queries, without requiring manually annotated bounding boxes for every category. This not only speeds up the process but also removes a great deal of tedious manual annotation work.
How to use OWLv2?
OWLv2 follows a similar approach to OWL-ViT but features an updated image processor, Owlv2ImageProcessor. Additionally, the model relies on CLIPTokenizer to encode text. Owlv2Processor is a useful tool that combines Owlv2ImageProcessor and CLIPTokenizer, simplifying the text encoding process. Below is an example of how to perform object detection using Owlv2Processor and Owlv2ForObjectDetection.
Find the full code here: https://github.com/inuwamobarak/OWLv2
Step 1: Set up the environment
In this step, we start by installing the Transformers library from GitHub.
# Install the 🤗 Transformers library from GitHub.
!pip install -q git+https://github.com/huggingface/transformers.git
Step 2: Load model and processor
Here, we load an OWLv2 checkpoint from the Hugging Face Hub. Note that several checkpoints are available; in this example we load the ensemble checkpoint.
# Load an OWLv2 checkpoint from the hub.
from transformers import Owlv2Processor, Owlv2ForObjectDetection
# Load the processor and model.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
Step 3: Load and process images
In this step, we load an image in which we want to detect objects.
# Load an image that you want to analyze.
from huggingface_hub import hf_hub_download
from PIL import Image
# Replace the file paths accordingly.
filepath = hf_hub_download(repo_id="adirik/OWL-ViT", repo_type="space", filename="assets/astronaut.png")
image = Image.open(filepath)
Step 4: Prepare the image and queries for the model
OWLv2 is capable of detecting objects given text queries. In this step, we prepare the image and text queries for the model using the processor.
# Define the text queries that you want the model to detect.
texts = [['face', 'bag', 'shoe', 'hair']]
# Prepare the image and text for the model using the processor.
inputs = processor(text=texts, images=image, return_tensors="pt")
# Print the shapes of input tensors.
for key, val in inputs.items():
    print(f"{key}: {val.shape}")
Step 5: Forward pass
In this step, we forward the inputs through the model. We use torch.no_grad() to reduce memory usage since we don’t need gradients at inference time.
# Import the torch library.
import torch
# Perform a forward pass through the model.
with torch.no_grad():
    outputs = model(**inputs)
Step 6: View the results
In this final step, we convert the model outputs to COCO API format and visualize the results by drawing bounding boxes and labels on the image.
# Convert model outputs to COCO API format.
target_sizes = torch.Tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)
# Retrieve predictions for the first image.
i = 0
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
# Draw bounding boxes and labels on the image.
from PIL import ImageDraw
draw = ImageDraw.Draw(image)
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    x1, y1, x2, y2 = tuple(box)
    draw.rectangle(xy=((x1, y1), (x2, y2)), outline="red")
    draw.text(xy=(x1, y1), text=text[label])
# Display the image with bounding boxes and labels.
image
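If you prefer a plain-text summary instead of an annotated image, the boxes, scores, and labels returned by the post-processing step above can simply be printed. A minimal sketch, reusing the variables from Step 6:
# Print each detection: matched text query, confidence score, and box corners.
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected '{text[label]}' with confidence {round(score.item(), 3)} at {box}")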
Image-guided one-shot object detection
We can also perform image-guided one-shot object detection with OWLv2, meaning that we detect objects in a target image based on an example query image rather than a text prompt.
Code: https://github.com/inuwamobarak/OWLv2
# Import necessary libraries
# %matplotlib inline # Uncomment this line for compatibility if using Jupyter Notebook.
import cv2
from PIL import Image
import requests
import torch
from matplotlib import rcParams
import matplotlib.pyplot as plt
# Set the figure size
rcParams['figure.figsize'] = 11, 8
# Load the input image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
target_sizes = torch.Tensor([image.size[::-1]])
# Load the query image
query_url = "http://images.cocodataset.org/val2017/000000058111.jpg"
query_image = Image.open(requests.get(query_url, stream=True).raw)
# Display the input image and query image side by side.
fig, ax = plt.subplots(1, 2)
ax[0].imshow(image)
ax[1].imshow(query_image)
After loading the two images, we preprocess the input and print the shape.
# Define the device to use for processing.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Process the input and query images using the processor.
inputs = processor(images=image, query_images=query_image, return_tensors="pt").to(device)
# Print the input names and shapes.
for key, val in inputs.items():
    print(f"{key}: {val.shape}")
Next, we perform image-guided object detection and print the shapes of the model's outputs, including the vision model outputs.
# Perform image-guided object detection using the model.
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)
# Print the shapes of the model's outputs.
for k, val in outputs.items():
    if k not in {"text_model_output", "vision_model_output"}:
        print(f"{k}: shape of {val.shape}")
print("\nVision model outputs")
for k, val in outputs.vision_model_output.items():
    print(f"{k}: shape of {val.shape}")
Finally, we visualize the results by drawing bounding boxes on the image. The code converts the PIL image to an OpenCV array and post-processes the detection results.
# Visualize the results
import numpy as np
# Convert the PIL image to a NumPy array in OpenCV's BGR channel order.
img = cv2.cvtColor(np.array(image), cv2.COLOR_BGR2RGB)
outputs.logits = outputs.logits.cpu()
outputs.target_pred_boxes = outputs.target_pred_boxes.cpu()
# Post-process the detection results.
results = processor.post_process_image_guided_detection(outputs=outputs, threshold=0.9, nms_threshold=0.3, target_sizes=target_sizes)
boxes, scores = results[0]["boxes"], results[0]["scores"]
# Draw bounding boxes on the image.
for box, score in zip(boxes, scores):
    box = [int(coord) for coord in box.tolist()]
    img = cv2.rectangle(img, tuple(box[:2]), tuple(box[2:]), (255, 0, 0), 5)
    # Pick a y coordinate that would keep an optional score label inside the image.
    if box[3] + 25 > 768:
        y = box[3] - 10
    else:
        y = box[3] + 25
# Display the image with predicted bounding boxes.
plt.imshow(img[:, :, ::-1])
Scaling open-vocabulary object detection
Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but it is often hampered by the limited availability of detection training data. To address this, the authors turned to self-training, using existing detectors to generate pseudo-box annotations on web image-text pairs. Scaling up self-training presents its own set of challenges, including the choice of label space, pseudo-annotation filtering, and training efficiency.
OWLv2 and the OWL-ST self-training recipe were developed to overcome these challenges. As a result, OWLv2 outperforms previous state-of-the-art open-vocabulary detectors even at comparable training scales of around 10 million examples.
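To give a feel for the recipe, here is an illustrative, simplified pseudo-code sketch of the self-training loop. The helper functions extract_queries, filter_by_score, and train_detector are hypothetical placeholders for the stages described in the paper, not the authors' actual code.
# Illustrative sketch of the OWL-ST self-training recipe (hypothetical helpers).
def owl_st_self_training(web_image_text_pairs, teacher_detector, score_threshold=0.3):
    pseudo_annotations = []
    for image, caption in web_image_text_pairs:
        # 1. Label space: derive detection queries (e.g. caption n-grams) from the free text.
        queries = extract_queries(caption)
        # 2. Pseudo-annotation: let an existing detector (OWL-ViT v1) propose boxes.
        detections = teacher_detector(image, queries)
        # 3. Filtering: keep only confident pseudo-box annotations.
        kept = filter_by_score(detections, score_threshold)
        if kept:
            pseudo_annotations.append((image, kept))
    # 4. Train the new detector (OWLv2) on the pseudo-annotated web data,
    #    optionally followed by fine-tuning on human-annotated detection data.
    return train_detector(pseudo_annotations)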
Impressive performance and scaling
The performance of OWLv2 is truly impressive. With an L/14 architecture, OWL-ST improves average precision (AP) on rare LVIS classes from 31.2% to 44.6%, even though the model has never seen human-annotated boxes for these rare classes.
OWL-ST’s ability to scale to over a billion examples signifies an achievement in web-scale training for open world localization, similar to what we have witnessed in object classification and language modeling.
Conclusion
OWLv2 and the innovative OWL-ST self-training recipe represent a step forward in zero-shot object detection. These advances promise to reshape the computer vision landscape by making it easier and more efficient to detect objects in images without manually annotated bounding boxes. We encourage you to explore OWLv2 in your own projects; the possibilities are exciting, and we look forward to seeing how the computer vision community leverages this technology for innovative solutions.
Key takeaways
- OWLv2 is Google’s latest model for zero-shot object detection, available in Transformers, and builds on the previous version, OWL-ViT v1.
- Zero-shot object detection eliminates the need for manually annotated bounding boxes, making the process more efficient and less tedious.
- OWLv2 uses self-training on a web-scale dataset of over 1 billion examples and leverages pseudo-labels generated by OWL-ViT v1 to improve performance.
Frequently asked questions
Q1. What is zero-shot object detection, and why is it important?
A1: Zero-shot object detection is a way for models to detect objects in images without the need for manually annotated bounding boxes. It is important because it speeds up the object detection process and makes it far less laborious.
Q2. How does self-training work in OWLv2?
A2: Self-training involves using an existing detector to generate pseudo-box annotations on image-text pairs. OWLv2 takes advantage of this self-training approach to improve performance and scalability.
Q3. What is the objectness classifier in OWLv2?
A3: The objectness classifier in the OWLv2 object detection head predicts the probability that a predicted box contains an object. This score can be used to rank or filter predictions independently of text queries.
Q4. How do I use OWLv2 for text-conditioned object detection?
A4: Use OWLv2 with Owlv2Processor, which combines Owlv2ImageProcessor and CLIPTokenizer, to perform text-conditioned object detection. Practical examples are available in this article.
Q5. What challenges does the OWL-ST self-training recipe address?
A5: Self-training addresses challenges such as the choice of label space, pseudo-annotation filtering, and training efficiency when scaling open-vocabulary object detection.
Q6. What applications can benefit from OWLv2?
A6: OWLv2's capabilities can benefit a wide range of computer vision applications, including object detection, image understanding, and more. Researchers and developers can leverage this technology to build innovative solutions.
Reference links
- https://github.com/inuwamobarak/OWLv2
- https://huggingface.co/docs/transformers/main/en/model_doc/owlv2
- https://arxiv.org/abs/2306.09683
- https://huggingface.co/docs/transformers/main/en/model_doc/owlvit
- https://arxiv.org/abs/2205.06230
- Minderer, M., Gritsenko, A., & Houlsby, N. (2023). Scaling Open-Vocabulary Object Detection. arXiv. https://arxiv.org/abs/2306.09683
The media shown in this article is not the property of Analytics Vidhya and is used at the author’s discretion.