Owl ViT is a computer vision model that has become very popular and has found applications in various industries. The model takes an image and a text query as input; after processing the image, it returns a confidence score and the location of the object named in the text query.
The vision transformer architecture of this model allows it to understand the relationship between text and images, which is why it uses both an image encoder and a text encoder during processing. Owl ViT builds on CLIP, so similarities between image and text can be learned accurately with a contrastive loss.
Learning objectives
- Learn about Owl ViT's zero-shot object detection capabilities.
- Study the model architecture and image processing phases of this model.
- Explore Owl ViT object detection by running inference.
- Learn about real-life applications of Owl ViT.
This article was published as part of the Data Science Blogathon.
What is zero-shot object detection?
Zero-shot object detection is a computer vision technique that lets a model identify objects from classes it has never been explicitly trained on. The model takes an image together with a list of candidate labels and determines which of those candidates is most likely present in the image. It also returns bounding boxes that mark the position of each detected object.
Traditionally, a model like Owl ViT would need a large amount of labeled training data to perform these tasks: thousands of images of cars, cats, dogs, bicycles, and so on would be used during training. Zero-shot object detection sidesteps this requirement by relying on text-image similarity, which lets you bring in plain text descriptions instead. The model then uses its understanding of language to perform the detection. This concept is the basis of the model's architecture, which leads us to the next section.
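To make the idea concrete, here is a minimal sketch using the zero-shot-object-detection pipeline from Hugging Face Transformers with the same checkpoint used later in this article. The image path and candidate labels are placeholders you would replace with your own.
from transformers import pipeline
from PIL import Image

# Candidate labels are plain text, not classes the detector was fine-tuned on.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("five cats.jpg")  # placeholder image path
predictions = detector(image, candidate_labels=["a photo of a cat", "a photo of a bicycle"])

for pred in predictions:
    print(pred["label"], round(pred["score"], 3), pred["box"])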
Owl ViT Base Patch32 Model Architecture
Owl ViT is an open source model that uses a CLIP backbone for image-text matching. It can detect objects of almost any kind and match images to text descriptions using computer vision techniques.
The basis of this model is its vision transformer architecture, which splits an image into a sequence of patches that are processed by a transformer encoder.
A text transformer encoder handles the model's language understanding and processes the input text query, while the vision transformer encoder works on the image patches. With this structure, the model can find the relationship between text descriptions and images.
The vision transformer architecture has become popular for many computer vision tasks. With the Owl ViT model, zero-shot object detection is a game-changer: the model can classify objects in images described by words it has not seen during training, removing the need to collect class-specific training data.
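Because Owl ViT's matching step follows the CLIP recipe, a small sketch with the standalone CLIP model shows how contrastive image-text similarity works. The openai/clip-vit-base-patch32 checkpoint and the image path below are our own choices for illustration, not something the article prescribes.
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("five cats.jpg")  # placeholder image path
inputs = clip_processor(text=["a photo of a cat", "a photo of a dog"],
                        images=image, return_tensors="pt", padding=True)

outputs = clip(**inputs)
# Higher values mean a closer image-text match under the contrastive objective.
print(outputs.logits_per_image.softmax(dim=-1))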
How to use this Owl ViT Base Patch 32 model?
To put this theory into practice, we need to meet some requirements before running the model. We will use the Hugging Face Transformers library, which gives us access to open source transformer models and tooling. There are a few steps to running this model, starting with importing the necessary libraries.
Import the necessary libraries
First of all, we need to import three essential libraries to run this model: requests, PIL.Image, and torch. Each of these libraries is required for the object detection task. Here's a brief breakdown:
The requests library is used for making HTTP requests and accessing web APIs. It can interact with web servers, allowing you to download web content, such as images, from URLs. The PIL (Pillow) library allows you to open, modify, and save images in different file formats. Torch (PyTorch) is a deep learning framework that provides tensor operations, model training, GPU support, and other machine learning utilities.
import requests
from PIL import Image
import torch
Loading the Owl ViT model
The next step is to load the Owl ViT model together with its processor, which prepares the input data for the model.
from transformers import OwlViTProcessor, OwlViTForObjectDetection
This import gives you the processor, which handles the input formats, resizes images, and tokenizes text descriptions, as well as the object detection model itself.
In this case we use Owl ViT for object detection, so we instantiate both the processor and the detection model from the same pretrained checkpoint.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
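Optionally, and not required for the rest of this walkthrough, you can move the model to a GPU when one is available and switch it to evaluation mode. If you do this, remember to move the processed inputs to the same device before calling the model.
import torch

# Optional: run inference on a GPU when available and disable training-only layers.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()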
Image processing parameters
image_path = "/content/five cats.jpg"
image = Image.open(image_path)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)
An Owl ViT processor has to be compatible with the input you want to use. Calling processor(text=texts, images=image, return_tensors="pt") therefore not only processes the image and the text descriptions, it also indicates that the preprocessed data should be returned as PyTorch tensors.
Here, image_path points to a file on our computer, which PIL loads for the object detection task. An alternative is to fetch the image from a URL, as sketched below.
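If you prefer the URL route, the requests library imported earlier can stream the image straight into PIL; the URL below is only a placeholder.
# Alternative: load the image from a URL instead of a local file.
url = "https://example.com/five_cats.jpg"  # placeholder URL
image = Image.open(requests.get(url, stream=True).raw)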
There are some common image processing parameters with the OWL-ViT model, and here we will briefly look at some of them:
- pixel_values: This parameter represents the raw image data for one or more images. It is a torch.Tensor with the batch size, the color channels (num_channels), and the height and width of each image. Pixel values are usually normalized to a range such as 0 to 1 or -1 to 1.
- query_pixel_values: While pixel_values carries the raw data of the target images, this parameter lets you provide pixel data for query images that the model will attempt to find within those target images.
- output_attentions: This parameter asks the model to return attention weights between tokens or image patches. Attention tensors help you visualize which parts of the input the model prioritizes, in this case the detected object.
- return_dict: This is another important parameter that controls how the model returns its results. If set to True, the model returns a structured output object whose fields you can access by name, as in the sketch after this list.
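The sketch below, which assumes the inputs prepared earlier, shows the last two flags being passed to the forward call; they are standard Hugging Face forward arguments rather than anything specific to this article.
# Request attention weights and a structured output object from the forward pass.
outputs = model(**inputs, output_attentions=True, return_dict=True)
print(outputs.logits.shape)       # class logits for each predicted box
print(outputs.pred_boxes.shape)   # predicted (normalized) bounding boxes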
Processing text and image input for object detection
The texts list holds the candidate classes: "a photo of a cat" and "a photo of a dog." The processor prepares the text and the image so that they are suitable as input to the model. The output contains information about the objects detected in the image, including a confidence score for each detection, along with bounding boxes that identify the location of each object in the image.
# Target image sizes (height, width) to rescale box predictions (batch_size, 2)
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)
This code rescales the predicted bounding boxes to the original image size and converts the raw model output into a COCO-style format. The result is a structured output of the detected objects, each with its bounding box, confidence score, and class label, suitable for evaluation or further use in applications.
Here's a simple breakdown:
target_sizes = torch.Tensor([image.size[::-1]]): This line defines the target image sizes in (height, width) format. It inverts the original PIL image dimensions, which come as (width, height), and stores them as a PyTorch tensor.
Additionally, the code uses the processor.post_process_object_detection method to convert the raw output of the model into bounding boxes, confidence scores, and class labels.
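For reference, here is a quick look at the structure that post_process_object_detection returns, assuming the results variable from the code above.
# `results` holds one dictionary per image in the batch.
print(results[0].keys())           # dict_keys(['scores', 'labels', 'boxes'])
print(results[0]["boxes"].shape)   # (num_detections, 4), boxes as (xmin, ymin, xmax, ymax)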
Image and text matching
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
Here you retrieve the detection results for the first image: the text queries, the confidence scores, and the labels of the objects detected in the image. The full code for this walkthrough is available in this notebook.
Finally, we print a summary of the results after completing the object detection task. We can do this with the code shown below:
# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
Real-life application of the Owl ViT object detection model
Today, many tasks involve computer vision and object detection. Owl ViT can be useful in each of the following applications:
- Image search is one of the most obvious ways to use this model. Because it can match text with images, users will only need to enter a text message to search for images.
- Object detection can also find useful applications in robotics to identify objects in their environment.
- Users with vision loss may also find this tool valuable, as this model can describe image content based on their text queries.
Conclusion
Computer vision models are remarkably versatile, and Owl ViT is no different. Thanks to the model's zero-shot capabilities, you can use it on new object classes without any additional training. Its strength comes from leveraging CLIP and the vision transformer architecture for image and text matching, which makes it straightforward to explore.
Key takeaways
- Zero-shot object detection is a game-changer in this model architecture. It allows the model to perform tasks with images without prior knowledge of image classes. Text queries can also help identify objects, avoiding the need for large amounts of data for pre-training.
- This model's ability to match text and image pairs allows you to identify objects using textual descriptions and bounding boxes in real time.
- The Owl ViT's capabilities extend to real-life applications such as image search, robotics, and assistive technology for visually impaired users, highlighting the model's versatile computer vision applications.
Frequently asked questions
Q1. What is zero-shot object detection in Owl ViT?
A. Zero-shot object detection allows Owl ViT to identify objects simply by matching textual descriptions to images, even if it has not been trained on that specific class. This lets the model detect new objects based solely on text cues.
Q2. How does Owl ViT match images to text descriptions?
A. Owl ViT leverages a vision transformer architecture together with CLIP, which relates images to text descriptions using contrastive learning. This allows it to recognize objects based on text queries without prior knowledge of specific object classes.
Q3. What are some real-life applications of Owl ViT?
A. Owl ViT can find useful applications in image search, robotics, and assistive technology for visually impaired users, since it can describe and locate objects based on text input.
The media shown in this article is not the property of Analytics Vidhya and is used at the author's discretion.