Introduction
Image classification has found great real-life application by introducing better computer vision models and technology with more accurate results. There are many use cases for these models, but zero-shot classification and image pairs are some of the most popular applications of these models.
Google's SigLIP image classification model is a great example and comes with an important performance benchmark that makes it special. It is an image embedding model that is based on a CLIP framework but with even a better loss function.
This model also works solely with image and text pairs, comparing them and providing vector representation and probabilities. Siglip enables image classification into smaller matches while supporting larger scaling. What makes the difference for Google's siglip is the sigmoid loss that places it one level above CLIP. That means the model is trained to work with image and text pairs individually and not entirely to see which matches the most.
Learning objectives
- Understand the framework and model overview of SigLIP.
- Understanding the state-of-the-art performance of SigLIP.
- Learn more about the sigmoid loss function
- Learn about some real-life applications of this model.
This article was published as part of the Data Science Blogathon.
Model architecture of Google's SigLip model
This model uses a similar framework to CLIP (Contrastive Learning Image Pretraining) but with a small difference. Siglip is a multi-modal model computer vision system that gives you an edge for better performance. It uses a vision transform encoder for images, which means that images are divided into patches before being linearly embedded into vectors.
On the other hand, Siglip uses a transformer encoder for text and converts the input text sequence into dense embeddings.
Therefore, the model can take images as inputs and then perform zero-shot image classification. You can also use text as input, as it can be useful for search queries and image retrieval. The result would be image and text similarity scores to provide certain images through descriptions as required by certain tasks. Another possible outcome is input image and text probabilities, also known as zero-shot classification.
Another part of this model architecture is its language learning capabilities. As mentioned above, the contrastive learning image pre-training framework is the backbone of the model. However, it also helps align the image and text representation.
Inference streamlines the process and users can achieve great performance with the core tasks, namely zero-shot classification and image and text similarity scores.
What to Expect: SigLIP Scaling and Performance Insights
A change in the architecture of this model comes with a few things. This sigmoid loss opens up the possibility of further scaling with lot size. However, there is still a lot to be done in terms of performance and efficiency compared to the standards of other similar CLIP models.
The latest research aims to optimize the shape of this model, including the SoViT-400m. It would be interesting to see how its performance compares to other CLIP-type models.
Running Inference with SigLIP: Step-by-Step Guide
This is how you run inference with your code in just a few steps. The first part is to import the necessary libraries. You can enter the image using a link or upload a file from your device. It then requests its output using 'logits' and can perform tasks that check the scores and probability of text-image similarity. This is how they start;
Importing required libraries
from transformers import pipeline
from PIL import Image
import requests
This code imports the libraries needed to load and process images and perform tasks using pre-trained models obtained from HF. The PIL works to load and manipulate the image while the pipeline from the transformer library streamlines the inference process.
Together, these libraries can retrieve an image from the Internet and process it using a machine learning model for tasks such as classification or detection.
Loading the pretrained model
This step initializes the zero-shot image classification task using the transformer library and starts the process by loading the pre-trained data.
# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")
Preparing the image
This code loads the uploaded image from your local file using the PIL function. You can store the image and get 'image_path' to identify it in your code. Then 'image.open' function helps to read it.
# load image
image_path="/pexels-karolina-grabowska-4498135.jpg"
image = Image.open(image_path)
Alternatively, you can use the image URL as shown in the code block below;
url="https://images.pexels.com/photos/4498135/pexels-photo-4498135.jpeg"
response = requests.get('https://images.pexels.com/photos/4498135/pexels-photo-4498135.jpeg', stream=True)
Production
The model chooses the label with the highest score as the one that best matches the image, “a box.”
# inference
outputs = image_classifier(image, candidate_labels=("a box", "a plane", "a remote"))
outputs = ({"score": round(output("score"), 4), "label": output("label") } for output in outputs)
print(outputs)
This is what the output representation looks like in the image below;
The box label shows a higher score of 0.877, while the other is not even close.
Performance Benchmarks: SigLIP vs. Other Models
Sigmoid makes the difference in the architecture of this model. The original clip model uses the softmax function, which makes it difficult to define one class per image. The sigmoid loss function eliminates this problem, as Google researchers found a way to fix it.
Below is a typical example;
With CLIP, even when the image class is not present in the labels, the model still tries to return a result with a prediction that would be inaccurate. However, SigLIP eliminates this problem with a better loss function. If you try the same tasks, as long as the possible image description is not in the tag, you will have all the results, providing greater accuracy. You can check it in the image below;
With an image of a frame in the input, you get an output of 0.0001 for each label.
Application of the SigLIP model
There are some major uses of this model, but these are some of the most popular potential applications that users can employ;
- You can create a search engine for users to find images based on text descriptions.
- Image captioning is another valuable use of SigLIP, as users can caption images and analyze them.
- Visual answering of questions is also a brilliant use of this model. You can tune the model to answer questions about images and their content.
Conclusion
Google SigLIP offers a significant improvement in image classification with the Sigmoid feature. This model improves accuracy by focusing on individual image-text pair matches, enabling better performance on zero-shot classification tasks.
SigLIP's ability to scale and provide greater precision makes it a powerful tool in applications such as image search, captioning, and visual question answering. Its innovations position it as a leader in the field of multimodal models.
Key takeaway
- Google's SigLIP model improves on other CLIP-type models by using a sigmoid loss function, which improves accuracy and performance in classifying zero-shot images.
- SigLIP excels at tasks that involve comparing image and text pairs, enabling more accurate image classification and offering capabilities such as image captioning and visual question answering.
- The model supports scalability for large batch sizes and is versatile in various use cases such as image retrieval, classification, and search engines based on text descriptions.
Resources
Frequently asked questions
A. SigLIP uses a sigmoid loss function, which allows matching of individual image and text pairs and leads to better classification accuracy than CLIP's softmax approach.
A. SigLIP has applications for tasks such as image classification, image captioning, image retrieval through text descriptions, and visual question answering.
A. SigLIP classifies images by comparing them to the provided text labels, even if the model has not been trained on those specific labels, making it ideal for zero-shot classification.
A. The sigmoid loss function helps avoid the limitations of the softmax function by independently evaluating each image and text pair. This results in more accurate predictions without forcing a single class output.
The media shown in this article is not the property of Analytics Vidhya and is used at the author's discretion.