Today’s paper walkthrough is going to be visual! We will analyze Segment Anything, a paper by Meta’s AI research team that made headlines not only in the research community but also among all sorts of deep learning practitioners and advocates.
Segment Anything introduces the task of promptable segmentation, presents the Segment Anything Model (SAM), and details the generation of a new publicly available dataset of 11 million images containing more than 1 billion masks. SAM has been widely adopted by the community and has led to new state-of-the-art foundation models such as Grounded-SAM, which combines Grounding DINO with SAM.
Paper: Segment Anything
Code: https://github.com/facebookresearch/segment-anything
First Published: 5 Apr. 2023
Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick
Category: segmentation, zero-shot prediction, computer vision, prompting, large-scale
- Context & Background
- SAM — Segment Anything Model
- SA-1B — Dataset with 1 Billion Masks
- Experiments and Ablations
- Conclusion
- Further Readings & Resources
The authors of Segment Anything made a clear statement: “[…] our goal is to build a foundation model for image segmentation.” Foundation models originated from the great success of Natural Language Processing (NLP), where models are trained on a laaaarge scale in a self-supervised fashion. These models usually perform very well at zero-shot tasks, meaning they can solve tasks different from those they were trained on and perform reasonably well or even better than their supervised competitors. In recent years, many researchers have worked on bringing the success of NLP foundation models to other domains such as computer vision.
Models such as CLIP and GLIP made it possible to condition an image classification or object detection task on text prompts, rather than a fixed set of classes. Other models, such as BYOL or DINO, came up with different techniques to learn semantically rich representations of input images, which is one of the key requirements for many computer vision applications.
The Segment Anything paper aims to:
- Enable zero-shot segmentation by prompting
- Train a large-scale model (SAM) as a demonstrator
- Collect and release the largest publicly available dataset for segmentation.
But why is zero-shot performance so important? — The answer is two-fold. First, computer vision models have traditionally been trained in a supervised fashion, which requires not only data but also a lot of ground truth labels. Collecting this data is extremely time-consuming and costly. Second, the classes a model can predict are limited to the fixed set of classes used for training. If you would like to add a new class to your model, you would need to first collect and label additional data and then retrain the model.
How is it possible to prompt a segmentation model? — You might be familiar with text prompting from models like ChatGPT, CLIP or GLIP. While SAM was in principle also tested with text prompts, it is mainly prompted with masks, points, boxes or point grids, as shown in the image below.
Having put SAM into context, let’s now switch gears and take a closer look at the Segment Anything Model, aka SAM.
The Segment Anything Model (SAM) is a multi-modal model that takes an image and one or more prompts as input and outputs a valid segmentation mask. The model consists of three main modules: image encoder, prompt encoder and mask decoder.
SAM can be prompted with a mask, a set of points, a bounding box or text, or any combination of those.
NOTE: Even though the paper mentions and experiments with text prompts, this feature has not yet been released (as of September 2023), neither in the official implementation nor in the SAM demo.
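To make the prompting interface concrete, here is a minimal sketch using the official `segment-anything` package. The checkpoint path, the dummy image and the coordinates are placeholders you would replace with your own data:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (placeholder path) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)                        # image: HxWx3 RGB uint8 numpy array

# Point prompt: one foreground click (label 1 = foreground, 0 = background).
masks, scores, low_res_logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,                        # returns 3 candidate masks + scores
)

# Box prompt: (x0, y0, x1, y1) in pixel coordinates.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 380]),
    multimask_output=False,                       # a box is less ambiguous, 1 mask is enough
)
```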
Image Encoder — Outputs an image embedding for a given input image. SAM implements and adapts a pre-trained ViT-H/16 masked autoencoder, a relatively large model with strong performance.
Prompt Encoder — Translates sparse prompts (i.e. points, boxes and text) into embedding vectors. Text prompts are converted into text embeddings using CLIP before being fed into the prompt encoder. Dense prompts (i.e. masks) are simply downsampled with strided convolutions and added to the image embeddings. All embeddings are then fed into the final stage: the mask decoder.
Mask Decoder — Takes a set of image embeddings (optionally containing the dense mask embeddings) and a set of prompt embeddings and outputs a valid segmentation mask.
There are two more details we should address: ambiguities of prompts and performance.
In a nutshell, the less context a prompt contains, the more ambiguous it is and the more difficult it is for the model to provide the correct output. For text prompts, we have seen this connection between the specificity of the input text and the model’s performance in CLIP and GLIP. Similarly, providing a single point as input might result in a variety of possible masks. For that reason, SAM outputs a set of three masks corresponding to the object level, the part level and the sub-part level of a valid mask, as indicated in the image below.
The second detail I want to mention is performance in terms of inference speed. Did you notice that the image encoder is by far the largest sub-module in SAM? Well, that’s an unfair question because I did not tell you so far, but SAM is designed to produce semantically rich image embeddings (which usually requires a large model) and to then act upon these embeddings with a light-weight prompt encoder and a light-weight mask decoder. The good thing: the image encoder only needs to be run once per image, and the model can then be prompted multiple times using the same image embedding. This allows SAM to run in a browser, taking only ~50 ms to predict a mask for a given prompt (after the image embedding has been calculated).
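The sketch below illustrates this usage pattern with the official `SamPredictor` API: the expensive embedding is computed once by `set_image`, after which many prompts can be served cheaply. The checkpoint path and the dummy image are placeholders, and the measured time is purely illustrative, not a reproduction of the ~50 ms browser figure:

```python
import time
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_b"](checkpoint="path/to/sam_vit_b.pth").to(device)
predictor = SamPredictor(sam)

image = np.zeros((1080, 1920, 3), dtype=np.uint8)            # stand-in for a real RGB image
predictor.set_image(image)                                   # runs the large image encoder once

point_prompts = np.random.randint(0, 1000, size=(20, 1, 2))  # 20 independent single-point prompts
start = time.perf_counter()
for p in point_prompts:
    masks, scores, _ = predictor.predict(
        point_coords=p, point_labels=np.array([1]), multimask_output=True
    )                                                        # only prompt encoder + mask decoder run here
elapsed_ms = (time.perf_counter() - start) / len(point_prompts) * 1e3
print(f"~{elapsed_ms:.1f} ms per prompt on {device}")
```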
Let’s have a closer look at the light-weight mask decoder. It takes the image embeddings and prompt embeddings as input and outputs a set of masks with corresponding scores. Internally, two consecutive decoder blocks perform a combination of self-attention and cross-attention to generate a strong dependence between the image and the prompts. A simple up-sampling network in combination with another cross-attention block generates the masks and the scores.
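The following is a rough, runnable PyTorch sketch of that two-way attention pattern. It is my own simplification for illustration, not the official decoder: dimensions, token counts and the upsampling head only mimic the described structure at the shape level.

```python
import torch
import torch.nn as nn

class TwoWayBlockSketch(nn.Module):
    """Simplified sketch of one decoder block: self-attention on the prompt tokens,
    cross-attention tokens -> image, and cross-attention image -> tokens,
    so both streams are updated (not the official implementation)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, tokens, image):          # tokens: (B, N, C), image: (B, HW, C)
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]
        tokens = tokens + self.cross_t2i(tokens, image, image)[0]   # tokens attend to image
        tokens = tokens + self.mlp(tokens)
        image = image + self.cross_i2t(image, tokens, tokens)[0]    # image attends to tokens
        return tokens, image

# Two consecutive blocks, followed by a small upsampling head for the mask features.
tokens = torch.randn(1, 5, 256)               # prompt + output tokens
image = torch.randn(1, 64 * 64, 256)          # flattened 64x64 image embedding
for block in [TwoWayBlockSketch(), TwoWayBlockSketch()]:
    tokens, image = block(tokens, image)
upsample = nn.Sequential(
    nn.ConvTranspose2d(256, 64, 2, stride=2), nn.GELU(),
    nn.ConvTranspose2d(64, 32, 2, stride=2),
)
mask_features = upsample(image.transpose(1, 2).reshape(1, 256, 64, 64))
print(tokens.shape, mask_features.shape)      # (1, 5, 256) and (1, 32, 256, 256)
```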
The second major contribution of Segment Anything is the creation and release of a large-scale dataset for segmentation. It contains 11 million high-resolution, licensed images with roughly 1.1 billion masks. While the images in the original version of the dataset measure 3300×4950 pixels on average, the released version is downsampled to 1500 pixels on the shortest edge. The dataset is diverse in terms of scenes and the number of masks per image, which ranges from fewer than 50 to more than 500.
The dataset has been created in a three-stage data engine which combines manual labels annotated by humans with automatic labels generated by SAM.
Stage 1: Assisted-manual Stage — A team of professional labelers labeled images assisted by an early version of SAM trained on common segmentation datasets. They were asked to label the most prominent objects and were encouraged to proceed to the next image after 30 seconds. At the end of this stage, SAM was retrained with the new labels (a total of 120k images with 4.3M masks).
Stage 2: Semi-automatic Stage — The goal of this stage was to increase the diversity of the masks by first letting SAM predict some masks and then letting the labelers annotate the missing, less prominent objects. At the end of this stage, SAM was retrained again, now including the new samples (a total of 300k images with 10.2M masks).
Stage 3: Fully Automatic Stage — In this stage, annotation was fully automatic. SAM was prompted with a 32×32 grid of points to generate masks, and the results were filtered and post-processed.
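The official repository exposes this fully automatic pipeline as `SamAutomaticMaskGenerator`. A minimal sketch of how it is used (checkpoint path and image are placeholders; the thresholds shown are the library defaults):

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h.pth")
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # the 32x32 point grid mentioned above
    pred_iou_thresh=0.88,          # keep only confident masks
    stability_score_thresh=0.95,   # keep only stable masks
)

image = np.zeros((1500, 2250, 3), dtype=np.uint8)   # stand-in for a real RGB image
masks = mask_generator.generate(image)
# Each entry is a dict with keys such as "segmentation", "area", "bbox" and "predicted_iou".
print(len(masks))
```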
Dataset Analysis
Now let’s take a closer look at some of the analyses of the SA-1B dataset presented in the paper.
In a first evaluation, the authors created a normalized distribution of the masks’ center points. Interestingly, these distributions are subject to a photographer’s bias, meaning most photos center the object of interest around the image center and along the main axes.
One of SA-1B’s major strengths is the high number of masks per image compared to other datasets (Fig. 7 left). This also implies that SA-1B contains many small masks (Fig. 7 center). Comparing the masks’ concavity, which is a measure of complexity, SA-1B is very similar to other datasets that have been manually labeled (Fig. 7 right).
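As a side note, concavity can be estimated from a binary mask by comparing its area to the area of its convex hull. The snippet below is an illustrative version of such a measure, not necessarily the exact definition used in the paper:

```python
import numpy as np
from skimage.morphology import convex_hull_image

def concavity(mask: np.ndarray) -> float:
    """Illustrative complexity measure: 1 - (mask area / convex-hull area).
    Roughly 0 for convex shapes, larger for more concave shapes."""
    hull = convex_hull_image(mask)
    return 1.0 - mask.sum() / max(hull.sum(), 1)

# A ring is far more concave than a filled square.
yy, xx = np.mgrid[:100, :100]
ring = ((xx - 50) ** 2 + (yy - 50) ** 2 < 40 ** 2) & ((xx - 50) ** 2 + (yy - 50) ** 2 > 30 ** 2)
square = (np.abs(xx - 50) < 30) & (np.abs(yy - 50) < 30)
print(concavity(ring), concavity(square))   # high vs. ~0
```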
A strong focus is put on responsible AI (RAI), where biases towards certain groups of people are not only analyzed but also mitigated. As Fig. 8 shows, most of the world’s countries are represented with more than 1000 images, and the top three countries are from different parts of the world. While low-income countries are still relatively underrepresented (0.9% of all samples), in absolute terms this still amounts to over 9M masks, more than in other segmentation datasets.
The authors further investigated the performance discrepancy across perceived gender presentation, perceived age group and perceived skin tone. They report the mean IoU (Intersection over Union) between the predicted masks and the ground truth masks together with a 95% confidence interval. SAM is prompted with either a single point or three points. The key message is that results are very similar within each group (with overlapping confidence intervals), which shows that no member of a group is favored. The only exception is older people in the perceived age group.
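For readers who want to run this kind of analysis on their own predictions, a simple recipe is to compute per-sample IoUs and bootstrap a 95% confidence interval. This is an illustrative implementation on random masks; the paper does not spell out how its intervals are computed:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def mean_iou_with_ci(ious: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    """Mean IoU with a bootstrapped 95% confidence interval."""
    rng = np.random.default_rng(seed)
    boot = rng.choice(ious, size=(n_boot, len(ious)), replace=True).mean(axis=1)
    return ious.mean(), np.percentile(boot, [2.5, 97.5])

# Random masks stand in for predicted / ground-truth masks of one subgroup.
ious = np.array([iou(np.random.rand(64, 64) > 0.5, np.random.rand(64, 64) > 0.5) for _ in range(100)])
print(mean_iou_with_ci(ious))
```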
Segment Anything provides a number of experiments, mainly focused on zero-shot performance, since this was the authors’ main goal: to build a promptable segmentation model with strong zero-shot performance. We also know from other models such as CLIP and GLIP that prompt tuning is nearly as effective as fine-tuning a model in terms of performance.
To perform the experiments, a suite of 23 diverse datasets was compiled. It contains samples from a wide variety of data distributions as shown in Fig. 10.
Zero-Shot Single Point Valid Mask Evaluation
Recall that zero-shot means the model was never trained on the data it is exposed to during the evaluation. Also recall that single point prompting is quite a difficult task due to its ambiguity as depicted in Fig.3.
In this first experiment, the authors compared SAM against RITM, a strong interactive segmenter which, according to the authors, performed best on their benchmarks.
Remember that SAM outputs three different masks with associated scores when prompted with a single point. In this experiment, the mask with the highest score is selected for the evaluation. Since this selection is sometimes wrong, the authors also evaluate the best mask, which is determined by comparing the predictions to the ground truth masks and selecting the one with the highest overlap. These are the “oracle” predictions.
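In code, the two selection strategies could look like the following sketch, where `masks` and `scores` stand in for the three candidates returned for a single-point prompt:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0

def select_masks(masks: np.ndarray, scores: np.ndarray, gt: np.ndarray):
    """Illustrative selection: 'standard' picks the mask SAM scores highest,
    'oracle' picks the mask with the highest IoU against the ground truth."""
    standard = masks[np.argmax(scores)]
    oracle = masks[np.argmax([iou(m, gt) for m in masks])]
    return standard, oracle

# Random placeholders for masks/scores as returned by predict(..., multimask_output=True).
masks = np.random.rand(3, 64, 64) > 0.5
scores = np.array([0.7, 0.9, 0.8])
gt = np.random.rand(64, 64) > 0.5
standard, oracle = select_masks(masks, scores, gt)
print(iou(standard, gt), iou(oracle, gt))   # the oracle IoU is at least as high
```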
SAM outperforms RITM in 16 of the 23 datasets in zero-shot single point valid mask prediction. When performing oracle predictions, it outperforms RITM in all 23 datasets.
Zero-Shot Text-to-Mask
In this experiment SAM was prompted with text. The authors refer to this feature as a proof of concept and hence neither perform extensive experiments nor release this feature in their official code implementation.
Looking at Fig. 12, you can see that SAM is able to return correct masks even for complex objects like the “beaver tooth grille”. In other cases, the model fails when given only a text prompt; the authors show that providing additional context in the form of a point lets SAM correctly predict either a single wiper or multiple wipers, demonstrating that not only the point but also the text is taken into account for the prediction.
Zero-Shot Edge Detection
Interestingly, SAM can also be used for edge detection, a task it was not designed for, nor did it have access to such data during training.
To predict edge maps, SAM is first prompted with a 16×16 grid of points, resulting in 768 predicted masks (object, part and sub-part for each of the 256 points). The resulting masks are then filtered and post-processed to obtain the edge masks.
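A heavily simplified approximation of this pipeline might look as follows: prompt with a 16×16 grid via the automatic mask generator and take the union of the mask boundaries as the edge map. The paper’s actual post-processing (Sobel filtering of unthresholded mask probabilities plus edge NMS) is more involved, and the checkpoint and image paths here are placeholders:

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h.pth")
generator = SamAutomaticMaskGenerator(sam, points_per_side=16)   # 16x16 point grid

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image path
edges = np.zeros(image.shape[:2], dtype=np.uint8)
for record in generator.generate(image):
    mask = record["segmentation"].astype(np.uint8)
    # Draw the boundary of every predicted mask into a single edge map.
    contours, _ = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    cv2.drawContours(edges, contours, -1, color=255, thickness=1)
cv2.imwrite("edges.png", edges)
```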
As shown in Fig. 13, SAM predicts much more detail than the ground truth. To be fair, if the GT is incomplete or covers a different level of abstraction, this comparison seems a bit unfair to me. But still, the performance is quite good!
Zero-Shot Instance Segmentation
For this experiment, SAM is prompted with the bounding box outputs of a fully supervised ViTDet-H trained on COCO and LVIS. The resulting mask is then fed into SAM again, together with the initial bounding box, to refine the result. A comparison between ViTDet and SAM is shown in Fig. 14.
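A sketch of this two-pass setup with the official API is shown below. The detector boxes are hard-coded placeholders standing in for ViTDet-H outputs, and the checkpoint path and image are placeholders as well:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((800, 1333, 3), dtype=np.uint8)        # stand-in for a COCO/LVIS image
predictor.set_image(image)

detector_boxes = np.array([[50, 60, 400, 500], [420, 80, 900, 640]])  # (x0, y0, x1, y1)
final_masks = []
for box in detector_boxes:
    # First pass: box prompt only.
    _, _, low_res_logits = predictor.predict(box=box, multimask_output=False)
    # Second pass: refine by feeding the low-res mask logits back together with the box.
    masks, scores, _ = predictor.predict(
        box=box, mask_input=low_res_logits, multimask_output=False
    )
    final_masks.append(masks[0])
print(len(final_masks))
```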
Two things to note here. First, if you have a look at COCO and LVIS, you will find that the masks are not pixel-aligned with the objects. This bias is present in ViTDet, which is why SAM’s masks appear to be of higher quality. How much higher is hard to tell with computed metrics, since the ground truth carries the same bias, and compared to a biased GT, SAM would score worse. Hence, the authors asked humans to visually inspect the masks. Second, why does this elephant only have 3 legs 😅? No matter how hard I try, I can’t see the fourth one…
Ablations
In the ablation section the authors were mainly concerned with scaling the dataset, the number of points used for prompting, and the size of the image encoder (see Fig. 13). Performance is reported in mean IoU.
Interestingly, even though scaling the data and scaling the model size influence the mIoU performance, it saturates. This might indicate either that the model is already so good that there is little room for improvement, or that it is a limitation of their approach.
Segment Anything introduced the promptable Segment Anything Model (SAM) as well as a large-scale segmentation dataset containing over 1 billion masks in 11 million images. Being able to prompt a segmentation model brings a lot of flexibility, such as adapting a trained model to unseen tasks or detecting unknown classes. While some debate whether SAM counts as a foundation model, since it has been trained in a supervised manner, it has shown remarkable results and has been widely adopted.
As you probably know yourself, the field of deep learning is evolving at an unbelievably fast pace. Hence it is no surprise that right after the release of SAM, many new projects built upon its success, further improving the quality of predictions, decreasing inference time, or making the model suitable for edge applications.
Below is a list of interesting resources building upon SAM: