Machine vision relies heavily on segmentation, the process of determining which pixels in an image belong to a particular object, with uses ranging from scientific image analysis to the creation of artistic photos. However, building an accurate segmentation model for a given task typically requires technical experts with access to AI training infrastructure and large volumes of carefully annotated domain data.
Recent Meta AI research features their project called “Segment Anything,” an effort to “democratize segmentation” by introducing a new task, dataset, and model for image segmentation. The release includes the Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B), the largest segmentation dataset published to date.
Previously, there were two main categories of approaches to segmentation problems. The first, interactive segmentation, could segment any class of object but required a human operator to iteratively refine a mask. The second, automatic segmentation, could segment predefined object categories, but it required a large number of manually annotated objects, along with the compute resources and technical expertise needed to train the segmentation model. Neither method offered a universal, fully automatic means of segmentation.
SAM generalizes these two categories of methods. It is a unified model that performs both interactive and automatic segmentation tasks. Thanks to its flexible prompting interface, the model can be applied to a wide range of segmentation tasks simply by designing the appropriate prompt. Additionally, SAM can generalize to new types of objects and images because it is trained on a high-quality, diverse dataset of over a billion masks. Because of this generalizability, practitioners will in many cases no longer have to collect their own segmentation data and fine-tune a model for their use case.
These features allow SAM to transfer to new domains and perform different tasks. Some of SAM’s capabilities are as follows:
- SAM makes it easy to segment objects with a single mouse click or by interactively selecting points to include and exclude. A bounding box can also be used as a prompt for the model.
- For practical segmentation problems, SAM’s ability to output multiple valid masks when a prompt is ambiguous about the target object is a crucial feature.
- SAM can automatically detect and generate masks for all objects in an image.
- After precomputing the image embedding, SAM can instantly generate a segmentation mask for any prompt, allowing real-time interaction with the model (see the sketch after this list).
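These capabilities are exposed through the open-source `segment_anything` Python package that Meta released alongside the model. Below is a minimal sketch of click, box, and fully automatic prompting, assuming the package is installed and a ViT-H checkpoint has been downloaded; the checkpoint and image file names are placeholders.

```python
# Minimal sketch using Meta's open-source segment-anything package.
# The checkpoint and image paths below are placeholders.
import cv2
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, SamPredictor, sam_model_registry

# Load a pretrained SAM backbone (ViT-H) from a local checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read an image (RGB) and precompute its embedding once.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground click (label 1 = include, 0 = exclude).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks when the prompt is ambiguous
)

# Prompt with a bounding box instead (x0, y0, x1, y1).
box_masks, _, _ = predictor.predict(box=np.array([100, 100, 400, 400]))

# Fully automatic mask generation over the whole image, no prompts needed.
mask_generator = SamAutomaticMaskGenerator(sam)
all_masks = mask_generator.generate(image)
```

The same predictor object can serve many prompts against a single image, which is the basis of the interactive, per-click workflow described above.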
The team needed a large and varied dataset to train the model, and SAM itself was used to collect it. In particular, annotators used SAM to interactively annotate images, and the resulting data was then used to update and improve SAM. This loop was run many times to refine both the model and the dataset.
New segmentation masks can be collected extremely quickly with SAM. The interactive annotation tool used by the team takes only about 14 seconds per mask. This is 6.5 times faster than COCO’s fully manual polygon-based mask annotation and 2 times faster than the largest previous data annotation effort, which was also model-assisted, making it possible to collect segmentation data at a large scale.
The reported dataset of over one billion masks could not have been created with interactively annotated masks alone, so the researchers built a data engine to collect the data for SA-1B. This data “engine” has three “gears”. In the first, the model assists human annotators. In the second, fully automatic annotation is combined with human assistance to expand the diversity of masks collected. In the last, fully automatic mask generation allows the dataset to scale, as illustrated in the toy sketch below.
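As a rough illustration only (none of the function names below come from the SAM codebase; they are hypothetical stand-ins for what, in reality, involved human annotators and the SAM model), the three gears can be thought of as a loop like this:

```python
# Toy sketch of the three-gear data engine; all functions here are hypothetical.
def assisted_manual(model, image):
    """Gear 1: the model proposes masks and a human annotator refines them."""
    return [{"image": image, "source": "human+model"}]

def semi_automatic(model, image):
    """Gear 2: confident masks are filled in automatically, humans annotate the rest."""
    return [{"image": image, "source": "auto"}, {"image": image, "source": "human"}]

def fully_automatic(model, image):
    """Gear 3: the model generates all masks on its own, allowing the dataset to scale."""
    return [{"image": image, "source": "auto"}]

def run_data_engine(model, images):
    dataset = []
    for gear in (assisted_manual, semi_automatic, fully_automatic):
        for image in images:
            dataset.extend(gear(model, image))
        # In the real pipeline the model is retrained on the growing dataset
        # between passes, and the whole loop is repeated several times.
    return dataset

print(len(run_data_engine(model=None, images=["img_0", "img_1"])))
```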
The final dataset contains more than 11 million licensed, privacy-respecting images and 1.1 billion segmentation masks, roughly 100 masks per image on average. Human evaluation studies confirmed that the masks in SA-1B are of high quality and diversity, comparable in quality to masks from much smaller, fully manually annotated earlier datasets. SA-1B has 400 times more masks than any existing segmentation dataset.
The researchers trained SAM to produce an accurate segmentation mask in response to a variety of prompts, including foreground/background points, a rough box or mask, free-form text, and so on. They found that the pretraining task and interactive data collection placed particular constraints on the model design: for annotators to use SAM efficiently during annotation, the model must run in real time on a CPU in a web browser.
A lightweight prompt encoder instantly converts any prompt into an embedding vector, while an image encoder produces a one-time embedding for the image. A lightweight decoder then combines the information from these two sources to predict a segmentation mask. Once the image embedding has been computed, SAM can respond to any prompt in a web browser with a segmentation in less than 50 ms.
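This split between a heavy, run-once image encoder and a lightweight prompt encoder plus mask decoder is visible in the same package used above: the expensive work happens in `set_image`, after which each prompt is cheap. A rough timing sketch follows, reusing the `predictor` and `image` set up in the earlier example; actual latencies depend on hardware, and the sub-50 ms figure refers to the in-browser decoder.

```python
import time
import numpy as np

# `predictor` and `image` are assumed to exist from the earlier sketch.
predictor.set_image(image)  # runs the heavy image encoder once and caches the embedding

# Each subsequent prompt only runs the lightweight prompt encoder and mask decoder.
for x, y in [(120, 80), (300, 220), (450, 400)]:
    start = time.perf_counter()
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    print(f"click ({x}, {y}): {len(masks)} masks in {time.perf_counter() - start:.3f}s")
```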
SAM has the potential to drive future applications in a wide variety of fields that require locating and segmenting arbitrary objects in an image. Understanding the visual and textual content of a web page is just one example of how SAM could be integrated into larger AI systems for a general multimodal understanding of the world.
Check out the Paper, Demo, Blog, and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 18k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast and has a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.