The success of universal prompt-based interfaces for LLMs like ChatGPT has highlighted the importance of modern AI models in human-AI interaction, opening up numerous possibilities for further research and development. Visual comprehension tasks have not received as much attention in this context, but new studies are now starting to emerge. One such task is image segmentation, which aims to divide an image into several segments or regions that share visual characteristics, such as color, texture, or object class. Interactive image segmentation has a long history, but segmentation models that can interact with humans through interfaces accepting multiple types of prompts, such as text, clicks, and images, or a combination of these, remain underexplored. Most current segmentation models accept only spatial hints like clicks or scribbles, or refer to target objects through language. Recently, the Segment Anything Model (SAM) introduced support for multiple prompt types, but its interaction is limited to boxes and points, and it does not provide semantic labels as output.
This paper, presented by researchers at the University of Wisconsin-Madison, introduces SEEM, a new approach to image segmentation built around a universal interface and multimodal prompting. The name stands for segmenting everything, everywhere, all at once in an image (in reference to the movie, in case you missed it!). This new model was built with 4 main features in mind: versatility, compositionality, interactivity, and semantic awareness. For versatility, the model accepts inputs such as points, masks, text, boxes, and even a referred region of another image. It can handle any combination of these input prompts, giving it strong compositionality. Interactivity comes from the model’s ability to use memory prompts that interact with the other prompts and retain information from previous segmentation rounds. Finally, semantic awareness refers to the model’s ability to recognize and label the different objects in an image based on their semantic meaning (for example, distinguishing between different types of cars). SEEM can attach open-set semantics to any output segmentation, which means the model can recognize and segment objects that were never seen during training. This is important for real-world applications, where the model will inevitably encounter new, never-before-seen objects.
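To make the idea of open-set semantics concrete, here is a minimal, hypothetical sketch of how a model can assign labels from an arbitrary, user-supplied vocabulary by comparing mask embeddings with text embeddings in a shared visual-semantic space. The embeddings, dimensions, and label list below are illustrative placeholders, not SEEM’s actual implementation.

```python
# Minimal open-vocabulary labeling sketch (illustrative, not SEEM's code):
# each predicted mask embedding is matched against text embeddings of
# arbitrary class names, so classes unseen during training can still be assigned.
import torch
import torch.nn.functional as F

candidate_labels = ["black dog", "river", "cartoon character", "vintage car"]

# Placeholder embeddings in a shared visual-semantic space (dim 256).
text_embeds = F.normalize(torch.randn(len(candidate_labels), 256), dim=-1)
mask_embeds = F.normalize(torch.randn(3, 256), dim=-1)  # 3 predicted masks

# Cosine similarity between every mask and every candidate label.
similarity = mask_embeds @ text_embeds.T                # (3, num_labels)
best = similarity.argmax(dim=-1)
for i, idx in enumerate(best.tolist()):
    print(f"mask {i}: {candidate_labels[idx]} (score={similarity[i, idx].item():.2f})")
```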
The model follows a simple transformer encoder-decoder architecture with an additional text encoder. All prompts are fed into the decoder as queries. The image encoder encodes spatial prompts, such as points, boxes, and scribbles, into visual prompts, while the text encoder converts text queries into textual prompts. The prompts of the 5 different types are then mapped into a joint visual-semantic space, which lets the model generalize to unseen user prompts. Different types of prompts can help each other through cross-attention, so composite prompts can be used for better segmentation results. Furthermore, the authors note that SEEM is efficient to run: during multi-round interactions with a human, the model only needs to run the (heavy) feature extractor once at the start and then runs the (light) decoder with each new prompt.
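The efficiency argument is easiest to see in code. The toy sketch below, with made-up module names and dimensions (it is not the authors’ code), caches the output of a heavy image encoder once and re-runs only a light prompt decoder for each new round of prompts:

```python
# Toy sketch of the SEEM-style interaction loop: the heavy image encoder runs
# once per image, the light decoder re-runs per prompt set. All names and
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class ToyImageEncoder(nn.Module):
    """Stand-in for the heavy backbone: maps an image to a grid of feature tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, image):                     # image: (B, 3, H, W)
        feats = self.conv(image)                  # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)   # (B, N, dim) token grid


class ToyPromptDecoder(nn.Module):
    """Stand-in for the light decoder: cross-attends prompt queries to image tokens."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, image_tokens, prompt_queries):
        # prompt_queries: (B, Q, dim) -- visual and textual prompts already
        # projected into the same joint embedding space.
        attended, _ = self.cross_attn(prompt_queries, image_tokens, image_tokens)
        mask_embed = self.mask_head(attended)                     # (B, Q, dim)
        # Dot product of query embeddings with image tokens gives mask logits.
        return torch.einsum("bqd,bnd->bqn", mask_embed, image_tokens)


encoder, decoder = ToyImageEncoder(), ToyPromptDecoder()
image = torch.randn(1, 3, 224, 224)
image_tokens = encoder(image)                 # heavy step: run once per image

# Multi-round interaction: only the light decoder runs for each new prompt set.
for round_idx in range(3):
    prompts = torch.randn(1, 4, 256)          # e.g. click + box + text queries
    mask_logits = decoder(image_tokens, prompts)
    print(round_idx, mask_logits.shape)       # torch.Size([1, 4, 196])
```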
The researchers conducted experiments showing that their model performs strongly on many segmentation tasks, including closed-set and open-set segmentation of different types (interactive, referring, panoptic, and segmentation with combined prompts). The model was trained for panoptic and interactive segmentation on COCO2017, with 107K segmentation images in total. For referring segmentation, they used a combination of annotation sources (RefCOCO, RefCOCO+, and RefCOCOg). To assess performance, they used standard metrics for all segmentation tasks, such as Panoptic Quality (PQ), Average Precision (AP), and mean Intersection over Union (mIoU). For interactive segmentation, they used the number of clicks required to reach a given IoU.
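For reference, here is how the IoU underlying the mIoU and clicks-to-target-IoU metrics is typically computed for binary masks. This is the standard definition, not code from the paper, and the simulated "click refinement" loop is purely illustrative:

```python
# Standard mask IoU and a toy "number of clicks to reach a target IoU" loop.
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union


# Toy ground truth and three progressively better predictions.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
preds = [np.roll(gt, shift=s, axis=1) for s in (3, 1, 0)]

for n_clicks, pred in enumerate(preds, start=1):
    iou = mask_iou(pred, gt)
    print(f"clicks={n_clicks}, IoU={iou:.2f}")
    if iou >= 0.9:
        print(f"target IoU reached after {n_clicks} clicks")
        break
```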
The results are very promising. The model performs well on all three types of segmentation: interactive, generic, and referring segmentation. For interactive segmentation, its performance is even comparable to SAM (which is trained on 5x more segmentation data), plus it supports a wider range of user input types and provides strong composition capabilities. The user can click or scribble on an input image or enter text, and SEEM produces masks and semantic labels for the objects in that image. For example, the user can type “the black dog” and SEEM will draw the outline around the black dog in the image and add the label “black dog”. The user can also provide a reference image with a river and draw a scribble on the river, and the model can find and label the river in other images. Remarkably, the model shows strong generalization to unseen scenarios such as cartoons, movies, and games. It can label objects in a zero-shot fashion, that is, it can classify instances of classes it has never seen before. It can also accurately segment an object across different frames of a movie, even when the object changes appearance due to strong deformation or blur.
In conclusion, SEEM is a powerful, state-of-the-art segmentation model that can segment everything (all semantics), everywhere (every pixel in the image), all at once (supporting all compositions of prompts). It is a first step toward a universal and interactive interface for image segmentation, bringing machine vision closer to the kind of progress seen in LLMs. Performance is currently limited by the amount of training data and is likely to improve with larger segmentation datasets, such as the one being developed in the concurrent SAM work. Supporting part-based segmentation is another avenue to explore for improving the model.
Check out the Paper and GitHub link.
Nathalie Crevoisier has a BS and MS in Physics from Imperial College London. She spent a year studying applied data science, machine learning, and internet analytics at the Ecole Polytechnique Federale de Lausanne (EPFL) as part of her degree. During her studies, she developed a keen interest in AI, which led her to join Meta (formerly Facebook) as a data scientist after graduation. During her four-year tenure with the company, Nathalie worked across multiple teams, including Ads, Integrity, and Workplace, applying cutting-edge data science and ML tools to solve complex problems affecting billions of users. Seeking more independence and time to keep up with the latest AI discoveries, she recently decided to transition into a freelance career.