Finding objects in images has been a long-standing task in machine vision. Object detection algorithms locate objects by drawing a box around them, while segmentation algorithms aim to delineate object boundaries precisely. Image segmentation divides an image into regions or objects based on their semantic meaning or visual characteristics. It is crucial in a wide range of applications, including object recognition, scene understanding, autonomous driving, medical imaging, and more.
Over the years, numerous methods and algorithms have been developed to address this challenging problem. Traditional approaches relied on handcrafted features, while more recent advances are driven by deep learning models. These modern methods have shown remarkable progress, achieving state-of-the-art performance and enabling new possibilities in image understanding and analysis.
However, these models have a fundamental limitation: they are restricted to the object categories seen in their training set and cannot segment objects outside of it.
Then came the Segment Anything Model (SAM), which completely changed the image segmentation game. It has emerged as an innovative vision model capable of segmenting any object within an image based on user interaction cues. Built on a Transformer architecture and trained on the extensive SA-1B dataset, SAM has demonstrated remarkable performance and opened the door to an exciting new task known as Segment Anything. With its strong generalization ability, it has the potential to become the cornerstone of a wide range of future vision applications.
However, not everything in SAM is perfect. This kind of power comes at a cost, and for SAM, that cost is complexity. The model is computationally too demanding to apply in many practical scenarios. Most of this computational burden comes from the Transformer architecture, particularly the Vision Transformer (ViT) image encoder that forms the core of SAM.
Is there a way to make SAM faster? The answer is yes, and it’s called FastSAM.
FastSAM is proposed to meet the high demand for industrial applications of the segment anything model. It accelerates SAM by a significant margin and makes it usable in practical scenarios.
FastSAM decouples the segment anything task into two sequential stages: all-instance segmentation and prompt-guided selection. The first stage employs a convolutional neural network (CNN)-based detector to produce segmentation masks for every instance in the image. In the second stage, it outputs the region of interest corresponding to the user’s prompt. By taking advantage of the computational efficiency of CNNs, FastSAM demonstrates that a real-time segment anything model is achievable without compromising performance quality.
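To make the decoupling concrete, here is a minimal, self-contained sketch. It is not the FastSAM implementation; the helper functions (`point_prompt`, `box_prompt`) and the dummy masks are hypothetical, and the sketch only illustrates how, once an all-instance stage has produced masks, prompt-guided selection reduces to a cheap lookup or matching step.

```python
import numpy as np

def point_prompt(masks, point):
    """Return the first mask (from the all-instance stage) containing the prompt point."""
    y, x = point
    for mask in masks:
        if mask[y, x]:
            return mask
    return None

def box_prompt(masks, box):
    """Return the mask whose bounding box best overlaps the prompt box (IoU matching)."""
    def bbox_of(mask):
        ys, xs = np.where(mask)
        return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    box = np.asarray(box)
    return max(masks, key=lambda m: iou(bbox_of(m), box))

# Stage 1 (assumed already done): an all-instance detector such as YOLOv8-seg
# would return one boolean mask per detected instance. Dummy masks stand in here.
masks = [np.zeros((480, 640), dtype=bool) for _ in range(3)]
masks[0][100:200, 150:300] = True
masks[1][250:400, 50:180] = True
masks[2][300:450, 400:600] = True

# Stage 2: prompt-guided selection picks the region of interest.
selected_by_point = point_prompt(masks, point=(150, 200))          # (row, col) inside the first mask
selected_by_box = box_prompt(masks, box=[40, 240, 190, 410])        # [x1, y1, x2, y2] near the second mask
```

The design point is that all the heavy lifting happens once, in the all-instance stage; each prompt afterwards only requires a lightweight selection over precomputed masks, which is what makes the interactive part essentially free.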
FastSAM is based on YOLOv8-seg, an object detector equipped with an instance segmentation branch inspired by the YOLACT method. By training this CNN detector on only 2% of the SA-1B dataset, FastSAM achieves performance comparable to SAM while drastically reducing computational demands. The proposed approach proves effective in multiple downstream segmentation tasks, including object proposal generation on MS COCO, where FastSAM outperforms SAM in average recall over 1,000 proposals while running 50 times faster on a single NVIDIA RTX 3090.
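For readers unfamiliar with the metric behind that comparison, the sketch below shows a simplified version of average recall over a fixed proposal budget (AR@1000). It assumes ground-truth and proposal masks are available as boolean arrays, and it omits details of the official COCO protocol (greedy one-to-one matching, area breakdowns), so it is illustrative rather than a drop-in evaluator.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / (union + 1e-9)

def average_recall(gt_masks, proposal_masks, max_proposals=1000,
                   iou_thresholds=np.arange(0.5, 1.0, 0.05)):
    """Simplified AR@k: recall averaged over IoU thresholds 0.5:0.05:0.95.

    Each ground-truth mask is matched to its best-overlapping proposal among the
    top-k; recall at threshold t is the fraction of ground truths with best IoU >= t.
    """
    proposals = proposal_masks[:max_proposals]
    best_ious = np.array([
        max((mask_iou(gt, p) for p in proposals), default=0.0)
        for gt in gt_masks
    ])
    recalls = [(best_ious >= t).mean() for t in iou_thresholds]
    return float(np.mean(recalls))
```

A higher AR@1000 means the proposal set covers more of the ground-truth objects with sufficiently accurate masks, which is why it is a natural yardstick for comparing FastSAM's all-instance outputs against SAM's.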
Check out the Paper. All credit for this research goes to the researchers of this project.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He received his Ph.D. in 2023 from the University of Klagenfurt, Austria, with his dissertation titled “Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning.” His research interests include deep learning, computer vision, video encoding, and multimedia networking.