The image segmentation landscape has been profoundly transformed by the introduction of the Segment Anything Model (SAM), a model known for its remarkable zero-shot segmentation capabilities. Its adoption across a wide range of applications, from augmented reality to data annotation, underscores its usefulness. However, SAM is computationally intensive: its image encoder alone requires 2973 GMACs per image at inference, which has limited its use in scenarios where time is of the essence.
The quest to improve SAM's efficiency without sacrificing its formidable accuracy has led to models such as MobileSAM, EdgeSAM, and EfficientSAM. Unfortunately, while these models reduce computational cost, they suffer noticeable drops in performance, as shown in Figure 1. EfficientViT-SAM addresses this challenge by using the EfficientViT architecture to revamp the SAM image encoder. The adaptation preserves SAM's lightweight prompt encoder and mask decoder, culminating in two variants: EfficientViT-SAM-L and EfficientViT-SAM-XL. These models offer a nuanced balance between speed and segmentation accuracy and are trained end-to-end on the comprehensive SA-1B dataset.
EfficientViT, the core of this innovation, is a vision transformer optimized for high-resolution dense prediction tasks. Its multi-scale linear attention module replaces traditional softmax attention with ReLU linear attention, reducing computational complexity from quadratic to linear in the number of tokens. This efficiency is achieved without compromising the model's ability to capture global context and learn multi-scale features, a fundamental improvement detailed in the original EfficientViT publication.
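To make the complexity argument concrete, here is a minimal sketch (not the authors' implementation) contrasting standard softmax attention with ReLU linear attention. Tensor shapes, normalization, and variable names are illustrative; the key point is that associativity lets the key-value summary be computed first, so cost grows linearly with the number of tokens.

```python
import torch

def softmax_attention(q, k, v):
    # O(N^2 * d): the full N x N similarity matrix is materialized.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def relu_linear_attention(q, k, v, eps=1e-6):
    # O(N * d^2): compute (K^T V) first, a d x d summary, so cost is linear in N.
    q, k = torch.relu(q), torch.relu(k)
    kv = k.transpose(-2, -1) @ v  # (d, d) summary of keys and values
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / normalizer

# Example: 4096 tokens (a 64x64 feature map) with 32-dimensional heads.
q = torch.randn(1, 4096, 32)
k = torch.randn(1, 4096, 32)
v = torch.randn(1, 4096, 32)
out = relu_linear_attention(q, k, v)  # shape (1, 4096, 32)
```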
The architecture of EfficientViT-SAM, particularly the EfficientViT-SAM-XL variant, is structured into five stages. The early stages employ convolution blocks, while the later stages integrate EfficientViT modules; their outputs are fused and fed to the SAM head, as illustrated in Figure 2. This design ensures seamless fusion of features at multiple scales, improving the model's segmentation ability.
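The sketch below illustrates, in schematic form only, how outputs from the last stages of a hierarchical backbone could be resized and fused into the single-scale embedding a SAM-style mask decoder expects. The channel counts, the 64x64 target resolution, and fusion by addition are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseNeck(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 convolutions project each stage's output to a common channel width.
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats, target_size=(64, 64)):
        # Resize every stage's feature map to the target resolution, then sum.
        fused = sum(
            F.interpolate(p(f), size=target_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        )
        return fused  # (B, 256, 64, 64), ready for a SAM-style decoder

feats = [torch.randn(1, 256, 128, 128),
         torch.randn(1, 512, 64, 64),
         torch.randn(1, 1024, 32, 32)]
embedding = FuseNeck()(feats)
```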
The EfficientViT-SAM training process is as rigorous as it is innovative. It begins by distilling SAM-ViT-H image embeddings into EfficientViT, after which the model undergoes end-to-end training on the SA-1B dataset. This phase mixes box and point prompts and uses a combination of focal loss and dice loss to tune model performance. The training strategy, including the choice of prompts and loss functions, ensures that EfficientViT-SAM not only learns efficiently but also adapts to a variety of segmentation scenarios.
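The focal plus dice combination is a standard recipe for mask supervision; a minimal sketch is shown below. The 20:1 weighting mirrors the original SAM setup and is used here purely as an illustrative default, not as the exact weighting from the EfficientViT-SAM paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard sigmoid focal loss, averaged over all pixels.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    # Overlap-based loss that is robust to foreground/background imbalance.
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    intersection = (p * t).sum(-1)
    return (1 - (2 * intersection + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def mask_loss(logits, targets, focal_weight=20.0, dice_weight=1.0):
    # Weighted sum of the two terms; weights here are illustrative assumptions.
    return focal_weight * focal_loss(logits, targets) + dice_weight * dice_loss(logits, targets)
```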
The excellence of EfficientViT-SAM is not merely theoretical; its empirical performance in runtime efficiency and zero-shot segmentation is compelling. The model delivers a 17- to 69-fold speedup over SAM and maintains a significant performance advantage over other acceleration efforts despite having more parameters, as shown in Table 1.
EfficientViT-SAM's zero-shot segmentation capability is evaluated through meticulous testing on the COCO and LVIS datasets using single-point and box-prompted instance segmentation. The model's performance, detailed in Tables 2 and 4, shows superior segmentation accuracy, particularly when additional point prompts or ground-truth bounding boxes are used.
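For readers unfamiliar with this evaluation protocol, the sketch below shows single-point and box prompting using the original segment-anything predictor API. The released EfficientViT-SAM code exposes a similar promptable interface, but the class names, checkpoint path, and coordinates below are illustrative assumptions, not taken from that repository.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM-style model and wrap it in a predictor (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder; use a real RGB image
predictor.set_image(image)

# Single-point prompt: one foreground click, e.g. the center of a ground-truth box.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),  # 1 = foreground click
    multimask_output=True,
)

# Box prompt: a ground-truth or detector-predicted bounding box in xyxy format.
masks, scores, _ = predictor.predict(
    box=np.array([100, 80, 520, 400]),
    multimask_output=False,
)
```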
Furthermore, evaluation on the Segmentation in the Wild benchmark validates the robustness of EfficientViT-SAM's zero-shot segmentation across diverse datasets, with results summarized in Table 3. Qualitative results, depicted in Figure 3, highlight the ability of EfficientViT-SAM to segment objects of widely varying sizes, affirming its versatility and segmentation capacity.
In conclusion, EfficientViT-SAM successfully merges the speed of EfficientViT with the SAM architecture, resulting in a substantial efficiency gain without sacrificing performance. This opens possibilities for broader applications of powerful segmentation models, even in resource-constrained scenarios. To facilitate and encourage further research and development, pre-trained EfficientViT-SAM models have been made open source.
Check out the paper. All credit for this research goes to the researchers of this project.
Vineet Kumar is a Consulting Intern at MarktechPost. He is currently pursuing his bachelor's degree at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast, passionate about research and the latest advances in deep learning, computer vision, and related fields.