The development of neural networks for visual recognition has long been a fascinating but difficult topic in computer vision. Recently proposed vision transformers mimic the human attention process by applying attention operations to each patch or unit so that it can dynamically interact with all other units. Convolutional Neural Networks (CNNs), in turn, build features by applying convolutional filters to every unit of an image or feature map. Both convolution- and transformer-based architectures therefore traverse every unit, such as a pixel or patch on the grid map, to perform operations densely. The sliding windows behind this unit-intensive traversal reflect the assumption that foreground content may appear anywhere across the spatial locations of an image.
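To make the scale of this dense traversal concrete, here is a small illustrative calculation (not taken from the paper, and the 16-pixel patch size is only an assumption) comparing how many units a ViT-style dense model processes at different resolutions against a fixed sparse latent budget:

```python
# Illustrative back-of-the-envelope comparison (not from the paper; the 16-pixel
# patch size is an assumption): the number of units a dense ViT-style model must
# process grows quadratically with resolution, while a sparse latent budget stays fixed.
def dense_token_count(height, width, patch=16):
    return (height // patch) * (width // patch)

for side in (224, 384, 768):
    print(f"{side}x{side}: {dense_token_count(side, side)} dense tokens vs. 49 latent tokens")
# 224 -> 196, 384 -> 576, 768 -> 2304 dense patch tokens, versus a constant 49.
```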
Humans, however, do not need to examine every part of a scene to recognize it. Instead, after broadly locating discriminative regions of interest with a few glances, they can quickly pick out textures, edges, and high-level semantics within those regions. Contrast this with today's vision networks, where it is customary to process every visual unit exhaustively. At higher input resolutions, this dense paradigm incurs exorbitant computing costs, and it does not explicitly reveal what a vision model attends to in an image. In this study, researchers from the National University of Singapore's Show Lab, Tencent AI Lab, and Nanjing University propose a new vision architecture called SparseFormer, which explores sparse visual recognition by closely mimicking human vision.
A lightweight early convolution module in SparseFormer extracts image features from the given image. The SparseFormer then learns to represent the image with latent transformers and a very small number of tokens (e.g., as few as 49) in a latent space. Each latent token carries a region-of-interest (RoI) descriptor that is refined over several stages. To build latent token embeddings iteratively, a latent focusing transformer adjusts token RoIs to focus on foregrounds and sparsely extracts image features according to these RoIs. SparseFormer then feeds the tokens, with their region properties, into a larger and deeper network, a typical transformer encoder in the latent space, for accurate recognition.
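The following is a minimal PyTorch sketch of the flow described above, written only to make the pipeline concrete. The module names, token count, RoI update rule, number of stages and sampling points, and all dimensions are assumptions for illustration, not the authors' implementation:

```python
# A minimal sketch of the described flow, assuming illustrative module names,
# dimensions, and update rules; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseFormerSketch(nn.Module):
    def __init__(self, num_tokens=49, dim=256, num_stages=4, num_classes=1000):
        super().__init__()
        # Lightweight early convolution: produces a dense feature map that is
        # only used as a source for sparse sampling.
        self.early_conv = nn.Sequential(nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.GELU())
        # Learnable latent tokens and their RoI descriptors (cx, cy, w, h) in [0, 1].
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.rois = nn.Parameter(torch.tensor([[0.5, 0.5, 1.0, 1.0]]).repeat(num_tokens, 1))
        # Per stage: a head that adjusts RoIs and a transformer block over latent tokens.
        self.roi_delta = nn.ModuleList(nn.Linear(dim, 4) for _ in range(num_stages))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(num_stages)
        )
        self.head = nn.Linear(dim, num_classes)

    def sample_rois(self, feat, rois, points=6):
        # Sparse feature sampling: bilinear interpolation at a small grid of points
        # inside each token's RoI (cost depends on the points, not the resolution).
        B, T = rois.shape[0], rois.shape[1]
        cx, cy, w, h = rois.unbind(-1)                                # each (B, T)
        lin = torch.linspace(-0.5, 0.5, points, device=feat.device)
        gy, gx = torch.meshgrid(lin, lin, indexing="ij")              # (P, P)
        sx = (cx[..., None, None] + gx * w[..., None, None]) * 2 - 1  # map to [-1, 1]
        sy = (cy[..., None, None] + gy * h[..., None, None]) * 2 - 1
        grid = torch.stack([sx, sy], dim=-1).reshape(B, T, points * points, 2)
        sampled = F.grid_sample(feat, grid, align_corners=False)      # (B, C, T, P*P)
        return sampled.mean(dim=-1).transpose(1, 2)                   # (B, T, C)

    def forward(self, images):                                        # (B, 3, H, W)
        B = images.shape[0]
        feat = self.early_conv(images)
        tokens = self.tokens.unsqueeze(0).expand(B, -1, -1)
        rois = self.rois.unsqueeze(0).expand(B, -1, -1)
        for delta, block in zip(self.roi_delta, self.blocks):
            # Adjust each token's RoI (ideally toward foreground regions) ...
            rois = (rois + 0.1 * torch.tanh(delta(tokens))).clamp(0.0, 1.0)
            # ... pull in features from the refined RoIs, then mix tokens.
            tokens = tokens + self.sample_rois(feat, rois)
            tokens = block(tokens)                                    # operates only on latent tokens
        return self.head(tokens.mean(dim=1))

logits = SparseFormerSketch()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```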
Transformer operations are performed only on this restricted set of tokens in the latent space. Because the number of latent tokens is extremely small and the feature sampling procedure is sparse (i.e., based on direct bilinear interpolation), it is appropriate to call the architecture a sparse solution for visual recognition. Except for the initial convolution component, which is lightweight by design, the overall computing cost of SparseFormer is almost independent of the input resolution. Additionally, SparseFormer can be trained end to end on classification signals alone, without any additional location priors.
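A quick standalone check of that resolution claim: bilinear sampling with torch.nn.functional.grid_sample does a fixed amount of interpolation work per sampled point, so the latent-space cost is set by the token and sampling-point budget rather than by the feature-map size. The token and point counts below are illustrative assumptions:

```python
# Illustrative standalone check (assumed sizes, not from the paper): the cost of
# sparse bilinear sampling is set by how many points are sampled, not by how
# large the dense feature map is, so latent-space compute stays roughly flat
# as input resolution grows.
import torch
import torch.nn.functional as F

num_points = 49 * 36                              # e.g., 49 tokens x 36 sampling points each
grid = torch.rand(1, 1, num_points, 2) * 2 - 1    # normalized sample coordinates in [-1, 1]

for side in (56, 112, 224):                       # feature maps from increasingly large inputs
    feat = torch.randn(1, 256, side, side)
    out = F.grid_sample(feat, grid, align_corners=False)
    print(side, tuple(out.shape))                 # always (1, 256, 1, 1764): same sampling work
```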
SparseFormer aims to investigate an alternative paradigm for vision modeling as a first step toward sparse visual recognition, rather than chasing state-of-the-art results with bells and whistles. On the challenging ImageNet classification benchmark, SparseFormer still achieves very encouraging results, comparable to dense counterparts but at a reduced computational cost. Because most SparseFormer operators act on tokens in the latent space rather than in the dense image space, and the number of tokens is limited, memory footprints are smaller and throughputs are higher than those of dense architectures. This results in a better accuracy-throughput trade-off, especially in the low-compute regime.
Thanks to its simple design, SparseFormer can also be extended to video classification, which is even more data-intensive and computationally expensive for dense vision models but well suited to the SparseFormer architecture. For example, with ImageNet-1K training, Swin-T at 4.5G FLOPs achieves 81.3% top-1 accuracy at a throughput of 726 images/s. In contrast, the compact variant of SparseFormer at 2.0G FLOPs achieves 81.0% top-1 accuracy at a throughput of 1270 images/s. Visualizations show that SparseFormer can distinguish foregrounds from backgrounds using only end-to-end classification signals. The authors also discuss several techniques for scaling up SparseFormer to improve performance. According to experiments on the challenging Kinetics-400 video classification benchmark, the extension of SparseFormer to video classification delivers promising performance with less computation than dense architectures, demonstrating that the proposed sparse vision architecture works well even when given denser input data.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to join our 18k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.