Convolutional Neural Networks (CNNs) have been the backbone of machine vision systems. They have served as the reference architecture for all kinds of problems, from object detection to image super-resolution. In fact, famous leaps in the deep learning domain (e.g., AlexNet) have been made possible by convolutional neural networks.
However, things changed when a new architecture based on the Transformer model, called the Vision Transformer (ViT), showed promising results and outperformed classical convolutional architectures, especially on large datasets. Since then, the field has been working to enable ViT-based solutions for problems that had been addressed with CNNs for years.
The ViT uses self-attention layers to process images, but the computational cost of these layers would scale quadratically with the number of pixels per image if applied naively at the per-pixel level. Therefore, the ViT first splits the image into several patches, embeds them linearly, and then applies the transformer directly to this collection of patches.
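To make the patch-embedding step concrete, here is a minimal PyTorch sketch. The image size, patch size, and embedding dimension below are illustrative (ViT-Base-like values), not prescribed by any particular model; the key point is that a strided convolution with kernel size equal to its stride implements "split into patches and linearly embed" in one operation.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 224x224 RGB image, 16x16 patches, 768-dim embeddings.
img = torch.randn(1, 3, 224, 224)
patch_size, dim = 16, 768

# A convolution with kernel_size == stride == patch_size is equivalent to
# cutting the image into non-overlapping patches and linearly embedding each.
patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

tokens = patch_embed(img)                   # (1, 768, 14, 14) feature map
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): one token per patch
print(tokens.shape)  # 196 = (224 / 16)^2 tokens, ready for the transformer
```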
Following the success of the original ViT, many works have modified the ViT architecture to improve its performance, for example by replacing self-attention with novel operations or making other small changes. Despite all these changes, almost all ViT architectures follow a common and simple template: they maintain the same size and resolution throughout the network and exhibit isotropic behavior, achieved by performing channel mixing and spatial mixing in alternating steps. In addition, all of these networks use patch embeddings, which allow downsampling at the start of the network and make this uniform mixing design straightforward.
This patch-based approach is the common design choice across ViT architectures, and it simplifies the overall design process. So, here comes the question: is the success of vision transformers primarily due to the patch-based representation? Or is it due to the use of advanced and expressive techniques such as self-attention and MLPs? What is the main factor behind the superior performance of vision transformers?
There is a way to find out, and it’s called ConvMixer.
ConvMixer is a convolutional architecture developed to analyze the performance of ViTs. It is very similar to the ViT in many ways: it operates directly on image patches, maintains a constant resolution throughout the network, and separates channel-wise mixing from spatial mixing of information.
However, the key difference is that ConvMixer achieves these operations using standard convolutional layers, as opposed to the self-attention mechanisms used in the Vision Transformer and MLP-Mixer models. As a result, the model is cheaper in terms of compute, because depthwise and pointwise convolution operations are cheaper than self-attention and MLP layers.
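The whole architecture fits in a few lines of PyTorch. The sketch below follows the structure described in the paper (and is close to its reference implementation): a patch embedding, followed by repeated blocks of a residual depthwise convolution (spatial mixing) and a pointwise convolution (channel mixing), with GELU activations and BatchNorm throughout.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection: x -> fn(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding: the only downsampling in the network.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Spatial mixing: depthwise convolution (groups=dim), large kernel.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Channel mixing: pointwise (1x1) convolution.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```

Note that resolution never changes after the patch embedding, so every block sees the same grid of patch representations, mirroring the isotropic ViT template.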
Despite its extreme simplicity, ConvMixer outperforms "standard" computer vision models such as ResNets with similar parameter counts, as well as some corresponding ViT and MLP-Mixer variants. This suggests that the patch-based isotropic mixing architecture is a powerful primitive that works well with almost any choice of well-behaved mixing operations.
ConvMixer is an extremely simple class of models that independently mixes the spatial and channel dimensions of patch embeddings using only standard convolutions. It gains a substantial performance boost from using large kernels, inspired by the large receptive fields of ViTs and MLP-Mixers. Finally, ConvMixer can serve as the foundation for future patch-based architectures with novel operations.
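Using the sketch above, a full model can be instantiated in one line. The paper names its models "ConvMixer-h/d" (hidden dimension h, depth d); the specific hyperparameters below are one large-kernel configuration given as an example, not the only choice:

```python
# Illustrative "ConvMixer-1536/20"-style configuration with a large 9x9 kernel.
model = ConvMixer(dim=1536, depth=20, kernel_size=9, patch_size=7, n_classes=1000)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```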
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.