The domain of computer vision has undergone significant advancement in the last decade, and this advancement can mainly be attributed to the emergence of Convolutional Neural Networks (CNNs). CNNs' remarkable ability to process 2D data, thanks to their hierarchical feature extraction mechanism, was a key factor behind their success.
Modern CNNs have come a long way since their introduction: updated training recipes, data augmentations, improved network design paradigms, and more. The literature is full of successful examples of such proposals that have made CNNs far more powerful and efficient.
At the same time, the open-source culture of the computer vision community has contributed significantly to this progress. Thanks to widely available pretrained large-scale visual models, feature learning became much more efficient, so most vision models no longer need to be trained from scratch.
Today, the performance of a vision model is mainly determined by three factors: the chosen neural network architecture, the training method, and the training data. Progress on any of the three yields a significant increase in overall performance.
Of these three, innovations in network architecture have played the most important role. CNNs eliminated the need for manual feature engineering by allowing the use of generic feature learning methods. Not long ago, Transformer architectures emerged in the domain of natural language processing and were transferred to vision, where they proved quite successful thanks to their strong scalability in both data and model size. Finally, in the last couple of years, the ConvNeXt architecture was introduced. It modernized traditional convolutional networks and showed that pure convolutional models could also scale.
There is a catch, however: all of these advances were measured on a single computer vision task, supervised image recognition on ImageNet. It remains the most common way of exploring the design space of neural network architectures.
Meanwhile, researchers have been looking for a different way to teach neural networks how to process images. Instead of using labeled images, they use a self-supervised approach in which the network has to figure out what is in the image by itself. Masked autoencoders are one of the most popular ways to achieve this; they are based on the masked language modeling technique widely used in natural language processing.
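To make the idea concrete, here is a minimal sketch of the masked-image-modeling objective in PyTorch: random patches are hidden, only the visible patches reach the encoder, and the reconstruction loss is computed on the masked patches alone. The shapes, masking ratio, and placeholder decoder output are illustrative assumptions, not the paper's exact setup.

```python
import torch

def random_patch_mask(num_patches: int, mask_ratio: float = 0.6) -> torch.Tensor:
    """Return a boolean mask: True = patch is hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    shuffle = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[shuffle[:num_masked]] = True
    return mask

patches = torch.randn(196, 768)             # e.g., a 14x14 grid of patch embeddings (hypothetical sizes)
mask = random_patch_mask(patches.shape[0])
visible = patches[~mask]                    # only visible patches are fed to the encoder
# ... encoder and decoder would predict the hidden content here ...
reconstruction = torch.randn_like(patches)  # placeholder standing in for the decoder output
loss = ((reconstruction[mask] - patches[mask]) ** 2).mean()  # loss only on masked patches
```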
In principle, it is possible to mix and match such techniques when training neural networks, but doing so is not straightforward. ConvNeXt can be combined with masked autoencoders; however, masked autoencoders are designed around Transformers, which process the visible patches as a sequence, so a naive combination may be too computationally expensive for convolutional networks. The design may also be incompatible with convolutions, whose sliding-window mechanism lets the network peek across masked regions. Moreover, previous research has shown that it is difficult to get good results when applying self-supervised methods such as masked autoencoders to convolutional networks. It is therefore crucial to note that different architectures can exhibit different feature-learning behaviors that affect the quality of the end result.
This is where ConvNeXt V2 comes in. It is a co-design of architecture and training that brings the masked autoencoder into the ConvNeXt framework and achieves results on par with those obtained using Transformers. It is a step toward making mask-based self-supervised learning effective for ConvNeXt models.
Designing a masked autoencoder for ConvNeXt was the first challenge, and the authors solved it cleverly: they treat the masked input as a set of sparse patches and use sparse convolutions to process only the visible parts. Furthermore, the Transformer decoder of the masked autoencoder is replaced with a single ConvNeXt block, making the entire structure fully convolutional, which in turn improves pretraining efficiency. A dense emulation of the sparse-convolution idea is sketched below.
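The following is a minimal sketch of one common way to emulate sparse convolution densely: zero out masked locations before and after each convolution so the kernel never reads hidden pixels and masked sites stay empty for the next layer. The block name, shapes, and masking ratio are illustrative assumptions; an actual implementation may instead rely on a dedicated sparse-convolution library.

```python
import torch
import torch.nn as nn

class MaskedConvBlock(nn.Module):
    """Dense emulation of sparse convolution for masked pretraining (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        # ConvNeXt-style 7x7 depthwise convolution
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask: (N, 1, H, W) with 1 = visible, 0 = masked
        x = x * mask        # hide masked pixels so the kernel never sees them
        x = self.dwconv(x)
        x = x * mask        # keep masked locations empty for the next layer
        return x

x = torch.randn(2, 96, 56, 56)
mask = (torch.rand(2, 1, 56, 56) > 0.6).float()  # roughly 60% of positions masked
out = MaskedConvBlock(96)(x, mask)
```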
Finally, a Global Response Normalization (GRN) layer is added to the framework to enhance inter-channel feature competition. Notably, this change is effective precisely when the model is pretrained with masked autoencoders, which suggests that reusing a fixed architecture designed for supervised learning may be suboptimal.
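Based on the formulation described in the paper, a GRN layer can be sketched as follows: aggregate a global L2 statistic per channel, normalize it divisively across channels, and use the result to recalibrate the features with learnable scale and bias plus a residual connection. Treat this as a sketch of the paper's formula on channels-last tensors, not as the official code.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization, following the ConvNeXt V2 paper's formulation."""

    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, H, W, C), channels-last
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)   # global feature aggregation per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)    # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x        # calibration plus residual connection
```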
In short, ConvNeXt V2 improves performance when used in conjunction with masked autoencoders because it is designed specifically for self-supervised learning. Fully convolutional masked autoencoder pretraining can significantly improve the performance of pure convolutional networks.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our Reddit page, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.