Vision Transformers (ViTs) are a type of neural network architecture that has become very popular for vision tasks such as image classification, semantic segmentation, and object detection. The main difference between a vision transformer and the original Transformer built for text is that the discrete text tokens are replaced with continuous pixel values extracted from image patches. ViTs extract features from an image by attending to its different regions and combining them to make a prediction. However, despite their recent widespread use, little is known about the inductive biases or the features that ViTs tend to learn. While feature visualizations and image reconstructions have been successful in understanding how convolutional neural networks (CNNs) work, these methods have been far less successful for ViTs, which are difficult to visualize.
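As a concrete illustration, the following minimal PyTorch sketch shows how an image is cut into patches, linearly projected into a token sequence, and prepended with a CLS token; the sizes mirror ViT-B/16 but are purely illustrative, not any particular implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one to a
    token vector, mirroring how ViTs replace discrete text tokens with
    continuous patch embeddings. Sizes here are illustrative (ViT-B/16-like)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        B = x.shape[0]
        x = self.proj(x)                      # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, D)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend the CLS token
        return x + self.pos_embed             # add positional embeddings


tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The resulting token sequence is then processed by standard transformer blocks, exactly as word tokens would be in a language model.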
The latest work by a group of researchers from the University of Maryland, College Park and New York University expands the ViT literature with an in-depth study of its behavior and internal processing mechanisms. The authors established a visualization framework that synthesizes images which maximally activate neurons in a ViT model. In particular, the method takes gradient steps to maximize feature activations, starting from random noise and applying various regularization techniques, such as penalizing total variation and using augmentation ensembling, to improve the quality of the generated images.
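The sketch below illustrates this style of activation maximization on a pretrained torchvision vit_b_16; the chosen block, channel, learning rate, and regularization weights are arbitrary placeholders rather than the paper's settings.

```python
import torch
import torchvision.transforms as T
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Hypothetical target: one channel of the MLP inside a mid-depth encoder block.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()
model.requires_grad_(False)
layer, channel = model.encoder.layers[6].mlp[0], 42

activations = {}
layer.register_forward_hook(lambda m, inp, out: activations.update(feat=out))

def total_variation(img):
    # Penalize differences between neighbouring pixels to reduce noise.
    return (img[..., 1:, :] - img[..., :-1, :]).abs().mean() + \
           (img[..., :, 1:] - img[..., :, :-1]).abs().mean()

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.9, 1.0)),
    T.RandomAffine(degrees=5, translate=(0.02, 0.02)),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

x = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    # Average the objective over a small ensemble of augmented copies.
    batch = torch.cat([augment(x) for _ in range(4)])
    model(batch)
    act = activations["feat"][:, 1:, channel].mean()  # patch tokens only
    loss = -act + 0.05 * total_variation(x)           # maximize activation
    loss.backward()
    opt.step()
    x.data.clamp_(0, 1)
```

After optimization, `x` is an image-like tensor that strongly excites the chosen feature, which is what the authors inspect to interpret what individual ViT features respond to.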
The analysis found that patch tokens in ViTs preserve spatial information across all layers except the last attention block, which learns a token-mixing operation similar to the average pooling widely used in CNNs. The authors observed that representations remain spatially local, even for individual channels in deep layers of the network.
Interestingly, the CLS token appears to play a relatively minor role throughout the network and is not used to aggregate global information until the very last layer. The authors tested this hypothesis by performing inference on images without the CLS token in layers 1-11 and then inserting a value for the CLS token in layer 12. The resulting ViT still classified 78.61% of the ImageNet validation set correctly, compared with the original 84.20%.
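A rough sketch of this kind of late-CLS ablation is shown below, built on torchvision's vit_b_16 and its internal attributes (class_token, encoder.pos_embedding, encoder.layers). The value spliced in for the CLS token here (the learned class token plus its positional embedding) and the random input batch are assumptions for illustration, not the authors' exact setup.

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()

@torch.no_grad()
def classify_with_late_cls(images, inject_at=11):
    """Run the encoder on patch tokens only, then splice a CLS token
    (with its positional embedding) back in before block `inject_at`."""
    x = model._process_input(images)           # (B, 196, 768) patch tokens
    pos = model.encoder.pos_embedding          # (1, 197, 768), CLS slot first
    x = x + pos[:, 1:]                         # add patch positions only

    for i, block in enumerate(model.encoder.layers):
        if i == inject_at:
            cls = (model.class_token + pos[:, :1]).expand(x.shape[0], -1, -1)
            x = torch.cat([cls, x], dim=1)     # CLS token appears only now
        x = block(x)

    x = model.encoder.ln(x)
    return model.heads(x[:, 0])                # classify from the CLS token

logits = classify_with_late_cls(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

The point of the experiment is that, even though the CLS token never interacts with the patch tokens before the final block, the classifier built on it remains largely functional.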
Both CNNs and ViTs were also found to exhibit progressive feature specialization: earlier layers recognize basic image features such as color and edges, while deeper layers recognize more complex structures. One important difference found by the authors, however, concerns how ViTs and CNNs rely on foreground and background image features. The study found that ViTs are significantly better than CNNs at using the background information in an image to identify the correct class, and they suffer less from background removal. Furthermore, ViT predictions are more resistant to the removal of high-frequency texture information than ResNet models (results are visible in Table 2 of the paper).
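One simple way to probe this kind of frequency robustness, though not necessarily the paper's exact protocol, is to low-pass filter images in the Fourier domain and compare how much accuracy drops for a ViT versus a ResNet; the batch below is a random stand-in for normalized ImageNet validation data.

```python
import torch
from torchvision.models import resnet50, vit_b_16, ResNet50_Weights, ViT_B_16_Weights

def low_pass(images, keep_fraction=0.25):
    """Remove high-frequency content by zeroing FFT coefficients outside a
    centered square that keeps `keep_fraction` of each spatial axis."""
    f = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    H, W = images.shape[-2:]
    h, w = int(H * keep_fraction / 2), int(W * keep_fraction / 2)
    mask = torch.zeros_like(f)
    mask[..., H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1
    return torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real

@torch.no_grad()
def accuracy(model, images, labels):
    return (model(images).argmax(dim=1) == labels).float().mean().item()

vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()
cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

# `images` and `labels` are placeholders for a normalized ImageNet batch.
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
filtered = low_pass(images)
print("ViT drop:   ", accuracy(vit, images, labels) - accuracy(vit, filtered, labels))
print("ResNet drop:", accuracy(cnn, images, labels) - accuracy(cnn, filtered, labels))
```

On real validation data, a smaller accuracy drop for the ViT under such filtering would be consistent with the paper's observation that ViTs depend less on high-frequency texture than ResNets.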
Finally, the study also briefly discusses the representations learned by ViT models trained with the Contrastive Language-Image Pretraining (CLIP) framework, which connects images and text. Interestingly, the authors found that CLIP-trained ViTs produce features in their deeper layers that are activated by objects belonging to clearly discernible conceptual categories, unlike classifier-trained ViTs. This is reasonable yet surprising, because text found on the Internet provides training targets for abstract and semantic concepts such as "morbidity" (examples are visible in Figure 11 of the paper).
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Lorenzo Brigato is a Postdoctoral Researcher at the ARTORG Center, a research institution affiliated with the University of Bern, where he currently works on applications of AI to health and nutrition. He holds a PhD in Computer Science from the Sapienza University of Rome, Italy; his thesis focused on image classification problems with poor data distributions across samples and labels.