The deep learning revolution in computer vision has moved the field from manually designed features to data-driven representations, progressively reducing hand-crafted inductive biases. This paradigm shift aims to create more versatile systems that excel across a range of vision tasks. Although the Transformer architecture has proven effective on different data modalities, it still retains some inductive biases when applied to images. The Vision Transformer (ViT) removes the spatial hierarchy of ConvNets but maintains translation equivariance and locality through its patch projection and position embeddings. The open challenge is whether removing these remaining inductive biases can further improve model performance and versatility.
Previous attempts to remove locality from vision architectures have been limited. Most modern vision architectures, including those aimed at simplifying inductive biases, still build locality into their design, and even pre-deep-learning visual features such as SIFT and HOG relied on local descriptors. Efforts to eliminate locality in ConvNets, such as replacing spatial convolutional filters with 1×1 filters, led to performance degradation. Other approaches, such as iGPT and Perceiver, explored pixel-level processing but either faced efficiency challenges or fell short of the performance of simpler methods.
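To make that ConvNet ablation concrete, the snippet below contrasts a spatial filter with a 1×1 filter in PyTorch. This is an illustrative sketch of the kind of ablation described above; the layer sizes are arbitrary assumptions, not taken from any specific study.

```python
import torch.nn as nn

# A 3x3 convolution mixes each pixel with its spatial neighborhood: locality is built in.
local_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# A 1x1 convolution mixes channels only, with no spatial context,
# removing locality from the operation (historically at a cost in accuracy).
pointwise_conv = nn.Conv2d(64, 64, kernel_size=1)
```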
Researchers from FAIR at Meta AI and the University of Amsterdam challenge the conventional wisdom that locality is a fundamental inductive bias for vision tasks. They find that treating individual pixels as tokens for the Transformer, with learned position embeddings and no locality bias, leads to better performance than conventional approaches like ViT. They call this approach the "Pixel Transformer" (PiT) and demonstrate its effectiveness on various tasks, including supervised classification, self-supervised learning, and image generation with diffusion models, where PiT outperforms baselines equipped with inductive locality biases. The researchers acknowledge, however, that while locality may not be necessary, it remains useful for practical considerations such as computational efficiency. The study conveys a compelling message: locality is not an indispensable inductive bias for model design.
PiT closely follows the standard Transformer encoder architecture, processing an unordered set of input image pixels with learnable position embeddings. The input sequence is mapped to a sequence of representations through multiple layers of Self-Attention and MLP blocks. Each pixel is projected into a high-dimensional vector through a linear projection layer, and a learnable [cls] token is added to the sequence. Content-independent position embeddings are learned for each position. This design eliminates the inductive locality bias and makes PiT permutation equivariant at the pixel level.
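Here is a minimal PyTorch sketch of a PiT-style encoder following that description. The class name, image size, and hyperparameters (dim, depth, heads) are illustrative assumptions, not the authors' released configuration.

```python
import torch
import torch.nn as nn

class PixelTransformer(nn.Module):
    """Illustrative PiT-style model: every pixel is a token."""
    def __init__(self, image_size=28, in_channels=3, dim=192,
                 depth=12, heads=3, num_classes=1000):
        super().__init__()
        num_pixels = image_size * image_size
        # Each pixel (e.g., an RGB triplet) is linearly projected to a d-dim token.
        self.pixel_proj = nn.Linear(in_channels, dim)
        # Learnable [cls] token prepended to the pixel sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Content-independent, learned position embeddings (one per pixel + cls).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_pixels + 1, dim))
        # Standard Transformer encoder: Self-Attention + MLP blocks.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, C, H, W)
        B = x.shape[0]
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per pixel
        tokens = self.pixel_proj(tokens)       # (B, H*W, dim)
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])         # classify from the [cls] token
```

Note that the only structural difference from a ViT is the tokenizer: the patch-embedding step is replaced by a per-pixel linear projection, so no spatial-neighborhood information is baked into the model.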
In empirical evaluations, PiT demonstrates competitive performance across tasks. For image generation with diffusion models, PiT-L outperforms the reference DiT-L/2 on multiple metrics, including FID, sFID, and IS. The effectiveness of PiT generalizes well across different tasks, architectures, and operating resolutions. Furthermore, on CIFAR-100 with 32×32 inputs, PiT substantially outperforms ViT. The researchers also found that self-supervised pre-training with MAE improves PiT's accuracy compared to training from scratch, and that the gap between ViT and PiT under pre-training widens when moving from Tiny to Small models, suggesting that PiT can potentially scale better than ViT.
While PiT demonstrates that Transformers can work directly with individual pixels as tokens, practical limitations remain due to computational complexity. However, this exploration challenges the notion that locality is critical to vision models and suggests that patchification is primarily a useful efficiency heuristic that trades accuracy for compute. This finding opens new avenues for designing next-generation models in computer vision and beyond, potentially leading to more versatile and scalable architectures that rely less on manually induced priors and more on learnable, data-driven alternatives.
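To see why patchification matters so much for efficiency, recall that self-attention cost grows quadratically with the number of tokens. The back-of-the-envelope calculation below uses a standard 224×224 ImageNet resolution and a 16×16 patch size as illustrative assumptions.

```python
# Self-attention scales roughly with L^2, where L is the token count.
H = W = 224
pixel_tokens = H * W                    # 50,176 tokens: one per pixel
patch_tokens = (H // 16) * (W // 16)    # 196 tokens: one per 16x16 patch

ratio = (pixel_tokens ** 2) / (patch_tokens ** 2)
print(f"{pixel_tokens=} {patch_tokens=} attention cost ratio = {ratio:,.0f}x")
# Roughly 65,536x more pairwise attention work at the pixel level,
# which is why patchification remains a practical necessity at scale.
```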
Check out the Paper. All credit for this research goes to the researchers of this project.
Asjad is a consulting intern at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.