Transferring pre-trained vision backbones has improved performance on a wide range of vision tasks, mirroring the trend in natural language processing. Larger datasets, scalable infrastructure, and new training techniques have fueled this growth. Despite this, language models have pulled far ahead of vision models in scale and in the emergent capabilities that come with it. The largest dense language model has 540B parameters, while the largest dense vision model has only 4B parameters; even a moderately sized, entry-level competitive language model often comprises more than 10B parameters.
Sparse models show the same trend: sparse language models exceed a trillion parameters, while the largest reported sparse vision models have only 15B. In this work, the researchers present ViT-22B, the largest dense ViT model to date. They identify pathological training instabilities that prevent the default recipe from scaling to 22B parameters and introduce architectural changes that make it possible. In addition, they carefully engineer the model for highly efficient model-parallel training. ViT-22B is evaluated on a broad suite of tasks, ranging from classification to dense output tasks, where it matches or exceeds the current state of the art.
With 22 billion parameters, ViT-22B is the largest vision transformer to date. It achieves 89.5% accuracy on ImageNet even when used as a frozen visual feature extractor, and 85.9% zero-shot accuracy on ImageNet when a text tower is trained to match its visual features. The model is also an excellent teacher: using it as a distillation target, the researchers train a ViT-B student that reaches a state-of-the-art 88.6% on ImageNet. These gains are accompanied by improvements in reliability, uncertainty estimation, and fairness trade-offs. Finally, the model's behavior is closer to human perception, with a previously unseen shape bias of 87%.
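To make the distillation claim concrete, here is a minimal sketch of standard soft-label knowledge distillation in Python; it illustrates the general technique of training a student against a larger teacher's outputs, not necessarily the exact recipe used for the ViT-B student.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label knowledge distillation: the student is trained to match
    the teacher's softened class distribution via KL divergence."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```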
ViT-22B is a transformer-based encoder model whose architecture closely follows the original Vision Transformer, with parallel layers, query/key (QK) normalization, and omitted biases to improve training efficiency and stability at scale.
Parallel layers. Instead of applying the attention and MLP blocks sequentially, as in the standard Transformer, ViT-22B applies them in parallel. This also allows the linear projections of the MLP and attention blocks to be fused for additional parallelization.
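A minimal sketch of such a parallel block in PyTorch (the module names and sizes are illustrative, not the authors' JAX implementation):

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Transformer block where attention and MLP are applied in parallel
    to the same normalized input, instead of sequentially."""
    def __init__(self, dim=1024, heads=16, mlp_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Both branches read the same normalized input ...
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # ... and their outputs are summed into a single residual update:
        # y = x + Attention(LN(x)) + MLP(LN(x))
        return x + attn_out + self.mlp(h)
```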
QK normalization. While scaling ViT beyond previous efforts, the researchers observed diverging training loss after just a few thousand steps; models with around 8B parameters already showed this instability. It was caused by abnormally large attention logits, which produced attention weights that were nearly one-hot and had almost no entropy. To address this, they apply LayerNorm to the queries and keys before computing the dot-product attention. Omitted biases. Following PaLM, bias terms are removed from the QKV projections, and all LayerNorms are applied without bias or centering.
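A hedged PyTorch sketch of QK-normalized attention follows. It uses the standard nn.LayerNorm for simplicity (ViT-22B's norms additionally drop bias and centering), and the dimensions are placeholders:

```python
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head attention with LayerNorm applied to queries and keys
    before the dot product, and no bias terms in the QKV projections."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads
        # Bias-free QKV and output projections (following PaLM).
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Normalizing queries and keys keeps attention logits bounded,
        # avoiding the near one-hot, zero-entropy attention weights that
        # destabilized training at ~8B parameters.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```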
They demonstrate how the original design can be improved to achieve high hardware utilization and training stability, producing a model that advances the state of the art on several benchmarks. In particular, excellent performance can be obtained by keeping the model frozen, extracting its features, and training thin layers on top of them. Their analyses further demonstrate that ViT-22B surpasses previous models in fairness and robustness and is more human-like in its shape versus texture bias. The code and dataset have not yet been released.
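As an illustration of this frozen-features recipe (not the authors' exact setup), a hypothetical linear probe on top of a frozen backbone could look like this; `backbone` and `feat_dim` are placeholders, not the actual ViT-22B interface:

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int):
    """Freeze a pre-trained backbone and train only a thin head on top."""
    # Freeze every backbone parameter: only the head receives gradients.
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()

    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):
        with torch.no_grad():               # frozen feature extraction
            feats = backbone(images)
        logits = head(feats)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    return head, train_step
```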
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.