Transformers have demonstrated remarkable abilities in various natural language processing (NLP) tasks, including language modeling, machine translation, and text generation. These neural network architectures have since been extended beyond text, driving significant advances in computer vision as well.
One of the main advantages of the Transformer architecture is its ability to capture long-range dependencies in text, which is crucial for many NLP tasks. However, this comes at the cost of high computational requirements, making it difficult to train large transformer models.
In recent years, researchers have pushed Transformers to ever-larger scales using more powerful hardware and distributed training techniques. This has led to significant improvements in language-model performance on various benchmarks, such as GLUE and SuperGLUE.
Large Language Models (LLMs) such as PaLM and GPT-3 have shown that scaling Transformers to hundreds of billions of parameters improves performance and unlocks emergent capabilities. However, the largest dense models for image understanding have only reached 4 billion parameters, even though research on multimodal models such as PaLI indicates that they benefit from scaling both their vision and language backbones. Motivated by the results of scaling LLMs, the researchers decided to take the next step in scaling the Vision Transformer.
The article introduces ViT-22B, the largest dense vision model to date, with 22 billion parameters, 5.5 times larger than the previous largest vision backbone, ViT-e, which has 4 billion parameters. To achieve this scale, the researchers incorporate ideas from scaling text models such as PaLM, including improvements to training stability through QK normalization and to training efficiency through a novel approach called asynchronous parallel linear operations. With its modified architecture, efficient sharding recipe, and custom implementation, ViT-22B could be trained on Cloud TPUs with high hardware utilization. The model advances the state of the art on many vision tasks, whether using frozen representations or full fine-tuning. Furthermore, it has been used successfully in PaLM-E, which demonstrated that a large model combining ViT-22B with a language model can significantly advance robotics tasks.
The researchers built on advances in large language models such as PaLM and GPT-3 to create ViT-22B. They used parallel layers, where the attention and MLP blocks are executed in parallel instead of sequentially as in the standard Transformer architecture. This approach was used in PaLM and reduced training time by 15%.
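To make the idea concrete, here is a minimal sketch in JAX (the article itself shows no code, so the shapes, the simplified single-head attention, and the helper functions are illustrative assumptions rather than the ViT-22B implementation) contrasting a standard sequential block with a parallel one:

```python
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (no learned bias, as in ViT-22B).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def attention(x, w_q, w_k, w_v, w_o):
    # Simplified single-head self-attention over a (tokens, dim) input.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    logits = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(logits, axis=-1) @ v @ w_o

def mlp(x, w1, w2):
    # Two-layer feed-forward block with GELU.
    return jax.nn.gelu(x @ w1) @ w2

def sequential_block(x, params):
    # Standard Transformer: attention first, then MLP, each with its own residual.
    x = x + attention(layer_norm(x), *params["attn"])
    x = x + mlp(layer_norm(x), *params["mlp"])
    return x

def parallel_block(x, params):
    # Parallel layers: both branches consume the same normalized input and
    # their outputs are added to the residual stream in a single step.
    h = layer_norm(x)
    return x + attention(h, *params["attn"]) + mlp(h, *params["mlp"])

# Toy usage with placeholder sizes.
key = jax.random.PRNGKey(0)
d, tokens = 64, 16
keys = jax.random.split(key, 7)
params = {
    "attn": [jax.random.normal(k, (d, d)) * 0.02 for k in keys[:4]],
    "mlp": [jax.random.normal(keys[4], (d, 4 * d)) * 0.02,
            jax.random.normal(keys[5], (4 * d, d)) * 0.02],
}
x = jax.random.normal(keys[6], (tokens, d))
y = parallel_block(x, params)  # shape (tokens, d)
```

Because both branches consume the same LayerNorm output, their input projections can be fused into larger matrix multiplications, which is where the reported training speedup comes from.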
ViT-22B omits biases in the QKV projections and LayerNorms, which increases hardware utilization by 3%. Sharding is necessary for models of this scale, and the team shards both model parameters and activations. They developed an asynchronous parallel linear operations approach, in which the communication of activations and weights between devices happens at the same time as computations in the matrix multiply unit, minimizing the time spent waiting on incoming communication and increasing device efficiency.
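The following sketch uses JAX's sharding API only to illustrate the general idea of sharding both parameters and activations across devices; the mesh layout, shapes, and partition specs are assumptions for illustration, and the asynchronous overlap of communication with the matrix multiplications comes from ViT-22B's custom implementation, which is not reproduced here. The compiler simply inserts the collectives needed for the sharded matmul:

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

# Build a 1 x N device mesh: a "data" axis for activations, a "model" axis for weights.
n_devices = len(jax.devices())
devices = mesh_utils.create_device_mesh((1, n_devices))
mesh = Mesh(devices, axis_names=("data", "model"))

d_model, d_ff = 1024, 4096
w = jax.random.normal(jax.random.PRNGKey(0), (d_model, d_ff)) * 0.02
x = jax.random.normal(jax.random.PRNGKey(1), (8, 256, d_model))

# Shard the weight matrix column-wise across the "model" axis and the
# activations along the batch dimension across the "data" axis.
w_sharded = jax.device_put(w, NamedSharding(mesh, P(None, "model")))
x_sharded = jax.device_put(x, NamedSharding(mesh, P("data", None, None)))

@jax.jit
def linear(x, w):
    # A plain linear operation; under jit, XLA adds the communication needed
    # to combine the per-device shards of the result.
    return x @ w

y = linear(x_sharded, w_sharded)
print(y.shape)     # (8, 256, 4096)
print(y.sharding)  # output sharding inferred by the compiler
```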
Initially, the new scale of the model resulted in severe training instabilities. The QK normalization approach of Gilmer et al. (2023, forthcoming), which applies LayerNorm to the queries and keys before the attention computation, resolved these issues and allowed smooth, stable model training.
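Below is a minimal single-head sketch of QK normalization under assumed shapes and without the learned scale parameters a full implementation would carry, so it should be read as an illustration of the technique rather than the exact ViT-22B code:

```python
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def qk_normalized_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # QK normalization: apply LayerNorm to queries and keys before the
    # dot product so the attention logits stay in a bounded range.
    q, k = layer_norm(q), layer_norm(k)
    logits = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(logits, axis=-1) @ v

# Toy usage with placeholder sizes.
key = jax.random.PRNGKey(0)
d, tokens = 64, 16
kx, kq, kk, kv = jax.random.split(key, 4)
x = jax.random.normal(kx, (tokens, d))
w_q, w_k, w_v = (jax.random.normal(k, (d, d)) * 0.02 for k in (kq, kk, kv))
out = qk_normalized_attention(x, w_q, w_k, w_v)  # shape (tokens, d)
```

At this scale, unnormalized attention logits can grow extremely large, producing nearly one-hot attention weights and divergent training loss; normalizing the queries and keys keeps the logits bounded and training stable.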
ViT-22B was evaluated against human comparison data and shows state-of-the-art alignment with human visual object recognition. Like humans, the model is strongly shape-biased, primarily using the shape of an object rather than its surface texture to inform classification decisions. This suggests a greater similarity to human perception than standard models exhibit.
ViT-22B is the largest vision Transformer to date, with 22 billion parameters, and achieves state-of-the-art performance thanks to critical architectural changes. It shows greater similarity to human visual perception and offers benefits in fairness and robustness. Using the frozen model to produce embeddings and then training thin layers on top yields excellent performance across a range of benchmarks.
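As an illustration of that "frozen embeddings plus thin head" recipe, here is a small JAX sketch that trains only a linear classifier on top of fixed features. The feature dimension, class count, and random stand-in embeddings are hypothetical; in practice the embeddings would come from the frozen ViT-22B backbone and a real labeled dataset:

```python
import jax
import jax.numpy as jnp

def linear_probe_loss(params, embeddings, labels):
    # Cross-entropy loss of a linear classifier on top of frozen embeddings.
    logits = embeddings @ params["w"] + params["b"]
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(log_probs, labels[:, None], axis=-1))

# Hypothetical shapes: 1,024-dim features standing in for frozen backbone outputs, 10 classes.
n, d, classes = 512, 1024, 10
embeddings = jax.random.normal(jax.random.PRNGKey(0), (n, d))
labels = jax.random.randint(jax.random.PRNGKey(1), (n,), 0, classes)

params = {"w": jnp.zeros((d, classes)), "b": jnp.zeros((classes,))}
grad_fn = jax.jit(jax.grad(linear_probe_loss))

for _ in range(100):
    grads = grad_fn(params, embeddings, labels)
    # Only the thin head is updated; the backbone (embeddings) stays frozen.
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
```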
Check out the paper and the Google AI blog post. All credit for this research goes to the researchers of this project.
Niharika is a technical consulting intern at Marktechpost. She is a third-year student, currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a strong interest in machine learning, data science, and artificial intelligence, and an avid reader of the latest developments in these fields.