Several recent vision-language models have demonstrated remarkable multimodal generation abilities, but they typically require training enormous models on enormous datasets. As a scalable alternative, the researchers present Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts. By inheriting most of its network weights from publicly available pretrained domain experts and keeping them frozen during training, Prismer only requires training a few components.
Large pretrained models exhibit exceptional generalization across many different tasks. However, these capabilities come at a high price: they require massive amounts of training data and computational resources for both training and inference. Models with hundreds of billions of trainable parameters are now common in the language domain, and they typically demand a computing budget on the yottaFLOP scale.
Vision-language learning problems are even harder to solve. This field is a superset of language processing that additionally requires visual and multimodal reasoning expertise. Prismer is a data-efficient vision-language model that leverages a diverse set of pretrained experts through their projected multimodal signals. It can handle vision-language reasoning tasks such as visual question answering and image captioning. Like the prism it is named after, Prismer splits a general reasoning task into several smaller, more manageable pieces.
Two of Prismer’s most important design features are (i) web-scale pretrained vision-only and language-only models that serve as its core network backbones, and (ii) modality-specific vision experts that encode multiple types of visual information, from low-level vision cues such as depth to high-level cues such as instance and semantic labels, as auxiliary knowledge taken directly from their corresponding network outputs. The researchers developed a visually conditioned autoregressive text generation model to better utilize these pretrained domain experts for exploratory vision-language reasoning tasks.
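As a rough, self-contained illustration of how such auxiliary knowledge can be collected, the sketch below runs a set of frozen "experts" on an RGB image and stacks their label maps with the input. The dummy one-layer experts and the channel layout are assumptions made for illustration, not the actual pretrained models the paper uses.

```python
import torch
import torch.nn as nn

# Stand-in task experts: in practice these would be large pretrained
# models (a depth estimator, a surface-normal predictor, a segmenter).
# Each is a dummy 1x1 conv here so the sketch stays self-contained.
def make_dummy_expert(out_channels: int) -> nn.Module:
    expert = nn.Conv2d(3, out_channels, kernel_size=1)
    expert.eval()
    for p in expert.parameters():
        p.requires_grad = False  # expert weights are frozen, never updated
    return expert

experts = {
    "depth": make_dummy_expert(1),         # per-pixel depth map
    "normals": make_dummy_expert(3),       # surface-normal vectors
    "segmentation": make_dummy_expert(1),  # predicted segmentation labels
}

rgb = torch.rand(1, 3, 224, 224)  # a batch containing one RGB image

# Run every frozen expert once and stack its label map with the RGB
# input, forming the multimodal signal fed to the vision encoder.
with torch.no_grad():
    aux = [expert(rgb) for expert in experts.values()]
multimodal_input = torch.cat([rgb] + aux, dim=1)
print(multimodal_input.shape)  # torch.Size([1, 8, 224, 224])
```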
Although Prismer was trained on only 13 million publicly available image/alt-text pairs, it shows strong multimodal reasoning performance on tasks such as image captioning, image classification, and visual question answering, competitive with many state-of-the-art vision-language models. The researchers conclude with a thorough investigation of Prismer’s learning behavior, where they find several desirable characteristics.
Model design:
The Prismer model is an encoder-decoder transformer that relies on a large pool of already trained subject-matter experts to speed up training. It consists of a vision encoder and an autoregressive language decoder. The vision encoder receives a sequence of RGB and multimodal labels (depth, surface normal, and segmentation labels predicted by the frozen pretrained experts) as input and produces a sequence of RGB and multimodal features as output. The language decoder is then conditioned on these features via cross-attention to generate a sequence of text tokens.
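Below is a minimal sketch of this encoder-decoder layout, reusing the 8-channel stacked input from the earlier snippet. The `MiniPrismer` class, its dimensions, and its layer counts are toy assumptions rather than the paper’s configuration; the sketch only shows where the cross-attention conditioning sits.

```python
import torch
import torch.nn as nn

class MiniPrismer(nn.Module):
    """Toy encoder-decoder in the spirit of Prismer's layout. Dimensions,
    depths, and the 8-channel input are illustrative assumptions."""

    def __init__(self, in_channels: int = 8, dim: int = 256, vocab: int = 1000):
        super().__init__()
        # Vision encoder: patchify the stacked RGB + expert labels, then
        # contextualize the patch tokens with self-attention.
        self.patch_embed = nn.Conv2d(in_channels, dim, kernel_size=16, stride=16)
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        # Autoregressive language decoder, conditioned on the visual
        # features via cross-attention (the `memory` argument below).
        self.token_embed = nn.Embedding(vocab, dim)
        dec = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, multimodal_image, text_tokens):
        patches = self.patch_embed(multimodal_image)  # (B, D, H', W')
        patches = patches.flatten(2).transpose(1, 2)  # (B, N, D) patch tokens
        visual_features = self.encoder(patches)
        # A causal mask keeps text generation autoregressive.
        T = text_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(self.token_embed(text_tokens),
                              memory=visual_features, tgt_mask=causal)
        return self.lm_head(hidden)  # next-token logits

model = MiniPrismer()
logits = model(torch.rand(1, 8, 224, 224), torch.randint(0, 1000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 1000])
```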
Advantages:
- The Prismer model has several benefits, but one of the most notable is its extreme data efficiency during training. Prismer builds on pretrained vision-only and language-only backbone models to achieve this, with a significant reduction in the GPU hours required to reach performance comparable to other state-of-the-art vision-language models. These pretrained parameters let the model draw on the massive amounts of knowledge available at web scale.
- The researchers also designed a multimodal signal input for the vision encoder. The generated multimodal auxiliary knowledge better captures the semantics and details of the input image. Prismer’s architecture is optimized to maximize the use of trained experts with few trainable parameters, as sketched after this list.
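To make the parameter-efficiency concrete, here is a minimal sketch of the freeze-the-backbones recipe, assuming tiny placeholder modules in place of the real vision and language backbones and a hypothetical `adaptor` as the lightweight trainable component.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a pretrained component so it receives no gradient updates."""
    for p in module.parameters():
        p.requires_grad = False
    return module

# Placeholder modules stand in for the real backbones; in a Prismer-style
# setup the large pretrained vision and language models are frozen and
# only small connecting modules are trained.
vision_backbone = freeze(nn.Linear(768, 768))    # stand-in for a ViT
language_backbone = freeze(nn.Linear(768, 768))  # stand-in for an LM
adaptor = nn.Linear(768, 768)                    # small trainable component

modules = (vision_backbone, language_backbone, adaptor)
trainable = sum(p.numel() for m in modules for p in m.parameters()
                if p.requires_grad)
total = sum(p.numel() for m in modules for p in m.parameters())
print(f"trainable parameters: {trainable} of {total}")

# The optimizer only ever sees the trainable component, so the frozen
# backbones cost no optimizer state and no backward updates.
optimizer = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)
```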
The researchers have included two varieties of pretrained experts in Prismer:
- Backbone experts: the pretrained vision-only and language-only models responsible for translating images and text into meaningful sequences of tokens.
- Task experts: models that produce task-specific labels (such as depth maps or segmentation masks), which vary depending on the data used in their training.
Properties:
- More experts, better results: as the number of modality experts in Prismer grows, its performance improves.
- Higher-quality experts, better results: to assess the effect of expert quality on Prismer’s performance, the researchers construct a corrupted depth expert by replacing a fraction of its predicted depth labels with random noise drawn from a uniform distribution (see the sketch after this list).
- Robustness to noisy experts: the findings further demonstrate that Prismer’s performance remains stable even when experts that predict noise are included.
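A minimal sketch of that corruption procedure follows; the `corrupt_depth_labels` helper and the exact noise range are assumptions, since the article only specifies uniform noise replacing a fraction of the labels.

```python
import torch

def corrupt_depth_labels(depth: torch.Tensor, fraction: float) -> torch.Tensor:
    """Replace a given fraction of depth values with uniform random noise,
    in the spirit of the corrupted-expert ablation. The helper name and
    the U(0, 1) noise range are illustrative assumptions."""
    noise = torch.rand_like(depth)            # uniform noise in [0, 1)
    mask = torch.rand_like(depth) < fraction  # pixels selected for corruption
    return torch.where(mask, noise, depth)

depth = torch.rand(1, 1, 224, 224)  # stand-in for an expert's depth map
corrupted = corrupt_depth_labels(depth, fraction=0.5)
print((corrupted != depth).float().mean())  # approximately the fraction
```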
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 15k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Dhanshree Shenwai is a Computer Engineer with solid experience in FinTech companies covering the Finance, Cards & Payments, and Banking domains, along with a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today’s evolving world, making everyone’s life easier.