The field of Visual Question Answering (VQA) focuses on developing AI systems that can correctly answer natural-language questions about a given image. A system that achieves this demonstrates a deeper understanding of images in general, as it must be able to answer questions about many different aspects of an image.
Large-scale annotated datasets for VQA have recently fueled progress in multimodal vision-and-language learning. To answer questions appropriately, models need to analyze scenes and discover useful associations between the two modalities. Recently, transformer-based vision-language (VL) models pre-trained on large-scale multimodal corpora have achieved remarkable accuracy on benchmark VQA datasets. VQA generally requires more than a literal interpretation of an image (e.g., “A plate with meat, potatoes, and bread”); it also calls for the ability to draw conclusions about the image’s context (e.g., “The dish is probably served in a restaurant”).
Humans make inferences like these from experience and common sense. Most current approaches rely on the world knowledge implicit in language models, which frequently lacks precision and comprehensiveness, since commonsense information is usually taken for granted and rarely stated explicitly. The overrepresentation of exceptional facts in textual corpora (such as “people die in accidents”), at the expense of mundane truths that are rarely written down because everyone knows them (such as “people eat”), is a known problem for commonsense knowledge learned from text.
Such questions go beyond simple image recognition and involve factual or commonsense knowledge. As a result, neurosymbolic approaches have emerged that combine transformer-based representations with knowledge bases (KBs). However, retrieving useful information directly from a KB can be difficult due to gaps in coverage and the fact that KB entries often apply only to specific scenarios.
A new study from the Vector Institute for AI and the University of British Columbia presents VLC-BERT (Vision-Language-Commonsense BERT), a framework that integrates the VL-BERT vision-language transformer with contextualized commonsense knowledge. Unlike most knowledge-based VQA systems, which typically follow a retrieval paradigm, the proposed approach uses COMET, a language model trained on commonsense knowledge graphs, to generate contextualized commonsense inferences from the question phrase paired with the image’s object labels.
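The sketch below illustrates the general idea of generating commonsense inferences with a COMET-style sequence-to-sequence model from the Hugging Face Hub. The checkpoint name, relation tag, and prompt format are assumptions for illustration, not the exact setup used by the VLC-BERT authors.

```python
# Illustrative sketch: generating commonsense inferences for a question
# paired with detected object labels, using a COMET-style seq2seq model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint name -- substitute a real COMET checkpoint from the Hub.
MODEL_NAME = "comet-atomic-2020"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def commonsense_inferences(question, object_tags, relation="AtLocation", num_beams=5):
    """Generate candidate commonsense inferences for a question plus object labels."""
    # COMET-style models are typically prompted with "<head event> <relation> [GEN]";
    # here the head event combines the question text with the image's object tags.
    prompt = f"{question} {' '.join(object_tags)} {relation} [GEN]"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        num_return_sequences=num_beams,
        max_new_tokens=24,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Example: an image of a plated meal with detected objects.
candidates = commonsense_inferences(
    "Where is this dish probably served?",
    ["plate", "meat", "potatoes", "bread"],
)
```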
The researchers use sentence transformers to prioritize and filter the commonsense inferences. To incorporate the filtered inferences into VLC-BERT, they use an attention-based fusion mechanism trained to focus on the inferences most relevant to each question. Some questions require only visual, factual, or direct knowledge, so commonsense knowledge is not always needed.
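A minimal sketch of these two ideas follows, under the assumption that inferences are ranked by SBERT similarity to the question and then fused with standard multi-head attention. The SBERT checkpoint, embedding dimension, and head count are illustrative choices, not the authors’ exact configuration.

```python
# Sketch: filter commonsense inferences with SBERT, then fuse them via attention.
import torch
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint (384-dim embeddings), not necessarily the authors' choice.
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def filter_inferences(question, inferences, top_k=5):
    """Keep the top_k inferences most semantically similar to the question."""
    q_emb = sbert.encode(question, convert_to_tensor=True)
    i_emb = sbert.encode(inferences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, i_emb)[0]
    top = torch.topk(scores, k=min(top_k, len(inferences)))
    return [inferences[i] for i in top.indices]

# Attention-based fusion: the question representation attends over the
# embeddings of the retained inferences, so the most relevant ones dominate.
fuse = torch.nn.MultiheadAttention(embed_dim=384, num_heads=4, batch_first=True)

def fuse_commonsense(question_vec, inference_vecs):
    """question_vec: shape (1, 1, 384); inference_vecs: shape (1, k, 384)."""
    fused, attn_weights = fuse(question_vec, inference_vecs, inference_vecs)
    return fused, attn_weights
```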
The team used weak supervision to help determine whether commonsense knowledge is useful for a given question, reducing the need to inject noisy knowledge in cases where it is not. Their experiments on the challenging OK-VQA and A-OKVQA datasets show that incorporating commonsense knowledge is beneficial for knowledge-intensive visual question answering tasks. They also note in one of their tweets that pre-training their models on VQA 2.0 helps overcome the challenge of training on smaller datasets such as OK-VQA.
In evaluating VLC-BERT on their models and datasets, the team identified the following caveats:
- Object tags are not enough to answer some questions, as they cannot capture or connect the many objects or events in an image.
- The model may lose some information due to the compression introduced by SBERT and the multi-head attention (MHA) when condensing the commonsense inferences.
- Large-scale models such as GPT-3 perform better than theirs, indicating that their model is limited by COMET and the knowledge bases on which it was trained.
The team views this work as a first step toward evaluating the feasibility of generative commonsense knowledge and investigating methods for determining when common sense is required. In the future, they plan to have COMET take visual context into account with respect to the many elements and events in an image. Furthermore, they aim to study the feasibility of multi-hop expansion with COMET to connect question-based and image-based expansions.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to join our Reddit page, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.