Large Vision-Language Models (LVLMs) can interpret visual signals and follow simple user instructions, a capability achieved by fusing large language models (LLMs) with large-scale visual instruction tuning. However, LVLMs are typically aligned only through supervised fine-tuning (SFT) on handcrafted or LLM-generated datasets. While SFT works well for turning LVLMs from caption generators into instruction-following models, the resulting models can still produce harmful, dishonest, or unhelpful responses, suggesting they remain insufficiently aligned with human preferences. Furthermore, while prior work organizes visual instruction tuning samples in multi-turn form, the weak connections and interdependence between turns limit LVLMs' interaction ability. Here, interaction ability measures how well an LVLM can adjust its responses using prior context in multi-turn interactions. These two drawbacks limit the practical use of LVLMs as visual assistants.
Researchers from SRI International and the University of Illinois Urbana-Champaign present DRESS, an LVLM trained with natural language feedback (NLF) generated by LLMs (see Figure 1). The team instructs the LLMs, supplied with specific rules and detailed image annotations, to provide fine-grained feedback on LVLM responses. Following the practice of building human-aligned LLMs, this feedback annotation covers the 3H criteria: helpfulness, honesty, and harmlessness. The feedback measures the overall quality of a response along the 3H criteria and includes both a numerical score and the NLF itself. A novel aspect of the method is dividing NLF into two categories: critique and refinement. Critique NLF assesses the strengths and flaws of a response, while refinement NLF gives the LVLM concrete suggestions on how to improve its response toward the ground-truth reference. This classification yields a natural way to apply the two kinds of NLF to make LVLMs better aligned with humans and to improve their interaction ability.
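To make the two feedback categories concrete, the following is an illustrative sketch of what one NLF annotation record might look like. The class name, field names, and example strings are assumptions for illustration, not the authors' released data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one LLM-generated feedback annotation: a numerical
# score of overall 3H quality plus the two NLF categories described above.
@dataclass
class FeedbackAnnotation:
    response: str    # the LVLM response being judged
    score: int       # numerical rating of overall quality along the 3H criteria
    critique: str    # critique NLF: strengths and flaws of the response
    refinement: str  # refinement NLF: concrete suggestions toward the reference

# Toy example record (contents invented for illustration).
fb = FeedbackAnnotation(
    response="The image shows a red stop sign.",
    score=4,
    critique="Accurate but omits the street context visible in the image.",
    refinement="Mention the intersection and the partially visible street name.",
)
```

The key design point is that critique and refinement are stored as separate fields, since the training procedure uses them differently.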
To handle the non-differentiable nature of NLF, the research team generalizes conditional reinforcement learning to train LVLMs with such feedback. Specifically, DRESS is trained with a language modeling (LM) loss on responses, conditioned on the two kinds of NLF. The numerical scores are used to further refine DRESS so that it better matches user preferences, and through multi-turn interactions the model learns the meta-skill of improving its original responses with refinement NLF at inference time.
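A minimal sketch of the conditioning idea, under the assumption (standard for conditional training on non-differentiable feedback) that the feedback text is prepended as a prefix and the LM loss is masked so gradients flow only through the response tokens. The function name and toy token ids are illustrative, not the authors' code.

```python
def build_conditioned_example(prompt_ids, feedback_ids, response_ids,
                              ignore_index=-100):
    """Concatenate [prompt, feedback NLF, response] and build loss labels.

    Positions covering the prompt and the feedback prefix are set to
    ignore_index so the LM loss applies only to the response, i.e. the
    model learns to generate the response *conditioned on* the NLF.
    """
    input_ids = prompt_ids + feedback_ids + response_ids
    labels = ([ignore_index] * (len(prompt_ids) + len(feedback_ids))
              + response_ids)
    return input_ids, labels

# Toy ids standing in for a tokenized question, a critique NLF, and a response.
inp, lab = build_conditioned_example([1, 2, 3], [10, 11], [20, 21, 22])
```

With labels built this way, a standard cross-entropy LM loss (e.g. with `ignore_index=-100`) sidesteps the non-differentiability of the feedback itself: the NLF only enters as conditioning context.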
The research team evaluates DRESS on open-ended visual questions (helpfulness), image captioning (honesty), adversarial prompts (harmlessness), and multi-turn interactions. Experimental findings show that, compared with prior LVLMs, DRESS produces responses better aligned with human values and has stronger interaction capabilities, efficiently learning from feedback and revising its responses as needed. To the best of their knowledge, this is the first effort to address both interactability and the 3H criteria for LVLMs.
The contributions of the research team are summarized below:
• The research team proposes using natural language feedback (NLF), divided into critique NLF and refinement NLF, to improve LVLMs' alignment with human preferences and their interaction ability.
• By training the model to generate responses conditioned on the NLF, the research team generalizes conditional reinforcement learning to accommodate the non-differentiable NLF. Compared with the previous SOTA, the proposed model, DRESS, achieves relative improvements of 9.76%, 11.52%, and 21.03% on systematic evaluations of helpfulness, honesty, and harmlessness alignment.
• The research team generates and publicly releases 63K annotated NLF samples covering the 3H aspects. In addition, they release a publicly available dataset of 4.7K samples for safety alignment and LVLM evaluation.
Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects that harness the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.