Vision-language models (VLMs) face a critical challenge: achieving robust generalization beyond their training data while keeping compute requirements and costs low. Approaches such as chain-of-thought supervised fine-tuning (CoT-SFT) often lead to overfitting, where models perform well on seen data but struggle with new, unseen scenarios. This limitation reduces their effectiveness in applications that demand adaptability, such as autonomous systems, medical imaging, and visual reasoning tasks. In addition, the prevailing assumption is that increasing model size is the key to improving performance. The need for a more efficient training paradigm that improves generalization, minimizes overfitting, and reduces computational cost has become crucial for advancing VLMs.
Deep Agent released R1-V to address some of these concerns. This new reinforcement learning approach improves the generalization ability of VLMs while remaining cost-effective. It demonstrates how reinforcement learning with verifiable rewards (RLVR) can outperform traditional CoT-SFT in effectiveness and robustness on out-of-distribution (OOD) data.
The main objective of the R1-V approach is to improve the ability of VLMs to generalize beyond their training datasets. R1-V addresses this problem with reinforcement learning techniques that guide the model to learn generalizable skills instead of memorizing training examples. In particular, it focuses on teaching VLMs robust visual counting, an essential skill in many AI applications, including image recognition, autonomous systems, and visual reasoning.
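To make the idea of a verifiable reward concrete for a counting task, here is a minimal sketch of how such a reward might be computed. It is not taken from the R1-V code: the function names `extract_count` and `counting_reward`, the `<answer>...</answer>` output format, and the binary 1.0/0.0 scoring are all assumptions made for illustration.

```python
import re

# Assumption: the model wraps its final answer in <answer>...</answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_count(completion: str):
    """Return the integer in the last <answer> tag, or None if it cannot be parsed."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    digits = re.search(r"-?\d+", matches[-1])
    return int(digits.group()) if digits else None


def counting_reward(completion: str, ground_truth: int) -> float:
    """Hypothetical binary reward: 1.0 if the parsed count equals the label, else 0.0."""
    predicted = extract_count(completion)
    return 1.0 if predicted == ground_truth else 0.0
```

Because the reward depends only on whether the final count matches a known label, it can be checked automatically, with no learned reward model. Under these assumptions, a completion ending in `<answer>5</answer>` scores 1.0 against a label of 5, while a completion with no answer tag scores 0.0.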
A highlight of R1-V is its training efficiency. Despite using a relatively small model with only 2 billion parameters, R1-V outperforms a significantly larger 72-billion-parameter model on OOD tests. This shows that model size is not the only determinant of performance; training methodology and reinforcement learning strategy are crucial to improving a model's capabilities.
R1-V was trained on eight A100 GPUs for 30 minutes, at a total computational cost of only $2.62. This cost-effectiveness makes it an attractive alternative for researchers and developers who want high performance without extensive computational resources. R1-V also stands out for its reliance on a curated training set. The model was trained on CLEVR-70k and R1-distilled visual reasoning datasets, specifically designed to encourage visual reasoning and robust decision-making. Using these datasets helps the model develop a deeper understanding of visual relationships and logical reasoning rather than simply learning to recognize the patterns of one specific dataset.
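The training budget and model sizes quoted above can be put in perspective with a little arithmetic. The sketch below uses only the article's figures (eight GPUs, 30 minutes, $2.62, 2B vs. 72B parameters). The per-GPU-hour rate it derives is an inference, not a number reported by the source, and it assumes the $2.62 covers GPU rental only.

```python
# Figures quoted in the article.
NUM_GPUS = 8
TRAINING_HOURS = 0.5        # 30 minutes
TOTAL_COST_USD = 2.62
SMALL_MODEL_PARAMS_B = 2    # R1-V model size, in billions of parameters
LARGE_MODEL_PARAMS_B = 72   # comparison model size, in billions of parameters


def gpu_hours(num_gpus: int, hours: float) -> float:
    """Total GPU-hours consumed by a run."""
    return num_gpus * hours


def implied_rate(total_cost: float, total_gpu_hours: float) -> float:
    """Implied price per GPU-hour (derived here, not stated in the article)."""
    return total_cost / total_gpu_hours


total_gpu_hours = gpu_hours(NUM_GPUS, TRAINING_HOURS)           # 8 * 0.5 = 4.0
rate_usd = implied_rate(TOTAL_COST_USD, total_gpu_hours)        # 2.62 / 4.0
size_ratio = LARGE_MODEL_PARAMS_B / SMALL_MODEL_PARAMS_B        # 72 / 2 = 36.0

print(f"GPU-hours: {total_gpu_hours}")
print(f"Implied rate: ${rate_usd:.3f} per GPU-hour")
print(f"Parameter ratio: {size_ratio:.0f}x")
```

The run amounts to 4 GPU-hours, and the comparison model is 36 times larger by parameter count, which is what makes the reported OOD result notable.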
In conclusion, the development of R1-V supports open-source AI research by making its code, model weights, datasets, and training scripts publicly available. This allows the research community to refine and improve vision-language modeling. R1-V's reinforcement learning approach enables rapid learning of patterns and structures in data, leading to high performance at minimal computational cost. This challenges the assumption that extensive training and massive datasets are necessary for state-of-the-art performance. Instead, efficient training methodologies can reduce computational demands while maintaining or exceeding traditional results.
Check out the GitHub page. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Sana brings a fresh perspective to the intersection of AI and real-life solutions.