Supervised fine-tuning (SFT) is the standard training paradigm for large language models (LLMs) and graphical user interface (GUI) agents. However, SFT requires high-quality labeled datasets, resulting in prolonged training cycles and high computational costs. This heavy data dependence creates bottlenecks in AI development workflows. In addition, existing VLM-based GUI agents trained with SFT show performance deficits in out-of-domain scenarios, severely limiting their practical utility in diverse real-world applications. Rule-based reinforcement learning (RL), or reinforcement fine-tuning (RFT), is a promising alternative that requires only dozens to thousands of samples instead of massive datasets.
Several approaches have been developed to advance GUI agents and optimize their training. The AppAgent series and Mobile-Agent integrate commercial models such as GPT for planning and prediction tasks, but they depend heavily on prompt engineering and multi-agent collaboration, which require careful manual design for optimal performance. Consequently, researchers have fine-tuned smaller open-source MLLMs on GUI-specific datasets to create specialized agents. Rule-based RL has emerged as an efficient alternative to traditional training paradigms: it uses reward functions based on predefined rules that focus on final outcomes while allowing models to learn reasoning processes organically. The technique is effective even with smaller models and extends to multimodal models through visual task-specific rewards.
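To make the idea of a rule-based reward concrete, here is a minimal sketch (not the authors' code) of how a predefined rule can score a predicted GUI action against the ground truth; the action schema, field names, and reward values are illustrative assumptions:

```python
# Illustrative rule-based reward for GUI action prediction.
# The dict schema ({"action": ..., "coordinate": ..., "bbox": ...})
# and the +1.0 weights are assumptions for the sketch.

def rule_based_reward(pred: dict, gold: dict) -> float:
    """Score a predicted GUI action against the ground-truth action."""
    reward = 0.0
    # Rule 1: action-type reward for matching the ground-truth action type.
    if pred.get("action") == gold.get("action"):
        reward += 1.0
        # Rule 2: grounding reward for clicks — the predicted point must
        # land inside the ground-truth element's bounding box.
        if gold.get("action") == "click" and "bbox" in gold:
            x, y = pred.get("coordinate", (-1, -1))
            x1, y1, x2, y2 = gold["bbox"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                reward += 1.0
    return reward
```

Because the reward checks only the final outcome, the model is free to produce any intermediate reasoning that leads to a correct action.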
Researchers from vivo AI Lab and MMLab @ CUHK have proposed UI-R1 to enhance the reasoning capabilities of multimodal LLMs for GUI action prediction tasks through DeepSeek-R1-style RL. The researchers present the first exploration of how rule-based RL can improve MLLM reasoning for graphical user interface action prediction. A small but high-quality dataset is curated with 136 challenging tasks spanning five common mobile-device action types. Model optimization is enabled through policy-based algorithms, specifically Group Relative Policy Optimization (GRPO), by introducing a unified rule-based action reward. This approach has shown strong effectiveness on both in-domain and out-of-domain tasks, with significant improvements in action-type accuracy and grounding accuracy over the Qwen2.5-VL-3B base model.
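GRPO's defining step is computing advantages relative to a group of sampled responses rather than via a learned value network. A minimal sketch of that group-relative normalization, under standard GRPO conventions (the epsilon term and group statistics are conventional, not taken from the paper's implementation):

```python
# Sketch of GRPO's group-relative advantage: sample several responses per
# prompt, score each with the rule-based reward, and normalize rewards
# within the group. No critic/value network is required.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its group's mean and std deviation.

    Above-average responses get positive advantage (reinforced);
    below-average ones get negative advantage (discouraged).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

These advantages then weight the policy-gradient update for each sampled response, which is what lets a rule-based scalar reward train the model without any labeled reasoning traces.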
The system's grounding capabilities are evaluated on two specialized benchmarks: ScreenSpot, which assesses GUI grounding across mobile, desktop, and web platforms, and ScreenSpot-Pro, which focuses on high-resolution professional environments with expert-annotated tasks covering 23 applications, five industries, and three operating systems. In addition, the model undergoes single-step action prediction tests on low-level instructions using a selected AndroidControl subset, which introduces a wider range of action types beyond the ScreenSpot benchmarks. The research methodology also explores the critical relationship between training data size and model performance, comparing random sampling against difficulty-based selection when choosing training data.
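The random-versus-difficulty comparison can be sketched as follows; the difficulty score here (e.g., a base model's failure rate per task) is a stand-in assumption, not the paper's exact criterion:

```python
# Illustrative data-selection strategies for curating a small RL training
# set. The "difficulty" scores are assumed to come from some measure such
# as the base model's per-task failure rate.
import random

def select_training_data(tasks: list, difficulty: dict, k: int,
                         strategy: str = "difficulty") -> list:
    """Pick k tasks either uniformly at random or by descending difficulty."""
    if strategy == "random":
        return random.sample(tasks, k)
    # Keep the k hardest tasks so each of the few RL samples carries
    # more learning signal.
    return sorted(tasks, key=lambda t: difficulty[t], reverse=True)[:k]
```

Under this framing, difficulty-based selection concentrates the tiny training budget (136 samples in UI-R1) on examples the base model cannot already solve.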
UI-R1 improves the grounding capability of the 3B model by 20% on ScreenSpot and 6% on ScreenSpot-Pro, surpassing most 7B models on both benchmarks. UI-R1 achieves performance comparable to state-of-the-art 7B models such as AGUVIS and OS-Atlas, even though those models are trained with SFT on much larger labeled datasets. Compared directly to the zero-shot Qwen2.5-VL model, UI-R1 shows a 15% improvement in action-type prediction accuracy and a 20% improvement in click-element grounding accuracy using only 136 training data points. The research also reveals that although model performance improves as training data grows, the relationship gradually saturates, and the difficulty-based selection method consistently outperforms random selection.
In conclusion, the researchers introduced the UI-R1 framework, which successfully extends rule-based RL to GUI action prediction tasks, providing a scalable and efficient alternative to traditional SFT. It uses a novel reward function that simultaneously evaluates both the action type and its arguments, effectively reducing task complexity while improving learning efficiency. Despite using only 130+ training samples from the mobile domain, UI-R1 achieves remarkable performance, showing strong generalization when applied to out-of-domain datasets on desktop and web platforms. UI-R1's exceptional adaptability, data efficiency, and effectiveness at handling specialized tasks establish a promising direction for the development of multimodal GUI agents.
Check out the Paper. All credit for this research goes to the researchers of this project.

Sajad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.