Designing GUI agents that perform human-like tasks in graphical user interfaces faces a critical hurdle: collecting high-quality trajectory data for training. Existing methods rely either on expensive, time-consuming human supervision or on synthetic data generation, which rarely reflects the diversity and dynamics of the real world. These constraints limit the scalability and effectiveness of GUI agents and prevent them from acting autonomously and adapting to diverse, dynamic environments.
Traditional data acquisition for GUI agents is generally task-driven. Human annotation is labor-intensive, requiring people to design tasks and then annotate the resulting trajectories. Synthetic data reduces this dependence on humans, but it relies on predefined high-level tasks, which restrict the scope and scale of the data. Errors in intermediate steps, or objectives that conflict within a task, produce inconsistent trajectories and degrade training quality. Together, these constraints limit agents' ability to generalize to dynamic or unfamiliar environments.
Researchers from the Shanghai AI Laboratory, the University of Hong Kong, Johns Hopkins University, Shanghai Jiao Tong University, Oxford University, and the Hong Kong University of Science and Technology propose OS-Genesis, an innovative strategy that addresses these challenges through interaction-driven reverse task synthesis. Instead of starting from predefined tasks, the GUI agent explores its environment by clicking, scrolling, and typing on GUI elements. In a retrospective analysis, these interactions are transformed into low-level instructions and then contextualized into high-level tasks. Data quality is maintained by a Trajectory Reward Model (TRM), which scores synthesized trajectories along dimensions of coherence, logical flow, and completeness. With this approach, even partial but meaningful trajectories can be used for training. By bridging the gap between abstract instructions and the dynamic nature of GUIs, the framework significantly improves the quality and diversity of training data while eliminating the need for human supervision.
The OS-Genesis process consists of several integral components. First, the system autonomously explores dynamic elements of the GUI, recording transitions between pre- and post-action states to collect the raw material for task synthesis. These transitions are then transformed into detailed low-level instructions with the help of models such as GPT-4o. The instructions are in turn composed into comprehensive high-level objectives that reflect plausible user intent, adding semantic depth. Finally, the synthesized trajectories are evaluated by the Trajectory Reward Model, which uses a graded scoring scheme emphasizing logical consistency and effective task completion. This ensures diverse, high-quality data and provides a solid foundation for training.
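To make the pipeline concrete, here is a minimal Python sketch of the reverse-task-synthesis loop described above. All function names and the reward formula are illustrative assumptions, not the authors' implementation: the two `synthesize_*` functions are stand-ins for calls to a vision-language model such as GPT-4o, and the reward is a toy proxy for the Trajectory Reward Model's graded scoring.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One recorded GUI interaction: screen state before and after an action."""
    pre_state: str
    action: str
    post_state: str

def synthesize_low_level_instruction(t: Transition) -> str:
    # Stand-in for an LLM (e.g., GPT-4o) that describes the observed change
    # as a low-level instruction.
    return f"{t.action} to move from '{t.pre_state}' to '{t.post_state}'"

def synthesize_high_level_task(steps: list[str]) -> str:
    # Stand-in for contextualizing low-level steps into a user-level goal.
    return f"Complete a user task spanning {len(steps)} steps"

def trajectory_reward(steps: list[str], completed: bool) -> float:
    # Toy graded reward: a coherence proxy plus a completion bonus, so
    # partial-but-meaningful trajectories still earn a nonzero score
    # (mirroring the TRM's idea, not its actual formula).
    coherence = 1.0 if steps else 0.0
    completion = 1.0 if completed else 0.5
    return (coherence + completion) / 2

def reverse_task_synthesis(transitions: list[Transition], completed: bool = True) -> dict:
    # Retrospective pass: interactions -> low-level instructions -> high-level task,
    # then score the resulting trajectory.
    steps = [synthesize_low_level_instruction(t) for t in transitions]
    return {
        "task": synthesize_high_level_task(steps),
        "steps": steps,
        "reward": trajectory_reward(steps, completed),
    }

demo = [
    Transition("home screen", "click 'Settings'", "settings menu"),
    Transition("settings menu", "toggle 'Wi-Fi'", "Wi-Fi enabled"),
]
result = reverse_task_synthesis(demo, completed=True)
print(result["task"], result["reward"])
```

The key design point this sketch captures is the direction of synthesis: tasks are derived *after* exploration rather than fixed in advance, and the graded reward lets incomplete trajectories contribute to training instead of being discarded.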
Extensive experiments were conducted on benchmarks such as AndroidWorld and WebArena, which mimic complex, dynamic environments. Vision-language models, specifically Qwen2-VL and InternVL2, served as the base models for training. The training focused on improving both high-level task planning and accurate execution of low-level actions, enabling GUI agents to acquire deeper skills.
OS-Genesis was validated across a variety of benchmarks. On AndroidWorld, success rates were nearly double those of task-driven methods, reflecting improved task planning and execution. On AndroidControl, the method performed strongly both at high-level autonomous planning and at low-level step-by-step execution, including on out-of-distribution examples, demonstrating robustness. On WebArena, the approach consistently outperformed traditional baselines, gaining ground in handling complex, interactive environments. Together, these results demonstrate the ability of OS-Genesis to generate high-quality trajectories of all types, greatly improving the overall effectiveness of GUI agents across settings.
OS-Genesis is a significant step forward in training GUI agents, as it overcomes the limitations of current data collection methods. Its interaction-driven methodology and reward-based evaluation ensure diverse, high-quality training data that bridges the gap between abstract task instructions and dynamic GUI environments. By enabling GUI agents to learn and adapt autonomously, this approach paves the way for substantial progress in digital automation and AI research.
Check out the Paper, GitHub, and Project Page. All credit for this research goes to the researchers of this project.
Aswin AK is a consulting intern at MarkTechPost. He is pursuing a dual degree at the Indian Institute of Technology Kharagpur. He is passionate about data science and machine learning, and brings a strong academic background and hands-on experience solving real-life interdisciplinary challenges.