Large Language Models (LLMs) and Vision Language Models (VLMs) have revolutionized the automation of mobile device control through natural language commands, offering solutions for complex user tasks. The conventional approach, "step-by-step GUI agents," operates by querying the LLM at each GUI state for dynamic decision making and reflection, continuously processing the user task and observing the GUI state until completion. However, this method faces significant challenges because it relies heavily on powerful cloud-based models such as GPT-4 and Claude. This raises critical concerns about the privacy and security risks of sharing personal GUI pages, substantial user-side traffic consumption, and high server-side centralized serving costs, making large-scale deployment of GUI agents problematic.
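To make the cost of this design concrete, below is a minimal sketch of the conventional per-step query loop. The callables passed in (`observe`, `decide`, `act`) are hypothetical stand-ins for screen capture, the cloud LLM, and the action executor, not part of any specific framework:

```python
# Minimal sketch of the conventional step-by-step loop: one cloud LLM query
# per GUI state. The callables (observe, decide, act) are hypothetical
# placeholders for device capture, the remote LLM, and the GUI executor.

def run_step_by_step_agent(task, observe, decide, act, max_steps=20):
    history = []
    for _ in range(max_steps):
        state = observe()                      # capture current screen / UI tree
        action = decide(task, state, history)  # one remote LLM round-trip per step
        if action == "DONE":
            return True                        # model reports task completion
        act(action)                            # tap, type, scroll, etc.
        history.append((state, action))
    return False                               # gave up after max_steps
```

Every iteration of this loop is a separate round-trip to the cloud model, which is where the traffic, latency, and privacy costs accumulate.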
Previous attempts to automate mobile tasks relied heavily on template-based assistants like Siri, Google Assistant, and Cortana, which used predefined templates to process user input. More advanced GUI-based automation emerged to handle complex tasks without relying on third-party APIs or extensive programming. While some researchers focused on improving small language models (SLMs) through GUI-specific training and exploration-based knowledge acquisition, these approaches faced significant limitations. Script-based GUI agents struggled in particular with the dynamic nature of mobile applications, where states and UI elements change frequently, making it difficult to extract knowledge and execute scripts reliably.
Researchers from the AI Industry Research Institute (AIR) at Tsinghua University proposed AutoDroid-V2 to investigate how to build a powerful GUI agent based on the coding capabilities of SLMs. Unlike traditional step-by-step GUI agents that make decisions one action at a time, AutoDroid-V2 uses a script-based approach that generates and executes multi-step scripts based on user instructions (a sketch of such a script appears after the list below). Additionally, it addresses two critical limitations of conventional approaches:
- Efficiency: The agent generates a single script covering a series of GUI actions to complete the user's task, significantly reducing query frequency and resource consumption.
- Capability: Script-based GUI agents rely on the coding capability of SLMs, the effectiveness of which has been demonstrated by numerous existing studies on lightweight coding assistants.
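To illustrate the idea, the kind of multi-step script an on-device model might emit could look like the following runnable sketch. The task and the action primitives (`open_app`, `tap`, `set_text`) are hypothetical stand-ins, not AutoDroid-V2's actual API:

```python
# Hypothetical example of a multi-step script an on-device SLM might generate
# for a task such as "Create a weekly team meeting in the calendar app".
# The primitives below are illustrative stubs, not AutoDroid-V2's real API.

def open_app(name): print(f"open {name}")
def tap(element): print(f"tap {element}")
def set_text(field, value): print(f"type '{value}' into {field}")

# The generated script: one upfront plan instead of per-step LLM queries.
open_app("Calendar")
tap("Create event")
set_text("Title", "Team meeting")
tap("Repeat: Weekly")
tap("Save")
```

The whole task is planned in one generation pass, so the model is queried once rather than at every intermediate GUI state.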
The architecture of AutoDroid-V2 consists of two distinct stages: offline and online processing. In the offline stage, the system constructs an application document by comprehensively analyzing the application's GUI exploration history. This document serves as the foundation for script generation, incorporating AI-guided GUI state compression, automatic XPath element generation, and GUI dependency analysis to ensure both conciseness and accuracy. During the online stage, when a user submits a task request, the custom local LLM generates a multi-step script, which is then executed by a domain-specific interpreter designed to handle runtime execution reliably and efficiently.
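Below is a rough sketch of how such a domain-specific interpreter might parse and execute a generated script, including a simple retry when a target element has not yet appeared. The class, method names, and retry policy are assumptions for illustration, not the paper's implementation:

```python
# Rough sketch of a domain-specific script interpreter (assumed design,
# not AutoDroid-V2's actual code): it parses simple action statements and
# maps them onto device operations, retrying when the target element is
# not yet on screen because the GUI is still loading.

import re
import time

class ScriptInterpreter:
    def __init__(self, device):
        self.device = device  # object exposing find_element / tap / type_text

    def run(self, script: str):
        for line in script.strip().splitlines():
            match = re.match(r'(\w+)\("([^"]*)"(?:,\s*"([^"]*)")?\)', line.strip())
            if not match:
                continue  # skip comments and blank lines
            op, target, value = match.groups()
            self._execute(op, target, value)

    def _execute(self, op, target, value, retries=3):
        for _ in range(retries):
            element = self.device.find_element(target)
            if element is not None:
                if op == "tap":
                    self.device.tap(element)
                elif op == "set_text":
                    self.device.type_text(element, value)
                return
            time.sleep(1.0)  # wait for the GUI to settle, then retry
        raise RuntimeError(f"Element '{target}' not found after {retries} retries")
```

Handling element lookup and retries inside the interpreter, rather than in the model, is one way a script-based agent can stay robust to the dynamic GUI states that troubled earlier script-based approaches.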
The performance of AutoDroid-V2 is evaluated across two benchmarks, testing 226 tasks on 23 mobile applications against leading baselines including AutoDroid, SeeClick, CogAgent, and Mind2Web. It shows significant improvements, achieving a 10.5% to 51.7% higher task completion rate while reducing computational demands, with 43.5x and 5.8x reductions in input and output token consumption, respectively, and 5.7x to 13.4x lower LLM inference latency compared to the baselines. Tests with different LLMs (Llama3.2-3B, Qwen2.5-7B, and Llama3.1-8B) show that AutoDroid-V2 delivers consistent performance, with success rates ranging from 44.6% to 54.4% and a stable reversed redundancy ratio between 90.5% and 93.0%.
In conclusion, the researchers introduced AutoDroid-V2, which represents a significant advance in mobile task automation through its script-based, document-driven approach using on-device SLMs. Experimental results demonstrate that this methodology substantially improves the efficiency and performance of GUI agents, achieving accuracy comparable to cloud-based solutions while maintaining on-device privacy and security. Despite these achievements, the system faces limitations when dealing with applications that lack structured text representations of their GUIs, such as Unity-based and Web-based applications. However, this challenge could be addressed by integrating VLMs to recover structured GUI representations from visual features.
Check out the Paper. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.