Large language models (LLMs) have shown promise in powering autonomous agents that control computer interfaces to perform human tasks. However, without fine-tuning on human-collected task demonstrations, the performance of these agents remains relatively poor. A key challenge lies in developing viable approaches to building real-world computer control agents that can effectively execute complex tasks across diverse applications and environments. Current methodologies, which rely on pretrained LLMs without task-specific adaptation, have achieved only limited success, with reported task success rates ranging from 12% to 46% in recent studies.
Previous attempts to develop computer control agents have explored various approaches, including zero-shot and few-shot prompting of large language models as well as fine-tuning. Zero-shot prompting methods use pretrained LLMs without any task-specific adaptation, while few-shot approaches provide a small number of examples to the LLM. Fine-tuning methods further train the LLM on task demonstrations, either end-to-end or for specific capabilities, such as identifying interactable user interface elements. Notable examples include SeeAct, WebGPT, WebAgent, and Synapse. However, these existing methods are limited in performance, domain generalization, or the complexity of tasks they can effectively handle.
Google DeepMind and Google researchers present AndroidControl, a large-scale dataset of 15,283 human demonstrations of tasks performed in Android applications. A key feature of AndroidControl is that it provides both high- and low-level human-generated instructions for each task, allowing investigation of the levels of task complexity that models can handle while also offering more complete supervision during training. Additionally, it is the most diverse UI control dataset to date, comprising 15,283 unique tasks across 833 different Android apps. This diversity enables the generation of multiple test splits that measure performance both in and out of the task domain covered by the training data. The proposed method uses AndroidControl to quantify how fine-tuning performance scales when applied to low- and high-level tasks, both in-domain and out-of-domain, and compares fine-tuned models against various zero-shot and few-shot baselines.
The AndroidControl dataset was collected over a year through crowdsourcing. Crowdworkers were given generic feature descriptions for apps in 40 different categories and asked to turn them into specific tasks involving apps of their choice. This approach yielded 15,283 task demonstrations spanning 833 Android apps, including both popular apps and less popular or regional ones. For each task, annotators first provided a high-level natural language description. They then performed the task on a physical Android device, with their actions and associated screenshots captured. Importantly, the annotators also provided a low-level natural language description of each action before executing it. The resulting dataset contains high- and low-level instructions for every task, enabling analysis of different levels of task complexity. Careful dataset splits were created to measure both in-domain and out-of-domain performance.
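The episode structure described above can be sketched as a simple data model. This is only an illustration under assumed field names (`high_level_instruction`, `low_level_instruction`, and so on); it is not the dataset's actual schema or serialization format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    low_level_instruction: str   # annotator's description of this single action
    action: dict                 # device action, e.g. {"type": "click", "x": 540, "y": 120}
    screenshot_path: str         # screen capture taken before the action executes

@dataclass
class Episode:
    high_level_instruction: str  # the overall task goal
    app_name: str
    steps: List[Step] = field(default_factory=list)

# A hypothetical two-step episode:
episode = Episode(
    high_level_instruction="Add milk to my shopping list",
    app_name="com.example.shopping",
    steps=[
        Step("Tap the 'Add item' button",
             {"type": "click", "x": 980, "y": 1700}, "step_0.png"),
        Step("Type 'milk' into the text field",
             {"type": "input_text", "text": "milk"}, "step_1.png"),
    ],
)
```

Low-level evaluation conditions the model on each `low_level_instruction` in turn, while high-level evaluation provides only the `high_level_instruction` and requires the model to plan the steps itself.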
The results show that for in-domain evaluation on the IDD subset, the LoRA-tuned models outperform the zero-shot and few-shot methods when trained with sufficient data, despite using the smaller PaLM 2S model. Even with only 5 training episodes (LT-5), LoRA tuning outperforms all untuned models on low-level instructions; for high-level instructions, 1,000 episodes are required. The best LoRA-tuned model achieves 71.5% accuracy on high-level instructions and 86.6% on low-level instructions. Among the zero-shot methods, AitW with PaLM 2L performs best (56.7%) on low-level instructions, while M3A with GPT-4 performs best (42.1%) on high-level instructions, likely benefiting from its incorporation of high-level reasoning. Surprisingly, few-shot performance is largely inferior to zero-shot performance across the board. These results highlight the strong benefits of in-domain fine-tuning, especially as more training data is obtained.
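LoRA tuning, referenced throughout these results, freezes the pretrained weights and trains only a low-rank update. A minimal sketch of the idea follows; the dimensions, rank, and scaling factor are illustrative, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8   # rank r << min(d_out, d_in)

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight (not trained)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, initialized to zero

def lora_forward(x):
    # Base path plus scaled low-rank adaptation path: (W + (alpha/r) * B @ A) @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapted model initially matches the frozen model.
assert np.allclose(lora_forward(x), W @ x)
```

The appeal is parameter efficiency: here only `r * (d_in + d_out)` values are trained instead of the full `d_out * d_in` matrix, which makes repeated tuning runs across data scales far cheaper.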
This work introduced AndroidControl, a large and diverse dataset designed to study model performance on low- and high-level tasks, both in-domain and out-of-domain, as training data is scaled. Based on the evaluation of LoRA-tuned models on this dataset, the authors estimate that achieving 95% accuracy on in-domain low-level tasks would require around 1 million training episodes, while a 95% episode completion rate on in-domain high-level 5-step tasks would require approximately 2 million episodes. These results suggest that, while potentially costly, fine-tuning may be a viable approach to obtaining high in-domain performance on complex tasks. However, out-of-domain performance requires one to two orders of magnitude more data, indicating that fine-tuning alone may not scale well and that additional approaches may be beneficial, especially for robust performance on high-level out-of-domain tasks.
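Estimates like these typically come from fitting a scaling curve to measured accuracies and extrapolating. A toy illustration of that procedure follows: fit error rate as a power law of training-set size and solve for the size reaching a target. The data points below are invented for illustration and are not the paper's measurements.

```python
import numpy as np

# Hypothetical (episodes, error-rate) measurements; error = 1 - accuracy.
episodes = np.array([5, 100, 1_000, 10_000])
error = np.array([0.40, 0.28, 0.20, 0.14])

# Fit log(error) = log(c) + k * log(n): a straight line in log-log space.
k, log_c = np.polyfit(np.log(episodes), np.log(error), 1)

def episodes_for_error(target_err):
    # Invert error = c * n**k to solve for n at the target error rate.
    return np.exp((np.log(target_err) - log_c) / k)

# Episodes needed for 95% accuracy (5% error) under this toy fit.
needed = episodes_for_error(0.05)
```

The exponent `k` controls how quickly error falls with data; a shallow slope (as the paper reports for out-of-domain splits) pushes the required data up by orders of magnitude.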
Check out the Paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.