In recent years, large language models (LLMs) have become the talk of the town. These models process, produce, and use natural language text to drive innovative AI applications. LLMs such as GPT-3, T5, and PaLM have delivered impressive performance, learning to read, complete code, summarize, and generate text in ways that begin to mimic humans. GPT-3, developed by OpenAI, is built on the Transformer architecture for processing text, resulting in a model that can readily produce content and answer questions much like a human would.
Researchers have long studied how natural language can be used to communicate with computing devices. More recently, LLMs have shown that such interaction may be possible without task-specific models or large data sets. With that in mind, a group of researchers has published a paper exploring the feasibility of using a single large language model to enable conversational interaction with a mobile graphical user interface (GUI). Previous efforts at conversational interaction with mobile user interfaces (UIs) covered only a few components and required task-specific models, massive data sets, and considerable training effort; little progress had been made in applying LLMs to GUI interaction tasks. The researchers now show how LLMs can carry out a variety of interactions with mobile UIs, devising prompting techniques to adapt an LLM to a mobile user interface.
The team developed prompting methods that let interaction designers quickly prototype and test novel language interactions with users. This allows LLMs to change how conversational interaction designs are prototyped and developed, saving considerable time, effort, and money compared with building dedicated models and data sets. The researchers also designed an algorithm that converts the Android view hierarchy data of a screen into HTML syntax. Since HTML is already well represented in LLM training data, this representation lets LLMs be adapted to mobile user interfaces.
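To make the view-hierarchy-to-HTML idea concrete, here is a minimal sketch of how such a conversion might look. The node structure, the class-to-tag mapping, and the attribute names are assumptions for illustration only; they are not the authors' actual algorithm.

```python
# Sketch: convert an Android-style view hierarchy (nested dicts) into an
# HTML-like string an LLM is likely to have seen during training.
# The CLASS_TO_TAG mapping below is a hypothetical example.

CLASS_TO_TAG = {
    "TextView": "p",
    "Button": "button",
    "EditText": "input",
    "ImageView": "img",
    "CheckBox": "input",
}

def node_to_html(node: dict) -> str:
    """Recursively turn one view-hierarchy node into an HTML string."""
    cls = node.get("class", "").split(".")[-1]   # "android.widget.Button" -> "Button"
    tag = CLASS_TO_TAG.get(cls, "div")           # unknown widgets fall back to <div>
    text = node.get("text", "")
    res_id = node.get("resource_id", "")
    children = "".join(node_to_html(c) for c in node.get("children", []))
    return f'<{tag} id="{res_id}">{text}{children}</{tag}>'

if __name__ == "__main__":
    # Toy example: a screen with a label and a button.
    screen = {
        "class": "android.widget.LinearLayout",
        "children": [
            {"class": "android.widget.TextView", "text": "Set an alarm"},
            {"class": "android.widget.Button", "text": "OK", "resource_id": "btn_ok"},
        ],
    }
    print(node_to_html(screen))
    # -> <div id=""><p id="">Set an alarm</p><button id="btn_ok">OK</button></div>
```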
The researchers experimented with four modeling tasks to verify the feasibility of their approach: Screen Question Generation, Screen Summarization, Screen Question Answering, and Mapping Instruction to UI Action. The results show that their approach achieves competitive performance using only two exemplars per task (a sketch of how such a two-shot prompt might be assembled follows the list below).
- Screen Question Generation: By leveraging the UI context around input fields, LLMs outperformed previous approaches at generating questions for the user.
- Screen Summarization: Compared to the benchmark model (Screen2Words, UIST '21), the study found that LLMs can effectively summarize the essential functionalities of a mobile user interface and produce more accurate summaries.
- Screen Question Answering: Compared to an off-the-shelf QA model that answers 36% of the questions correctly, the 2-shot LLM produced exact-match answers for 66.7% of the questions.
- Mapping Instruction to UI Action: Given an instruction, the LLM predicts the UI object needed to perform the action. The model did not surpass the benchmark model, but it achieved a strong result with only two shots.
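The sketch below shows how a two-shot prompt for one of these tasks (screen question generation) might be assembled from exemplars plus the target screen. The exemplar screens, wording, and instruction text are illustrative assumptions, not the exact prompts used in the paper.

```python
# Sketch: build a 2-shot prompt for screen question generation.
# Each exemplar pairs an HTML-like screen representation with the
# question the model should produce; the target screen is appended last.

EXEMPLARS = [
    ('<input id="departure">Departure city</input>',
     "What city are you departing from?"),
    ('<input id="dob">Date of birth</input>',
     "What is your date of birth?"),
]

def build_prompt(target_screen_html: str) -> str:
    """Concatenate the two exemplars and the target screen into one text prompt."""
    parts = ["Generate a question asking the user to fill in the input field on each screen.\n"]
    for screen, question in EXEMPLARS:
        parts.append(f"Screen: {screen}\nQuestion: {question}\n")
    parts.append(f"Screen: {target_screen_html}\nQuestion:")
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_prompt('<input id="destination">Destination city</input>')
    print(prompt)  # this text would be sent to the LLM for completion
```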
Enabling interaction between natural language and computing devices has been a long-standing pursuit in human-computer interaction. Studies like this one bring that goal within reach and could mark a breakthrough in Artificial Intelligence.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.