Systems that can follow instructions through graphical user interfaces (GUIs) can automate laborious work, improve accessibility, and expand the utility of digital assistants by connecting users with tools and services.
Many GUI-based digital agents rely on HTML-derived textual representations of the interface, which are not always available. People use GUIs by perceiving visual output and acting on it with standard mouse and keyboard inputs; they do not need to inspect an application’s source code to understand how it works. Regardless of the underlying technology, they can quickly pick up new programs with intuitive graphical user interfaces.
The Atari game system is a well-known example of how effective a system that learns purely from pixel inputs can be. However, learning from pixel-only inputs paired with generic low-level actions raises many challenges for GUI-based instruction-following tasks. To visually interpret a GUI, an agent must understand the interface’s structure, recognize and interpret natural language rendered on screen, identify visual elements, and predict how those elements work and how to interact with them.
Google DeepMind and Google introduce PIX2ACT, a model that takes pixel-based screenshots as input and selects actions corresponding to basic mouse and keyboard controls. For the first time, the research group demonstrates that an agent with only pixel inputs and a generic action space can outperform human crowdworkers, achieving performance on par with state-of-the-art agents that use DOM information and a comparable number of human demonstrations.
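To make the idea of a "generic action space" concrete, here is a minimal sketch (not the authors' code; the command names and coordinate format are illustrative assumptions) of how a pixel-only agent's text output could be parsed into basic mouse and keyboard actions:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    kind: str                    # e.g. "click", "double_click", "key", "type"
    x: Optional[int] = None      # screen coordinates for mouse actions
    y: Optional[int] = None
    text: Optional[str] = None   # keys or characters for keyboard actions


def parse_action(command: str) -> Action:
    """Parse a text command such as 'click 32 104' or 'type hello' into an Action."""
    kind, _, rest = command.strip().partition(" ")
    if kind in ("click", "double_click"):
        x, y = rest.split()
        return Action(kind=kind, x=int(x), y=int(y))
    if kind in ("key", "type"):
        return Action(kind=kind, text=rest)
    raise ValueError(f"Unknown action: {command!r}")


if __name__ == "__main__":
    print(parse_action("click 32 104"))
    print(parse_action("type blue shirt"))
```

Because the same handful of commands works for any on-screen interface, the agent does not need application-specific APIs or access to the page's DOM.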
To do this, the researchers build on PIX2STRUCT, a Transformer-based image-to-text model already pretrained on large-scale web data to convert screenshots into structured HTML-based representations. PIX2ACT applies tree search to repeatedly construct new expert trajectories for training, combining human demonstrations with interactions in the environment.
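The combination of human demonstrations, environment interaction, and tree search resembles an expert-iteration loop. The sketch below illustrates that control flow only; the `train` and `tree_search` callables, the round counts, and the success criterion are placeholders, not the paper's implementation:

```python
from typing import Any, Callable, List, Tuple

# A trajectory is a sequence of (screenshot, text action) pairs.
Trajectory = List[Tuple[Any, str]]


def improve_policy(
    train: Callable[[List[Trajectory]], None],
    tree_search: Callable[[], Tuple[Trajectory, float]],
    human_demos: List[Trajectory],
    num_rounds: int = 3,
    episodes_per_round: int = 100,
) -> List[Trajectory]:
    """Expert-iteration-style loop: search for successful trajectories with the
    current model, add them to the training set, and retrain."""
    dataset = list(human_demos)                  # start from human demonstrations
    for _ in range(num_rounds):
        for _ in range(episodes_per_round):
            trajectory, reward = tree_search()   # search guided by the current model
            if reward > 0:                       # keep only successful episodes
                dataset.append(trajectory)
        train(dataset)                           # retrain on demos + search-derived data
    return dataset
```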
The team builds a general framework of browser-based environments and adapts two benchmark datasets, MiniWob++ and WebShop, to this framework using a common cross-domain observation and action format. Compared with the best prior pixel-only baseline (CC-Net without DOM), PIX2ACT improves MiniWob++ task scores by roughly a factor of four. Ablations demonstrate that PIX2STRUCT’s pixel-based pretraining is essential to PIX2ACT’s performance.
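As a rough illustration of what a shared observation and action format across the two benchmarks might look like, here is a hypothetical episode schema; the field names and values are assumptions for illustration, not the paper's actual data format:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    screenshot_png: bytes   # raw pixels of the browser viewport (the only observation)
    action: str             # generic command, e.g. "click 120 48" or "type blue shirt"


@dataclass
class Episode:
    benchmark: str          # "miniwob++" or "webshop"
    instruction: str        # natural-language goal, also rendered inside the screenshot
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0     # final task score reported by the environment
```

With a shared record like this, episodes from either benchmark can feed the same training pipeline without benchmark-specific observation or action handling.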
The findings demonstrate the effectiveness of PIX2STRUCT’s screenshot-parsing pretraining for GUI-based instruction following from pixel-based inputs. In the behavioral cloning setting, pretraining improves MiniWob++ and WebShop task scores by 17.1 and 46.7, respectively. Although a performance gap remains relative to larger language models that use HTML-based inputs and task-specific actions, this work establishes the first baseline in this setting.
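For context on the "behavioral cloning setting," the sketch below shows a generic cloning update in PyTorch: the model is trained to reproduce the demonstrated action tokens given only the screenshot. The `model` interface, tokenization, and tensor shapes are assumptions for illustration, not the released training code:

```python
import torch.nn.functional as F


def behavioral_cloning_step(model, optimizer, screenshot, target_tokens):
    """One behavioral cloning update: predict the demonstrated action tokens
    from the screenshot and minimize token-level cross-entropy.

    `model` is assumed to return logits of shape (seq_len, vocab_size) given
    the screenshot and the target-token prefix; `target_tokens` is a 1-D
    LongTensor of action-token ids.
    """
    logits = model(screenshot, target_tokens[:-1])      # teacher forcing on the prefix
    loss = F.cross_entropy(logits, target_tokens[1:])   # shift-by-one token targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```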
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.