Computational understanding of user interfaces (UIs) is a key step toward achieving intelligent UI behaviors. Previously, we investigated various UI modeling tasks, including widget captioning, screen summarization, and command grounding, which address diverse interaction scenarios such as automation and accessibility. We also demonstrated how machine learning can help UX practitioners improve UI quality by diagnosing tappability confusion and providing insights for improving UI design. This work, along with that of others in the field, has shown how deep neural networks can potentially transform end-user experiences and interaction design practice.
With these successes in tackling individual UI tasks, a natural question is whether we can obtain a foundational understanding of UIs that benefits specific UI tasks. As a first attempt to answer this question, we developed a multi-task model to address a range of UI tasks simultaneously. Although this work made some progress, a few challenges remain. Previous UI models rely heavily on UI view hierarchies — that is, the structure or metadata of a mobile UI screen, analogous to the Document Object Model of a web page — which allow a model to directly acquire detailed information about UI objects on the screen (for example, their types, text content, and positions). This metadata has given earlier models an advantage over their vision-only counterparts. However, view hierarchies are not always accessible and are often corrupted, with missing object descriptions or misaligned structural information. As a result, despite the short-term gains from using view hierarchies, they can ultimately hinder model performance and applicability. Additionally, previous models had to handle heterogeneous information across datasets and UI tasks, often resulting in complex model architectures that were difficult to scale or generalize across tasks.
In “Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus”, accepted for publication at ICLR 2023, we present a vision-only approach that aims to achieve general UI understanding entirely from raw pixels. We introduce a unified approach to represent diverse UI tasks, whose information can be universally captured by two core modalities: vision and language. The vision modality captures what a person would see on a UI screen, and the language modality can be natural language or any token sequence related to the task. We demonstrate that Spotlight substantially improves accuracy on a range of UI tasks, including widget captioning, screen summarization, command grounding, and tappability prediction.
The Spotlight model.
The Spotlight model input is a tuple of three elements: the screenshot, the region of interest on the screen, and a text description of the task. The output is a text description or response about the region of interest. This simple input/output representation is expressive enough to capture various UI tasks and enables scalable model architectures. The model design allows a spectrum of learning strategies and setups, from task-specific fine-tuning, to multi-task learning, to few-shot learning. The Spotlight model, as illustrated in the figure above, leverages existing architecture building blocks such as ViT and T5 that are pre-trained in the high-resource, general vision-language domain, which allows us to build on the success of these general-domain models.
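To make this input/output representation concrete, below is a minimal sketch (with hypothetical names and toy values, not the actual Spotlight code) of how different UI tasks can be expressed as the same screenshot/region/prompt tuple.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class SpotlightExample:
    """One example in the unified representation: every UI task reduces to the
    same three inputs (screenshot, focus region, task prompt) and one text output."""
    screenshot: np.ndarray                             # raw pixels, e.g. (height, width, 3)
    focus_region: Tuple[float, float, float, float]    # (left, top, right, bottom), normalized
    task_prompt: str                                    # natural language or task tokens
    target_text: str = ""                               # text response (empty at inference time)


# Placeholder screenshot standing in for a real mobile UI screen.
screen = np.zeros((1280, 720, 3), dtype=np.uint8)

# Widget captioning focuses on a single object...
widget_captioning = SpotlightExample(
    screenshot=screen,
    focus_region=(0.05, 0.32, 0.15, 0.38),
    task_prompt="caption the widget",
    target_text="select Chelsea team")

# ...while screen summarization uses the full screen as the region of interest.
screen_summarization = SpotlightExample(
    screenshot=screen,
    focus_region=(0.0, 0.0, 1.0, 1.0),
    task_prompt="summarize the screen",
    target_text="page showing tutorial for a learning app")
```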
Because UI tasks often concern a specific object or area on the screen, which requires the model to focus on that object or area of interest, we introduce a Focus Region Extractor into the vision-language model that lets the model concentrate on the region in light of the screen context.
In particular, we designed a Region Summarizer that acquires a latent representation of a screen region based on ViT encodings, using attention queries generated from the bounding box of the region (see the paper for more details). Specifically, each coordinate of the bounding box (a scalar value, i.e., left, top, right, or bottom), indicated as a yellow box on the screenshot, is first embedded via a multilayer perceptron (MLP) into a collection of dense vectors, and then fed into a Transformer model along with its coordinate-type embedding. The dense vectors and their corresponding coordinate-type embeddings are color-coded to indicate their affiliation with each coordinate value. These coordinate queries then attend to screen encodings output by ViT via cross attention, and the final attention output of the Transformer is used as the region representation for downstream decoding by T5.
A target region on the screen is summarized using its bounding box to query the screen encodings from ViT via attention mechanisms.
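To illustrate the mechanism described above, here is a simplified, hedged sketch of a Region Summarizer in PyTorch. The dimensions, number of queries per coordinate, and class names are our own assumptions for illustration and are not the authors' implementation; see the paper for the actual design.

```python
import torch
import torch.nn as nn


class RegionSummarizer(nn.Module):
    """Simplified sketch: turn a bounding box into a latent region representation
    by cross-attending to ViT screen encodings (not the authors' exact code)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_queries_per_coord: int = 4):
        super().__init__()
        self.n_queries_per_coord = n_queries_per_coord
        # MLP that maps one scalar coordinate to a collection of dense query vectors.
        self.coord_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model * n_queries_per_coord))
        # One learned embedding per coordinate type: left, top, right, bottom.
        self.coord_type_emb = nn.Embedding(4, d_model)
        # Cross attention: coordinate queries attend to the ViT screen encodings.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, bbox: torch.Tensor, screen_encodings: torch.Tensor) -> torch.Tensor:
        # bbox: (batch, 4) normalized coordinates; screen_encodings: (batch, n_patches, d_model)
        b = bbox.shape[0]
        queries = self.coord_mlp(bbox.unsqueeze(-1))              # (b, 4, d_model * q)
        queries = queries.view(b, 4, self.n_queries_per_coord, -1)
        type_emb = self.coord_type_emb(torch.arange(4, device=bbox.device))
        queries = queries + type_emb[None, :, None, :]            # tag queries with coordinate type
        queries = queries.flatten(1, 2)                           # (b, 4 * q, d_model)
        region, _ = self.cross_attn(queries, screen_encodings, screen_encodings)
        return region                                             # passed to T5 for decoding


# Usage with dummy tensors standing in for real ViT outputs.
summarizer = RegionSummarizer()
bbox = torch.tensor([[0.05, 0.32, 0.15, 0.38]])   # left, top, right, bottom
vit_out = torch.randn(1, 196, 512)                 # e.g. a 14x14 patch grid
region_repr = summarizer(bbox, vit_out)            # (1, 16, 512)
```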
Results
We pre-trained the Spotlight model using two unlabeled datasets (an internal dataset based on the C4 corpus and an internal mobile dataset) with 2.5 million mobile UI screens and 80 million web pages. We then separately fine-tuned the pre-trained model for each of the four downstream tasks (captioning, summarization, grounding, and tappability). For the widget captioning and screen summarization tasks, we report CIDEr scores, which measure how similar the model's text descriptions are to a set of references created by human raters. For command grounding, we report accuracy, which measures the percentage of times the model successfully locates a target object in response to a user command. For tappability prediction, we report F1 scores, which measure the model's ability to tell tappable objects from non-tappable ones.
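For reference, the snippet below sketches how the grounding accuracy and tappability F1 metrics are computed on toy predictions (CIDEr requires reference texts and a dedicated implementation, so it is omitted); the labels and names are purely illustrative.

```python
from sklearn.metrics import f1_score

# Command grounding: accuracy = fraction of commands for which the predicted
# object matches the target object.
predicted_objects = ["send_button", "search_bar", "menu_icon"]
target_objects    = ["send_button", "search_bar", "back_button"]
grounding_accuracy = sum(p == t for p, t in zip(predicted_objects, target_objects)) / len(target_objects)

# Tappability: F1 over binary tappable (1) / not tappable (0) labels.
predicted_tappable = [1, 0, 1, 1, 0]
actual_tappable    = [1, 0, 0, 1, 0]
tappability_f1 = f1_score(actual_tappable, predicted_tappable)

print(f"grounding accuracy = {grounding_accuracy:.2f}, tappability F1 = {tappability_f1:.2f}")
```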
In this experiment, we compare Spotlight against several baseline models. Widget Caption uses the view hierarchy and the image of each UI object to generate a text description for the object. Similarly, Screen2Words uses the view hierarchy and the screenshot, as well as auxiliary features (e.g., app descriptions), to generate a summary for the screen. In the same vein, VUT combines screenshots and view hierarchies to perform multiple tasks. Finally, the original Tappability model leverages object metadata from the view hierarchy and the screenshot to predict the tappability of objects. Taperception, a follow-up model to Tappability, uses a vision-only tappability prediction approach. We examine two variants of Spotlight with respect to the size of its ViT building block, B/16 and L/16. Spotlight drastically outperformed the previous state of the art across all four UI modeling tasks.
| Model | | Captioning | Summarization | Grounding | Tappability |
|---|---|---|---|---|---|
| Baselines | Widget Caption | 97 | – | – | – |
| | Screen2Words | – | 61.3 | – | – |
| | VUT | 99.3 | 65.6 | 82.1 | – |
| | Taperception | – | – | – | 85.5 |
| | Tappability | – | – | – | 87.9 |
| Spotlight | B/16 | 136.6 | 103.5 | 95.7 | 86.9 |
| | L/16 | 141.8 | 106.7 | 95.8 | 88.4 |
We then examine a more challenging setup, where we ask the model to learn multiple tasks simultaneously, because a multi-task model can substantially reduce the model footprint. As shown in the table below, the experiments show that our model still performs competitively.
| Model | Captioning | Summarization | Grounding | Tappability |
|---|---|---|---|---|
| VUT multi-task | 99.3 | 65.1 | 80.8 | – |
| Spotlight B/16 | 140 | 102.7 | 90.8 | 89.4 |
| Spotlight L/16 | 141.3 | 99.2 | 94.2 | 89.5 |
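The unified input/output representation is what makes this multi-task setup straightforward: the four tasks differ only in their prompts and focus regions, so training batches can simply mix examples across tasks. Below is a minimal, hypothetical sketch of such task mixing; it is not the authors' training code, and the helper names and prompts are assumptions.

```python
import random

# Each task contributes examples in the same (screenshot, region, prompt) -> text
# format, so multi-task training reduces to choosing which task each example
# in a batch comes from.
TASK_PROMPTS = {
    "captioning":    "caption the widget",
    "summarization": "summarize the screen",
    "grounding":     "find the object for the command",
    "tappability":   "is the object tappable",
}


def sample_multitask_batch(datasets, batch_size=32):
    """Mix examples from all tasks into one batch (uniform task sampling for simplicity).

    `datasets` is assumed to map a task name to a list of pre-built examples,
    e.g. the SpotlightExample tuples sketched earlier.
    """
    batch = []
    for _ in range(batch_size):
        task = random.choice(list(datasets))
        example = random.choice(datasets[task])
        batch.append((task, example))
    return batch
```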
To understand how the Region Summarizer enables Spotlight to focus on a target region, as well as relevant areas elsewhere on the screen, we analyze the attention weights (which indicate where the model's attention is on the screenshot) for both the widget captioning and screen summarization tasks. In the figure below, for the widget captioning task, the model predicts "select Chelsea team" for the checkbox on the left side, highlighted with a red bounding box. We can see from its attention heat map (which illustrates the distribution of attention weights) on the right that the model learns to attend not only to the target region of the checkbox, but also to the text "Chelsea" on the far left to generate the caption. For the screen summarization example, the model predicts "page showing tutorial for a learning app" given the screenshot on the left. In this example, the target region is the entire screen, and the model learns to attend to the parts of the screen that matter for the summary.
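Attention heat maps like those described above can be produced by averaging cross-attention weights over the region queries and reshaping them onto the ViT patch grid. The sketch below is a hedged illustration that assumes access to those weights (the model is assumed to expose them) and a 14×14 patch grid; names and dummy data are our own.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_attention_heatmap(screenshot, attn_weights, patch_grid=(14, 14)):
    """Overlay averaged cross-attention weights on the screenshot.

    attn_weights: array of shape (n_queries, n_patches) taken from the
    Region Summarizer's cross attention (assumed to be exposed by the model).
    """
    heat = attn_weights.mean(axis=0).reshape(patch_grid)          # average over region queries
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
    plt.imshow(screenshot)
    plt.imshow(heat, cmap="hot", alpha=0.5,
               extent=(0, screenshot.shape[1], screenshot.shape[0], 0))  # stretch to image size
    plt.axis("off")
    plt.show()


# Example with dummy data standing in for a real screenshot and attention weights.
dummy_screen = np.zeros((1280, 720, 3), dtype=np.uint8)
dummy_attn = np.random.rand(16, 196)   # 16 region queries attending to a 14x14 patch grid
plot_attention_heatmap(dummy_screen, dummy_attn)
```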
Conclusion
We demonstrate that Spotlight outperforms previous methods that use both screenshots and view hierarchies as input, and establishes state-of-the-art results on multiple representative UI tasks. These tasks range from accessibility and automation to interaction design and evaluation. Our vision-only approach to mobile UI understanding removes the need for view hierarchies, allows the architecture to scale easily, and benefits from the success of large vision-language models pre-trained for the general domain. Compared to recent large vision-language model efforts such as Flamingo and PaLI, Spotlight is relatively small, and our experiments show a trend that larger models yield better performance. Spotlight can easily be applied to more UI tasks and potentially advance the fronts of many interaction and user experience tasks.
Acknowledgements
We thank Mandar Joshi and Tao Li for their help in processing the web pre-training dataset, and Chin-Yi Cheng and Forrest Huang for their feedback on the post. Thanks to Tom Small for his help in creating the animated figures in this post.