ScreenSpot-Pro: The First Benchmark Driving Multimodal LLMs Towards High-Resolution Professional GUI Agent and Computer Usage Environments

GUI agents face three critical challenges in professional environments: (1) the increased complexity of professional applications compared to general-purpose software, which requires a detailed understanding of complex designs; (2) the higher resolution of professional tools, resulting in smaller target sizes and lower grounding accuracy; and (3) reliance on additional tools and documents, which adds complexity to workflows. These challenges highlight the need for advanced solutions and benchmarks to improve GUI agent performance in these demanding scenarios.

Current GUI grounding models and benchmarks fall short of what professional environments require. Benchmarks such as ScreenSpot are designed for low-resolution tasks and lack the variety needed to simulate real-world scenarios. Models such as OS-Atlas and UGround are computationally inefficient and fail when targets are small or the interface is crowded with icons, both of which are common in professional applications. Furthermore, the absence of multilingual support limits applicability in global workflows. These shortcomings call for more comprehensive and realistic benchmarks.

A team of researchers from the National University of Singapore, East China Normal University, and Hong Kong Baptist University presents ScreenSpot-Pro, a new benchmark designed for high-resolution professional environments. It comprises 1,581 tasks across 23 applications in areas such as development, creative tools, CAD, scientific platforms, and office suites. It features high-resolution full-screen images and expert annotations that ensure accuracy and realism. Instructions are provided in both English and Chinese, which broadens the scope of evaluation. ScreenSpot-Pro is distinctive in that it documents actual workflows, yielding realistic, high-quality annotations, and thus serves as a tool for evaluating and developing GUI grounding models.

The ScreenSpot-Pro dataset captures realistic and challenging scenarios. It is built on high-resolution images in which the target region occupies on average only 0.07% of the total screen area, reflecting how small and subtle the GUI elements are. The data was collected by professional users experienced with the relevant applications, who used specialized tools to ensure accurate annotations. The dataset also includes bilingual (English and Chinese) instructions to test performance in both languages, and it contains several workflows that capture the subtleties of real professional tasks. These features make it well suited to evaluating and improving the accuracy and flexibility of GUI agents.
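To give a sense of scale for the 0.07% figure, the short sketch below computes the share of a screenshot covered by a target bounding box. The function name, the bounding-box layout, and the example icon and screen sizes are illustrative assumptions and are not taken from the dataset.

```python
def target_area_percent(bbox, screen_size):
    """Return the share of the screenshot covered by a target, in percent.

    bbox is (left, top, right, bottom) in pixels; screen_size is (width, height).
    Both layouts are assumptions made for this illustration.
    """
    left, top, right, bottom = bbox
    width, height = screen_size
    target_area = max(0, right - left) * max(0, bottom - top)
    return 100.0 * target_area / (width * height)


# Hypothetical example: a 96x60 px button on a 3840x2160 screenshot.
share = target_area_percent((1000, 500, 1096, 560), (3840, 2160))
print(f"{share:.3f}% of the screen")  # prints "0.069% of the screen"
```

In other words, on a 4K screenshot a target at the reported average would be roughly the size of a small toolbar button.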

Evaluating current GUI grounding models on ScreenSpot-Pro reveals considerable shortcomings in their handling of high-resolution professional environments. The best-performing model, OS-Atlas-7B, reached an accuracy of only 18.9%. Iterative methods, exemplified by ReGround, improved on this, achieving 40.2% accuracy by refining predictions over multiple steps. Small elements such as icons proved especially difficult, and the bilingual tasks further exposed the models' limitations. These findings emphasize the need for techniques that strengthen contextual understanding and robustness in complex GUI settings.
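The sketch below illustrates how such numbers could be produced. It is a rough sketch under stated assumptions, not the authors' implementation. It assumes the common grounding convention that a prediction is correct when the predicted click point falls inside the target bounding box, and it models multi-step refinement as a coarse prediction on the full screen followed by a second prediction on a crop around it. The `predict` callable is a hypothetical stand-in for a grounding model, and all function names here are invented for illustration.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)
Size = Tuple[int, int]  # (width, height)


def is_hit(point: Point, box: Box) -> bool:
    """Assumed criterion: the click point must lie inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions whose point falls inside the matching box."""
    if not boxes:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(points, boxes))
    return hits / len(boxes)


def crop_window(center: Point, screen: Size, size: Size) -> Box:
    """Window of the given size around center, shifted to stay on screen."""
    (cx, cy), (sw, sh) = center, screen
    w, h = min(size[0], sw), min(size[1], sh)
    left = min(max(cx - w / 2, 0), sw - w)
    top = min(max(cy - h / 2, 0), sh - h)
    return (left, top, left + w, top + h)


def two_pass_ground(predict: Callable[[Box], Point], screen: Size, crop_size: Size) -> Point:
    """Coarse prediction on the full screen, then a refined one on a crop.

    predict(region) is hypothetical: it receives a screen region and returns
    a point in that region's local coordinates.
    """
    coarse = predict((0, 0, screen[0], screen[1]))
    window = crop_window(coarse, screen, crop_size)
    local_x, local_y = predict(window)
    # Map the refined point from crop coordinates back to screen coordinates.
    return (window[0] + local_x, window[1] + local_y)
```

The intuition behind the second pass is that a target which is tiny relative to the full screenshot takes up a much larger share of the crop, so a model whose error scales with the size of its input has a better chance of landing inside the target.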

ScreenSpot-Pro sets a new benchmark for evaluating GUI agents in high-resolution professional environments. It addresses the specific challenges of complex workflows, offering a diverse and accurately annotated dataset to guide advances in GUI grounding. This contribution lays the groundwork for more capable and efficient agents that can carry out professional tasks smoothly, with the potential to boost productivity across industries.

All credit for this research goes to the researchers of this project.

Aswin AK is a consulting intern at MarkTechPost. He is pursuing a dual degree at the Indian Institute of Technology, Kharagpur. He is passionate about data science and machine learning, and brings a strong academic background and practical experience in solving real-life interdisciplinary challenges.
