Recent advances in multimodal large language models (MLLMs) have been noteworthy; however, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this article, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Since UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, text) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on its original aspect ratio (i.e., a horizontal split for portrait screens and a vertical split for landscape screens). Both sub-images are encoded separately before being sent to the LLM. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, text finding, and widget listing. These samples are formatted for instruction following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the ability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark covering all the aforementioned tasks. Ferret-UI not only excels beyond most open-source UI MLLMs, but also surpasses GPT-4V on all elementary UI tasks.
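
To make the aspect-ratio-based splitting concrete, the sketch below shows one minimal way such a preprocessing step could look. It is not the paper's implementation; the function name `split_screen` and the use of PIL are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' code): split a UI screenshot into two
# sub-images along its longer dimension, as described for the "any resolution" input.
from typing import Tuple

from PIL import Image


def split_screen(screen: Image.Image) -> Tuple[Image.Image, Image.Image]:
    """Return two sub-images covering the screenshot.

    Portrait screens (height >= width) are split horizontally into top/bottom
    halves; landscape screens are split vertically into left/right halves.
    Each half would then be encoded separately before being passed to the LLM.
    """
    w, h = screen.size
    if h >= w:
        # Portrait: horizontal split -> top and bottom halves.
        top = screen.crop((0, 0, w, h // 2))
        bottom = screen.crop((0, h // 2, w, h))
        return top, bottom
    # Landscape: vertical split -> left and right halves.
    left = screen.crop((0, 0, w // 2, h))
    right = screen.crop((w // 2, 0, w, h))
    return left, right


# Hypothetical usage:
# sub_a, sub_b = split_screen(Image.open("screenshot.png"))
```
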