Recent advances in multimodal large language models (MLLMs) have been noteworthy; however, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this article, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Since UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, text) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on its original aspect ratio (i.e., a horizontal split for portrait screens and a vertical split for landscape screens). Both sub-images are encoded separately before being sent to the LLM. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, text finding, and widget listing. These samples are formatted for instruction following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the ability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark covering all the aforementioned tasks. Ferret-UI not only excels beyond most open-source UI MLLMs, but also surpasses GPT-4V on all elementary UI tasks.
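
To make the aspect-ratio-based splitting concrete, the sketch below shows one minimal way such a preprocessing step could look. It is not the paper's implementation; the function name `split_screen` and the use of PIL are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' code): split a UI screenshot into two
# sub-images along its longer dimension, as described for the "any resolution" input.
from typing import Tuple

from PIL import Image


def split_screen(screen: Image.Image) -> Tuple[Image.Image, Image.Image]:
    """Return two sub-images covering the screenshot.

    Portrait screens (height >= width) are split horizontally into top/bottom
    halves; landscape screens are split vertically into left/right halves.
    Each half would then be encoded separately before being passed to the LLM.
    """
    w, h = screen.size
    if h >= w:
        # Portrait: horizontal split -> top and bottom halves.
        top = screen.crop((0, 0, w, h // 2))
        bottom = screen.crop((0, h // 2, w, h))
        return top, bottom
    # Landscape: vertical split -> left and right halves.
    left = screen.crop((0, 0, w // 2, h))
    right = screen.crop((w // 2, 0, w, h))
    return left, right


# Hypothetical usage:
# sub_a, sub_b = split_screen(Image.open("screenshot.png"))
```
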