Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advances in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters, computing power, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, "Intent in the Wild" (IIW) and "Intent in the Tame" (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across the two datasets. Notably, UI-JEPA achieves this performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency on the IIW dataset. These results underscore the effectiveness of UI-JEPA and highlight its potential for lightweight, high-performance UI understanding.
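To make the JEPA-style objective described above concrete, the sketch below shows a minimal masked-embedding setup in PyTorch: a context encoder sees UI-frame features with some frames masked out, a (frozen) target encoder sees the full sequence, and a predictor regresses the target embeddings at the masked positions. All module names, dimensions, the masking ratio, and the use of a frozen copy instead of an EMA target are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal JEPA-style masked-embedding sketch (assumed, not the official UI-JEPA code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameEncoder(nn.Module):
    """Encodes a sequence of per-frame UI features into abstract embeddings."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                      # x: (B, T, dim)
        return self.encoder(x)


class Predictor(nn.Module):
    """Predicts target embeddings at masked positions from context embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)


def jepa_loss(context_enc, target_enc, predictor, frames, mask):
    """L2 loss between predicted and stop-gradient target embeddings at masked frames."""
    with torch.no_grad():                      # target branch receives no gradients
        targets = target_enc(frames)           # (B, T, D)
    visible = frames.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked frames
    preds = predictor(context_enc(visible))
    return F.mse_loss(preds[mask], targets[mask])


# Toy training step on random features standing in for unlabeled UI video frames.
B, T, D = 4, 16, 256
context_enc, predictor = FrameEncoder(D), Predictor(D)
target_enc = copy.deepcopy(context_enc)        # in practice an EMA copy; frozen here
frames = torch.randn(B, T, D)
mask = torch.rand(B, T) < 0.5                  # mask roughly half of the frames
loss = jepa_loss(context_enc, target_enc, predictor, frames, mask)
loss.backward()
```

In the full framework, the encoder trained this way would be kept and its pooled embeddings passed to a fine-tuned LLM decoder that generates the user-intent text; that downstream hookup is omitted here for brevity.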