Voice assistants help users make phone calls, send messages, create events, navigate, and do much more. However, assistants have a limited ability to understand their users' context. In this work, we take a step toward bridging this gap. Our work dives into a new experience that lets users refer to phone numbers, addresses, email addresses, URLs, and dates shown on their phone screens. Our focus lies in reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, a setting akin to visual grounding. We collect a dataset and propose a lightweight general-purpose model for this novel experience. Because consuming pixels directly is computationally expensive, our system is designed to rely on text extracted from the user interface. Our model is modular, offering flexibility, improved interpretability, and efficient runtime memory utilization.
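To make the setting concrete, the sketch below is a minimal, hypothetical illustration of the pixel-free setup: it detects the supported entity types in text extracted from the UI and resolves an ordinal spoken reference ("call the second number") when several similar candidates appear on screen. All function names and regex patterns here are our own simplifying assumptions for illustration, not the paper's model.

```python
import re

# Illustrative detectors for the entity types named in the abstract.
# These regexes are deliberately simplified assumptions, not the
# production-grade understanding the paper's model performs.
ENTITY_PATTERNS = {
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "url": re.compile(r"https?://\S+"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}


def extract_entities(screen_text):
    """Find every candidate entity in the text extracted from the UI."""
    return {kind: pat.findall(screen_text) for kind, pat in ENTITY_PATTERNS.items()}


def resolve_reference(query, screen_text):
    """Toy resolver: pick which of several similar on-screen texts a
    spoken query refers to -- the ambiguity the abstract highlights."""
    entities = extract_entities(screen_text)
    kind = next(
        (k for k in entities if k in query or (k == "phone" and "number" in query)),
        None,
    )
    if kind is None or not entities[kind]:
        return None
    index = next((i for word, i in ORDINALS.items() if word in query), 0)
    candidates = entities[kind]
    return candidates[index] if -len(candidates) <= index < len(candidates) else None


screen = "Office: +1 415 555 0100\nMobile: +1 415 555 0199\nMail: help@example.com"
print(resolve_reference("call the second number", screen))  # -> +1 415 555 0199
```

A pixel-based alternative would have to run vision models over screenshots; operating on already-extracted UI text, as in this toy pipeline, keeps the system lightweight, which matches the design constraint stated above.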