* All authors listed contributed equally to this work.
Successfully managing context is essential to any dialogue comprehension task. This context can be conversational (based on previous user queries or system responses), visual (based on what the user sees, for example, on their screen), or background (based on cues such as the sound of an alarm or music playback). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a natural language understanding system, responsible for handling conversational, visual and background context. In particular, we introduce different machine learning models to handle contextual queries: one that performs reference resolution and another that handles context via query rewriting. We also describe how these models complement each other to form a unified, coherent, and lightweight system that can understand context while preserving user privacy.
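The abstract describes two complementary models, a reference resolver and a query rewriter, operating over conversational, visual, and background context. As a rough, hypothetical illustration only (not the paper's implementation), the Python sketch below shows one way such components might be dispatched within an NLU pipeline; all names (Context, resolve_references, rewrite_query, understand) are assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical context container: conversational turns, on-screen entities,
# and background signals (e.g., an active alarm or media playback).
@dataclass
class Context:
    conversation: List[str] = field(default_factory=list)
    screen_entities: List[str] = field(default_factory=list)
    background_signals: List[str] = field(default_factory=list)

def resolve_references(query: str, context: Context) -> Optional[str]:
    """Stand-in for a reference-resolution model: link mentions such as
    'it' or 'that' to entities available in the context, if any apply."""
    mentions = {"it", "that", "this", "them"}
    if mentions & set(query.lower().split()) and context.screen_entities:
        # A real model would return the query with resolved entity links.
        return f"{query} [resolved to: {context.screen_entities[-1]}]"
    return None

def rewrite_query(query: str, context: Context) -> str:
    """Stand-in for a query-rewriting model: fold the previous
    conversational turn into a self-contained query."""
    if context.conversation:
        return f"{query} (in the context of: {context.conversation[-1]})"
    return query

def understand(query: str, context: Context) -> str:
    """Toy dispatcher: attempt reference resolution first, otherwise
    rewrite the query into a context-free form for downstream NLU."""
    resolved = resolve_references(query, context)
    return resolved if resolved is not None else rewrite_query(query, context)

if __name__ == "__main__":
    ctx = Context(conversation=["What's the weather in Paris?"],
                  screen_entities=["Eiffel Tower photo"])
    print(understand("share it", ctx))
    print(understand("and tomorrow?", Context(conversation=ctx.conversation)))
```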