*=Equal contributors
This paper was accepted at the Efficient Natural Language and Speech Processing workshop at NeurIPS 2023.
Interactions with virtual assistants typically begin with a predefined trigger phrase followed by the user's command. To make interactions with the assistant more natural, we explore whether it is feasible to remove the requirement that users begin each command with a trigger phrase. We address this task by combining decoder signals from an automatic speech recognition (ASR) system with acoustic and lexical representations as input features to a large language model (LLM). We are interested in data- and resource-efficient systems that require only a small amount of training data and can potentially run on devices such as smartphones. For this reason, our model is fine-tuned on a small amount of multimodal data using low-rank adaptation. We compare the proposed system with unimodal models that rely solely on lexical or acoustic information. We analyze the effectiveness of our method by fine-tuning decoder-only LLMs with 3 billion to 13 billion parameters on training data consisting of 10 thousand to 80 thousand utterances. We show that our best multimodal system outperforms the unimodal baselines while using only a fraction of the training data.
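To make the setup concrete, the sketch below illustrates one plausible way to combine acoustic representations with lexical ASR output as input to a decoder-only LLM fine-tuned with low-rank adaptation. It is a minimal illustration, not the authors' implementation: the base model name, the audio feature dimensionality, the projection layer, and the way decoder signals are rendered as text are all assumptions for demonstration purposes.

```python
# Minimal sketch (assumptions, not the paper's code): acoustic features are projected
# into the LLM embedding space, prepended to token embeddings of the ASR 1-best
# hypothesis plus decoder signals, and the LLM is adapted with LoRA.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # placeholder; the paper uses decoder-only LLMs with 3B-13B parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adaptation: only small adapter matrices are trained, keeping the LLM frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_cfg)

embed_dim = llm.get_input_embeddings().embedding_dim
audio_dim = 512  # assumed dimensionality of the acoustic representation
audio_proj = nn.Linear(audio_dim, embed_dim)  # maps audio features into the LLM embedding space

# Lexical input: ASR 1-best hypothesis and decoder signals rendered as text (illustrative format).
text = "asr: what's the weather tomorrow | confidence: 0.92"
tokens = tokenizer(text, return_tensors="pt")
token_embeds = llm.get_input_embeddings()(tokens["input_ids"])

# Acoustic input: a short sequence of audio-encoder frames (random placeholder here).
audio_feats = torch.randn(1, 4, audio_dim)
audio_embeds = audio_proj(audio_feats)

# Prepend the projected audio embeddings to the token embeddings and run the LLM.
inputs_embeds = torch.cat([audio_embeds, token_embeds], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
out = llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
print(out.logits.shape)  # (1, seq_len, vocab_size); a decision head for trigger-free detection would sit on top
```

In this arrangement only the LoRA adapters and the small projection layer carry trainable parameters, which is consistent with the data- and resource-efficiency goal stated above; the exact fusion and decision mechanism in the paper may differ.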