This paper was accepted at the Adaptive Foundation Models (AFM) workshop at NeurIPS 2024.
Follow-up conversations with virtual assistants (VAs) enable a user to interact with a VA seamlessly, without the need to repeatedly invoke it with a keyword after the first query. Accurate device-directed speech detection (DDSD) on follow-up queries is therefore critical to a naturalistic user experience. To this end, we explore the use of large language models (LLMs) and model the first query when making inferences about the follow-ups (based on the ASR-decoded text), either by prompting a pretrained LLM or by adapting a binary classifier on top of the LLM. In doing so, we also exploit ASR uncertainty when designing the LLM prompts. We show on a real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to modeling the follow-ups alone.
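To make the prompt-design idea concrete, the sketch below shows one plausible way to build a DDSD prompt that jointly encodes the first query and the follow-up, representing ASR uncertainty as n-best hypotheses with confidence scores. This is a minimal illustration under those assumptions, not the paper's implementation; the function name `build_ddsd_prompt`, the prompt wording, and the input format are all hypothetical.

```python
# Hypothetical sketch: formatting a DDSD classification prompt that
# conditions on the first query and exposes ASR uncertainty via
# n-best hypotheses with confidence scores.

def build_ddsd_prompt(first_query_nbest, followup_nbest):
    """Format a prompt from two ASR n-best lists.

    Each n-best list is a sequence of (hypothesis_text, confidence)
    pairs, so the LLM sees both the decoded text and how uncertain
    the ASR system is about it.
    """
    def render(nbest):
        # One line per hypothesis, ranked, with its confidence score.
        return "\n".join(
            f"  {i + 1}. ({conf:.2f}) {text}"
            for i, (text, conf) in enumerate(nbest)
        )

    return (
        "First query (ASR n-best hypotheses with confidences):\n"
        f"{render(first_query_nbest)}\n"
        "Follow-up query (ASR n-best hypotheses with confidences):\n"
        f"{render(followup_nbest)}\n"
        "Is the follow-up query directed at the virtual assistant? "
        "Answer Yes or No."
    )


if __name__ == "__main__":
    # Toy example inputs; real n-best lists would come from the ASR decoder.
    first = [("set a timer for ten minutes", 0.92),
             ("set a time for ten minutes", 0.05)]
    followup = [("make it fifteen instead", 0.81),
                ("may get fifteen instead", 0.12)]
    print(build_ddsd_prompt(first, followup))
```

The resulting prompt string can either be sent to a pretrained LLM for zero-shot classification or serve as the input text when adapting a binary classifier on top of the LLM, the two modeling options described above.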