Suppressing unintentional device invocations, caused by speech that sounds like a wake word or by an accidental button press, is critical to a good user experience and is known as False-Trigger Mitigation (FTM). When a device offers multiple invocation options, the traditional FTM approaches are either a separate model per invocation type or a single model for all invocations. Both are suboptimal: the memory cost of the first grows linearly with the number of invocation options, which is prohibitively expensive for on-device deployment and fails to exploit shared training data; the second cannot accurately capture the acoustic differences between invocation types. To this end, we propose a Unified Acoustic Detector (UAD) for FTM when multiple invocation options are available on the device. The UAD is trained in a multi-task learning framework in which a co-trained acoustic encoder is augmented with invocation-specific classification layers. For the FTM task, we show for the first time that by sharing the model architecture across invocation types (thus keeping the model size similar to that of a monolithic model for a single invocation type), we not only match but substantially improve the accuracy on individual invocation types. In particular, for the challenging case of touch invocation, we obtain relative false-positive-rate improvements of 50% and 35% at a 99% true-positive rate, compared to a single-output model covering both invocations and to separate per-invocation models, respectively. Furthermore, we propose streaming and non-streaming variants of the UAD and show that both outperform a traditional ASR-based approach to FTM.
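To make the shared-encoder, multi-head structure concrete, the following is a minimal sketch of such a multi-task detector in PyTorch. The abstract does not specify the encoder architecture or feature front end, so the GRU encoder, the log-mel input, the mean-pooling step, and all names here (UnifiedAcousticDetector, n_mels, heads) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class UnifiedAcousticDetector(nn.Module):
    """Sketch of a shared-encoder, multi-head FTM classifier.

    One acoustic encoder is co-trained for all invocation types; each
    invocation type (e.g. wake word vs. touch) gets its own lightweight
    classification head, so total model size stays close to that of a
    single monolithic model.
    """

    def __init__(self, n_mels=80, hidden=256,
                 invocation_types=("wake_word", "touch")):
        super().__init__()
        # Shared encoder (a unidirectional GRU is assumed here for a
        # streaming-friendly sketch; the paper's encoder may differ).
        self.encoder = nn.GRU(input_size=n_mels, hidden_size=hidden,
                              num_layers=2, batch_first=True)
        # One binary true-trigger/false-trigger head per invocation type.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, 2) for name in invocation_types}
        )

    def forward(self, features, invocation_type):
        # features: (batch, frames, n_mels) log-mel filterbank input.
        encoded, _ = self.encoder(features)
        # Pool over time, then score with the head matching how the
        # device was actually invoked.
        pooled = encoded.mean(dim=1)
        return self.heads[invocation_type](pooled)


# Illustrative usage: score a touch-invoked utterance
# (2 s of audio at an assumed 100 frames/s).
model = UnifiedAcousticDetector()
feats = torch.randn(1, 200, 80)
logits = model(feats, invocation_type="touch")
```

Under this reading, multi-task training would sum a per-head cross-entropy loss, each computed only on data for its own invocation type, so the encoder benefits from the pooled training data of all invocations while each head specializes in its invocation's acoustics.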