Multi-channel voice activation detection based on transform-average-concatenation

This paper was accepted into the HSCMA workshop at ICASSP 2024.

Voice Activation (VT) allows users to activate their devices by simply speaking an activation phrase. A front-end system is typically used to perform voice enhancement and/or separation, and produces multiple enhanced and/or separated signals. Since conventional VT systems take audio from only a single channel as input, channel selection is performed. A drawback of this approach is that unselected channels are discarded, even if they might contain useful information for VT. In this work, we propose multichannel acoustic models for VT, where the multichannel output of the front end is fed directly to a VT model. We adopt a transform-average-concatenation (TAC) block and modify it by incorporating the channel from conventional channel selection so that the model can attend to a target speaker when multiple speakers are present. The proposed approach achieves up to 30% reduction in false rejection rate compared to the reference channel selection approach.
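For readers unfamiliar with TAC, the following is a minimal sketch of the three steps the name refers to, applied to a single frame of multichannel features. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, the ReLU nonlinearity, the random weights, and all function names (`init_layer`, `linear_relu`, `make_params`, `tac_block`) are hypothetical. In particular, the abstract does not say how the selected channel is incorporated; concatenating its transformed features alongside the cross-channel average is only one plausible reading.

```python
import random


def init_layer(rng, out_dim, in_dim):
    """Random weights and zero bias for one fully connected layer."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear_relu(x, layer):
    """Compute ReLU(W x + b) for a single feature vector x."""
    weight, bias = layer
    if len(x) != len(weight[0]):
        raise ValueError("input size does not match layer size")
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def make_params(rng, feat_dim, hidden_dim, use_selected):
    # Assumption: with a selected channel, the concatenation has three parts, not two.
    concat_in = hidden_dim * (3 if use_selected else 2)
    return {
        "transform": init_layer(rng, hidden_dim, feat_dim),
        "average": init_layer(rng, hidden_dim, hidden_dim),
        "concat": init_layer(rng, feat_dim, concat_in),
    }


def tac_block(channels, params, selected=None):
    """channels: list of per-channel feature vectors for one frame."""
    # Transform: the same layer is applied to every channel independently.
    transformed = [linear_relu(x, params["transform"]) for x in channels]

    # Average: mean over channels, followed by a second shared layer.
    hidden_dim = len(transformed[0])
    mean = [sum(t[i] for t in transformed) / len(transformed) for i in range(hidden_dim)]
    averaged = linear_relu(mean, params["average"])

    outputs = []
    for t in transformed:
        # Concatenate: each channel sees its own features plus the global average.
        parts = t + averaged
        if selected is not None:
            # Hypothetical modification: also append the selected channel's features.
            parts = parts + transformed[selected]
        outputs.append(linear_relu(parts, params["concat"]))
    return outputs


rng = random.Random(0)
# Toy input: 3 front-end output channels, 4 features each.
frame = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
params = make_params(rng, feat_dim=4, hidden_dim=8, use_selected=True)
out = tac_block(frame, params, selected=0)
print(len(out), len(out[0]))  # one output vector per input channel
```

Without the `selected` argument, every channel is treated identically, so reordering the input channels only reorders the outputs. Passing the index of the channel chosen by conventional channel selection breaks that symmetry, which is the kind of cue a model would need to favor one speaker over another.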
