Spoken term detection (STD) is a critical area of speech processing that enables the identification of specific phrases or terms in large audio archives. The technology is widely used in voice-based search, transcription services, and multimedia indexing. By facilitating the retrieval of spoken content, STD improves the accessibility and usability of audio data, especially in domains such as podcasts, conferences, and broadcast media.
A major challenge in spoken term detection is handling out-of-vocabulary (OOV) terms while keeping computational costs manageable. Traditional methods often rely on automatic speech recognition (ASR) systems, which are resource-intensive and error-prone, particularly on short audio segments or under variable acoustic conditions. These methods also struggle to accurately segment continuous speech, making it difficult to identify specific terms without context.
Existing approaches to STD include ASR-based techniques that use phoneme or grapheme networks, as well as dynamic time warping (DTW) and acoustic word embeddings for direct audio comparison (a minimal DTW sketch follows below). While these methods have their advantages, they are limited by speaker variability, computational inefficiency, and the difficulty of processing large datasets. Current tools also struggle to generalize across datasets, especially for terms not seen during training.
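For context, here is a minimal NumPy sketch of the DTW-based matching that these traditional approaches rely on. The Euclidean frame cost and feature shapes are illustrative assumptions, not taken from any specific system:

```python
import numpy as np

def dtw_distance(query, segment):
    """Dynamic time warping distance between two feature sequences.

    query, segment: arrays of shape (num_frames, feature_dim),
    e.g. MFCC frames. Returns the accumulated alignment cost.
    """
    n, m = len(query), len(segment)
    # Pairwise frame distances (Euclidean here; cosine is also common).
    cost = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=-1)

    # Accumulated-cost matrix with the standard step pattern.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[n, m]
```

Detection then amounts to sliding the query over each utterance and thresholding the length-normalized DTW distance; the O(n·m) cost of every comparison is exactly what makes this approach hard to scale.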
Researchers from the Indian Institute of Technology Kanpur and imec – Ghent University have introduced BEST-STD, a novel speech tokenization framework. The approach encodes speech into discrete, speaker-independent semantic tokens, enabling efficient retrieval with text-based search algorithms. By incorporating a bidirectional Mamba encoder, the framework generates highly consistent token sequences across different utterances of the same term. The method eliminates the need for explicit segmentation and handles OOV terms more effectively than previous systems.
The BEST-STD system uses a bidirectional Mamba encoder, which processes the audio input both forward and backward to capture long-range dependencies. Each encoder layer projects the audio into high-dimensional embeddings, which a vector quantizer then discretizes into token sequences. The model is trained in a self-supervised fashion, using dynamic time warping to align different utterances of the same term and construct anchor-positive pairs at the frame level. Tokenized sequences are stored in an inverted index, enabling efficient retrieval through token-similarity comparison. Through training, the system learns to produce consistent token representations that are invariant to speaker and acoustic variation.
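To make the discretization step concrete, here is a minimal sketch of nearest-neighbor vector quantization over encoder embeddings. The codebook size (256 entries) and embedding dimension (64) are illustrative assumptions; the paper's actual quantizer design and sizes may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative codebook: 256 entries over 64-dimensional embeddings.
# In BEST-STD the codebook would be learned, not random.
codebook = rng.normal(size=(256, 64))

def tokenize(frame_embeddings):
    """Map each encoder frame embedding to its nearest codebook index.

    frame_embeddings: (num_frames, 64) stand-in for the bidirectional
    encoder's output. Returns a discrete token per frame.
    """
    # Squared Euclidean distance from every frame to every codebook entry.
    dists = ((frame_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Stand-in for real encoder output; after training, two utterances of
# the same term should yield near-identical token sequences regardless
# of speaker.
utterance = rng.normal(size=(120, 64))
print(tokenize(utterance)[:10])
```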
The BEST-STD framework demonstrated superior performance in evaluations on the LibriSpeech and TIMIT datasets. Compared to traditional STD methods and state-of-the-art tokenization models such as HuBERT, WavLM, and SpeechTokenizer, BEST-STD achieved significantly higher Jaccard similarity scores for token consistency, with unigram scores reaching 0.84 and bigram scores reaching 0.78. The system also outperformed the baselines on spoken content retrieval, measured by mean average precision (MAP) and mean reciprocal rank (MRR). For in-vocabulary terms, BEST-STD achieved a MAP of 0.86 and an MRR of 0.91 on LibriSpeech, while for OOV terms the scores reached 0.84 and 0.90, respectively. These results underscore the system's ability to generalize across term types and datasets.
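To show what the reported consistency metric measures, here is how a unigram/bigram Jaccard score between two tokenizations of the same term can be computed. The token sequences below are invented for illustration:

```python
def jaccard(seq_a, seq_b, n=1):
    """Jaccard similarity between the n-gram sets of two token sequences."""
    grams_a = {tuple(seq_a[i:i + n]) for i in range(len(seq_a) - n + 1)}
    grams_b = {tuple(seq_b[i:i + n]) for i in range(len(seq_b) - n + 1)}
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Hypothetical token sequences for the same term from two speakers:
a = [17, 17, 342, 342, 901, 88, 88, 5]
b = [17, 342, 342, 901, 901, 88, 5]

print(jaccard(a, b, n=1))  # unigram consistency -> 1.0
print(jaccard(a, b, n=2))  # bigram consistency  -> 0.625 (order matters)
```

Bigram scores are the stricter test, since they penalize tokenizations that contain the same symbols in a different order.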
The BEST-STD framework also excelled in retrieval speed and efficiency, benefiting from the inverted index over tokenized sequences. This design reduces the reliance on computationally intensive DTW-based matching, making the system scalable to large datasets. The bidirectional Mamba encoder, in particular, proved more effective than transformer-based architectures thanks to its ability to model the fine-grained temporal information that is critical for spoken term detection.
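As a sketch of how an inverted index over token n-grams can replace pairwise DTW: each utterance is indexed once, and a query then retrieves candidates by n-gram overlap with roughly constant cost per query gram. The index layout and toy data below are assumptions for illustration; the paper's exact scheme may differ:

```python
from collections import defaultdict

def build_index(tokenized_utterances, n=2):
    """Map each token n-gram to the set of utterance IDs containing it."""
    index = defaultdict(set)
    for utt_id, tokens in tokenized_utterances.items():
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(utt_id)
    return index

def search(index, query_tokens, n=2):
    """Rank utterances by how many query n-grams they share."""
    hits = defaultdict(int)
    for i in range(len(query_tokens) - n + 1):
        for utt_id in index.get(tuple(query_tokens[i:i + n]), ()):
            hits[utt_id] += 1
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus of already-tokenized utterances (IDs and tokens invented):
corpus = {"utt1": [4, 9, 9, 2, 7], "utt2": [1, 4, 9, 2], "utt3": [3, 3, 8]}
index = build_index(corpus)
print(search(index, [4, 9, 2]))  # utt1 and utt2 match; utt3 does not
```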
In conclusion, BEST-STD marks a significant advance in spoken term detection. By addressing the limitations of traditional methods, it offers a robust and efficient solution for audio retrieval tasks. The use of speaker-independent tokens and a bidirectional Mamba encoder not only improves performance but also ensures adaptability to diverse datasets. The framework holds promise for real-world applications, paving the way for better accessibility and searchability of audio content.
Check out the Paper. All credit for this research goes to the researchers of this project.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advances and creating opportunities to contribute.