A critical challenge in subjective speech quality assessment (SSQA) is enabling models to generalize to diverse, unseen speech domains. SSQA models often perform poorly outside their training domain, mainly because data characteristics and scoring conventions differ widely across SSQA tasks such as text-to-speech (TTS), voice conversion (VC), and speech enhancement. Effective generalization is essential if SSQA models are to stay aligned with human perception across these fields; yet many models remain tied to the data on which they were trained, limiting their real-world usefulness in applications such as automated speech assessment for TTS and VC systems.
Current SSQA approaches fall into reference-based and model-based methods. Reference-based models evaluate quality by comparing a speech sample to a reference, while model-based methods, typically deep neural networks (DNNs), learn directly from human-annotated datasets (a minimal example is sketched after the list below). Model-based SSQA has great potential to capture human perception more faithfully, but it comes with significant limitations:
- Generalization constraints: SSQA models often fail on out-of-domain data, yielding inconsistent performance.
- Dataset bias and corpus effects: Models can overfit the quirks of their training corpus, such as scoring biases or data types, making them less effective on other datasets.
- Computational complexity: Ensembling increases the robustness of SSQA but also raises the computational cost over a single model, making real-time evaluation impractical in low-resource environments.

These limitations collectively hinder the development of SSQA models that generalize well across datasets and application contexts.
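For concreteness, here is a minimal sketch of what a model-based SSQA predictor looks like in PyTorch. It follows the common recipe of pooling self-supervised (SSL) speech features into a regression head; `ssl_encoder` and `feature_dim` are placeholders, and this is an illustration of the approach rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MOSPredictor(nn.Module):
    """Minimal model-based SSQA head: pooled SSL features -> scalar MOS.

    `ssl_encoder` is a placeholder for any self-supervised speech model
    that maps a waveform batch to a (batch, frames, dim) feature tensor.
    """

    def __init__(self, ssl_encoder, feature_dim=768):
        super().__init__()
        self.encoder = ssl_encoder
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, wav):
        feats = self.encoder(wav)             # (batch, frames, dim)
        pooled = feats.mean(dim=1)            # average over time
        return self.head(pooled).squeeze(-1)  # predicted MOS per utterance

# Training then reduces to regression against human MOS labels, e.g.:
# loss = nn.functional.mse_loss(model(wav_batch), human_mos_batch)
```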
To address these limitations, the researchers present MOS-Bench, a benchmark collection comprising seven training datasets and twelve test datasets spanning various speech types, languages, and sampling rates. Alongside MOS-Bench, they propose SHEET, a toolkit that provides a standardized workflow for training, validating, and testing SSQA models. Together, MOS-Bench and SHEET allow SSQA models to be evaluated systematically, with a particular focus on their generalization ability. MOS-Bench adopts a multi-dataset approach, combining data from different sources to broaden a model's exposure to varying conditions, and introduces a new performance metric, the best score difference/ratio, for a holistic view of SSQA model performance across these datasets. This not only provides a framework for consistent evaluation but also improves generalization as models adjust to real-world variability, which is a notable contribution to SSQA.
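The article does not spell out how the best score difference/ratio is computed. The sketch below is one plausible reading, assuming each model receives a higher-is-better score (for example, a system-level correlation) on every test set: the difference is the mean gap to the best score achieved on each test set, and the ratio is the fraction of test sets on which the model is the best. The function name and structure are illustrative, not taken from SHEET.

```python
def best_score_metrics(scores):
    """Hypothetical reading of the best score difference/ratio metrics.

    scores[model][test_set] holds a higher-is-better score for each
    model on each test set.
    """
    test_sets = list(next(iter(scores.values())))
    # Best score attained by any model on each test set.
    best = {t: max(s[t] for s in scores.values()) for t in test_sets}
    out = {}
    for model, s in scores.items():
        out[model] = {
            # Mean gap to the per-test-set best; 0.0 means always the best.
            "best_diff": sum(best[t] - s[t] for t in test_sets) / len(test_sets),
            # Fraction of test sets on which this model is the best.
            "best_ratio": sum(s[t] == best[t] for t in test_sets) / len(test_sets),
        }
    return out

# Example with made-up numbers:
# best_score_metrics({"A": {"bvcc": 0.92, "nisqa": 0.85},
#                     "B": {"bvcc": 0.90, "nisqa": 0.88}})
```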
The MOS-Bench collection consists of datasets that differ in sampling rates and listener labels, capturing cross-domain variability in SSQA. The main datasets are:
- BVCC: An English dataset with samples from TTS and VC systems.
- SOMOS: Speech quality ratings for English TTS models trained on LJSpeech.
- SingMOS: A dataset of Chinese and Japanese singing voice samples.
- NISQA: Noisy speech samples transmitted over communication networks.

Together these datasets are multilingual and cover multiple domains and voice types, providing a generalized training scope. MOS-Bench uses SSL-MOS and a modified AlignNet as backbone models, relying on SSL to learn rich feature representations. SHEET streamlines the SSQA process with data-processing, training, and testing workflows, and also includes non-parametric kNN retrieval-based score inference to improve model fidelity (a generic sketch follows below). Additionally, hyperparameter tuning, such as batch size and optimization strategies, is included to further improve model performance.
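To illustrate the retrieval-based inference idea, the sketch below scores a query utterance by a similarity-weighted average of the MOS labels of its k nearest training embeddings. This is a generic kNN scheme under assumed NumPy-array inputs, not SHEET's actual code; in practice the embeddings would come from the SSL backbone.

```python
import numpy as np

def knn_mos(query_emb, datastore_embs, datastore_mos, k=5):
    """Minimal kNN retrieval-based MOS inference sketch.

    query_emb: (dim,) embedding of the utterance to score.
    datastore_embs: (N, dim) embeddings of labeled training utterances.
    datastore_mos: (N,) human MOS labels for those utterances.
    """
    # Cosine similarity between the query and every datastore entry.
    q = query_emb / np.linalg.norm(query_emb)
    d = datastore_embs / np.linalg.norm(datastore_embs, axis=1, keepdims=True)
    sims = d @ q                                # (N,)
    top = np.argsort(sims)[-k:]                 # indices of the k most similar
    weights = np.exp(sims[top])                 # soft weighting by similarity
    weights /= weights.sum()
    return float(weights @ datastore_mos[top])  # weighted neighbour average
```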
Using MOS-Bench and SHEET, models achieve substantial improvements in SSQA generalization on both synthetic and non-synthetic test sets, producing accurate rankings and faithful quality predictions even on out-of-domain data. Notably, models trained on non-synthetic MOS-Bench datasets such as PSTN and NISQA remain robust on synthetic test sets, suggesting that synthesis-focused training data, previously thought necessary for generalization, is no longer a prerequisite. Visualizations further confirm that models trained on MOS-Bench capture a wide variety of data distributions, reflecting better adaptability and consistency. These results establish MOS-Bench as a reliable benchmark that lets SSQA models deliver accurate performance across domains, broadening the effectiveness and applicability of automated speech quality assessment.
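Rank-based claims like these are conventionally measured with the system-level Spearman rank correlation coefficient (SRCC): average the predicted and true MOS per system, then correlate the two rankings. Here is a small sketch of that computation, assuming flat per-utterance arrays and SciPy; the function name is ours, not from SHEET.

```python
import numpy as np
from scipy.stats import spearmanr

def system_level_srcc(sys_ids, pred_mos, true_mos):
    """Average utterance-level scores per system, then rank-correlate."""
    systems = sorted(set(sys_ids))

    def per_system_mean(vals):
        return [np.mean([v for s, v in zip(sys_ids, vals) if s == sys])
                for sys in systems]

    rho, _ = spearmanr(per_system_mean(pred_mos), per_system_mean(true_mos))
    return rho
```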
Through MOS-Bench and SHEET, this methodology tackles SSQA's generalization problem across diverse datasets while introducing a new evaluation metric. By reducing dataset-specific biases and improving cross-domain applicability, it pushes the boundaries of SSQA research and enables models to generalize effectively across applications. A key advance is the collection of cross-domain datasets paired with a standardized toolkit: researchers now have the resources to develop SSQA models that are robust across a variety of speech types and real-world applications.
Check out the Paper. All credit for this research goes to the researchers of this project.
Aswin AK is a Consulting Intern at MarkTechPost. He is pursuing his dual degree from the Indian Institute of Technology Kharagpur. He is passionate about data science and machine learning, and brings a strong academic background and practical experience in solving real-world interdisciplinary challenges.