Previously, we presented the 1,000 Languages Initiative and the Universal Speech Model with the goal of making speech and language technologies available to billions of users around the world. Part of this commitment involves developing high-quality speech synthesis technologies, building on projects such as VDTTS and AudioLM, for users who speak many different languages.
After developing a new model, we must evaluate whether the speech it generates is accurate and natural: the content must be relevant to the task, the pronunciation correct, the tone appropriate, and there must be no acoustic artifacts such as cracks or signal-correlated noise. Such evaluation is a major bottleneck in the development of multilingual speech systems.
The most popular method of evaluating the quality of a speech synthesis model is human evaluation: a text-to-speech (TTS) engineer produces a few thousand utterances from the latest model, sends them out for human evaluation, and receives the results a few days later. This evaluation phase typically involves listening tests, during which dozens of raters listen to the utterances one after another to determine how natural they sound. While humans have yet to be beaten at detecting whether a piece of speech sounds natural, this process can be impractical, especially in the early stages of research projects, when engineers need rapid feedback to test and refine their approach. Human evaluation is expensive, time-consuming, and can be limited by the availability of raters for the languages of interest.
Another barrier to progress is that different projects and institutions often use different ratings, platforms, and protocols, making apples-to-apples comparisons impossible. In this respect, speech synthesis technologies lag behind text generation, where researchers have long supplemented human evaluation with automatic metrics such as BLEU or, more recently, BLEURT.
In “SQuId: Measuring Speech Naturalness in Many Languages”, to be presented at ICASSP 2023, we introduce SQuId (Speech Quality Identification), a 600M-parameter regression model that describes how natural a piece of speech sounds. SQuId is based on mSLAM (a pre-trained speech-and-text model developed by Google), fine-tuned on over a million quality ratings in 42 languages and tested in 65. We demonstrate how SQuId can be used to complement human ratings for the evaluation of many languages. This is the largest published effort of its kind to date.
TTS assessment with SQuId
The main hypothesis behind SQuId is that training a regression model on previously collected ratings provides a low-cost method for assessing the quality of a TTS model. The model can thus be a valuable addition to a TTS researcher’s evaluation toolbox, providing a near-instantaneous, albeit less accurate, alternative to human evaluation.
SQuId takes an utterance as input, along with an optional locale tag (that is, a localized variant of a language, such as “Brazilian Portuguese” or “British English”). It returns a score between 1 and 5 indicating how natural the waveform sounds, with higher values indicating more natural waveforms.
Internally, the model consists of three components: (1) an encoder, (2) a pooling/regression layer, and (3) a fully connected layer. First, the encoder takes a spectrogram as input and embeds it into a smaller 2D matrix containing 3,200 vectors of size 1,024, where each vector encodes a time step. The pooling/regression layer aggregates the vectors, appends the locale tag, and feeds the result into a fully connected layer that returns a score. Finally, we apply application-specific post-processing that rescales or normalizes the score so that it falls within the [1, 5] range, which is common for human naturalness ratings. We train the whole model end-to-end with a regression loss.
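To make this pipeline concrete, here is a minimal sketch of the forward pass in PyTorch. It is not the actual implementation: the single convolution merely stands in for the mSLAM encoder, and the sigmoid rescaling, layer names, and default sizes are assumptions chosen only to keep the shapes concrete.

```python
import torch
import torch.nn as nn

class SQuIdSketch(nn.Module):
    def __init__(self, num_locales: int = 65, mel_bins: int = 128, hidden_dim: int = 1024):
        super().__init__()
        # Stand-in for the 600M-parameter mSLAM speech encoder (assumption:
        # a single conv layer, only to keep the tensor shapes explicit).
        self.encoder = nn.Conv1d(mel_bins, hidden_dim, kernel_size=3, padding=1)
        # Locale tag embedding, added to the pooled representation.
        self.locale_embedding = nn.Embedding(num_locales, hidden_dim)
        # Fully connected layer that returns a raw score.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, spectrogram: torch.Tensor, locale_id: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, mel_bins, time_steps); the real encoder emits
        # up to 3,200 vectors of size 1,024, one per time step.
        frames = self.encoder(spectrogram)            # (batch, 1024, time_steps)
        pooled = frames.mean(dim=-1)                  # pooling layer: aggregate over time
        pooled = pooled + self.locale_embedding(locale_id)
        raw = self.head(pooled).squeeze(-1)           # unbounded regression output
        # Post-processing (an assumed rescaling): map the score into [1, 5].
        return 1.0 + 4.0 * torch.sigmoid(raw)

# Training proceeds end-to-end with a regression loss, e.g.:
# loss = nn.functional.mse_loss(model(spectrogram, locale_id), human_ratings)
```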
The encoder is by far the largest and most important piece of the model. We use mSLAM, a pre-existing 600M-parameter Conformer pre-trained on both speech (51 languages) and text (101 languages).
The SQuId model.
To train and test the model, we created the SQuId corpus: a collection of 1.9 million rated utterances across 66 languages, collected for more than 2,000 TTS research and product projects. The SQuId corpus covers a diverse range of systems, including concatenative and neural models, and a wide range of use cases, such as driving directions and virtual assistants. Manual inspection reveals that SQuId is exposed to a wide variety of TTS errors, such as acoustic artifacts (e.g., clicks and pops), incorrect prosody (e.g., questions without rising intonation in English), text normalization errors (e.g., reading “7/7” as “seven divided by seven” rather than “July 7”), and mispronunciations (e.g., reading “tough” as “toe”).
A common problem that arises when training multilingual systems is that training data may not be uniformly available for all languages of interest, and SQuId was no exception. The following figure illustrates the size of the corpus for each locale. We see that the distribution is largely dominated by American English.
Locale distribution in the SQuId dataset.
How can we deliver good performance for all languages despite such imbalance? Inspired by previous work on machine translation, as well as prior work from the speech literature, we decided to train one model for all languages rather than separate models for each language. The hypothesis is that if the model is large enough, cross-locale transfer can occur: the model’s accuracy on each locale improves as a result of joint training on the others. As our experiments show, cross-locale transfer proves to be a powerful performance booster; a rough sketch of this setup follows.
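As an illustration of this design choice, the sketch below pools per-locale rating data into a single training set while keeping the locale tag attached to each example. The data structures and locale codes are hypothetical, not the actual corpus format.

```python
from dataclasses import dataclass

@dataclass
class RatedUtterance:
    audio_path: str   # path to the synthesized utterance
    locale: str       # e.g., "en-US", "pt-BR", "ko-KR" (illustrative codes)
    rating: float     # human naturalness score in [1, 5]

def build_joint_training_set(
    per_locale: dict[str, list[RatedUtterance]]
) -> list[RatedUtterance]:
    """Train one model for all locales: pool every locale's examples into a
    single training set. The locale tag stays attached so the model can
    condition on it, but a single set of weights is trained jointly on all
    locales, allowing cross-locale transfer."""
    joint: list[RatedUtterance] = []
    for examples in per_locale.values():
        joint.extend(examples)
    return joint
```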
Experimental results
To understand the overall performance of SQuId, we compare it against a custom Big-SSL-MOS model (described in the paper), a competitive baseline inspired by MOS-SSL, a state-of-the-art TTS evaluation system. Big-SSL-MOS is based on w2v-BERT and was trained on the VoiceMOS’22 Challenge dataset, the most popular dataset at the time of evaluation. We experimented with several variants of the model and found that SQuId is up to 50.0% more accurate.
SQuId versus state-of-the-art baselines. We measure agreement with human ratings using Kendall’s Tau, where a higher value indicates better accuracy.
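For reference, the agreement metric from the figure above can be computed with SciPy’s `kendalltau`; the scores below are made-up values, purely for illustration.

```python
from scipy.stats import kendalltau

# Made-up per-utterance scores, purely for illustration.
human_ratings     = [4.2, 3.1, 4.8, 2.5, 3.9]   # mean rater scores
model_predictions = [4.0, 3.3, 4.6, 2.9, 3.7]   # model-predicted scores

tau, p_value = kendalltau(human_ratings, model_predictions)
print(f"Kendall's Tau: {tau:.3f} (p-value: {p_value:.3f})")
```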
To understand the impact of cross-locale transfer, we conducted a series of ablation studies. We varied the number of locales included in the training set and measured the effect on SQuId’s accuracy. In English, which is already overrepresented in the dataset, the effect of adding locales is negligible.
SQuId performance on US English, using 1, 8, and 42 locales during fine-tuning.
However, cross-locale transfer is much more effective for most other locales:
SQuId performance in four selected locales (Korean, French, Thai, and Tamil), using 1, 8, and 42 locales during fine-tuning. For each locale, we also provide the size of the training set.
To push transfer to its limit, we held out 24 locales during training and used them exclusively for testing, measuring the extent to which SQuId can handle languages it has never seen before. The following graph shows that, although the effect is not uniform, cross-locale transfer works.
SQuId performance in four “zero-shot” locales, using 1, 8, and 42 locales during fine-tuning.
When does cross-locale transfer work, and how? We present many more ablations in the paper and show that, while language similarity plays a role (for example, training on Brazilian Portuguese helps European Portuguese), it is surprisingly far from the only factor that matters.
Conclusion and future work
We introduced SQuId, a 600M-parameter regression model that leverages the SQuId dataset and cross-locale learning to assess speech quality and describe how natural it sounds. We demonstrate that SQuId can complement human raters in the evaluation of many languages. Future work includes improving accuracy, expanding the range of languages covered, and handling new error types.
Acknowledgements
The author of this post is now part of Google DeepMind. Many thanks to all the authors of the article: Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa.