Last November, we announced the 1,000 Languages Initiative, an ambitious commitment to build a machine learning (ML) model that would support the world’s 1,000 most-spoken languages, bringing greater inclusion to billions of people around the world. However, some of these languages are spoken by fewer than twenty million people, so a central challenge is how to support languages for which there are relatively few speakers or limited data available.
Today, we are excited to share more about the Universal Speech Model (USM), a critical first step towards supporting 1,000 languages. USM is a family of next-generation speech models with 2B parameters trained on 12 million hours of speech and 28 billion sentences of text, spanning more than 300 languages. USM, which is used on YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) not only on widely spoken languages like English and Mandarin, but also on under-resourced languages like Amharic, Cebuano, Assamese, and Azerbaijani, to name a few. In “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages”, we demonstrate that using a large unlabeled multilingual dataset to pre-train the encoder of the model and fine-tuning on a smaller set of labeled data enables us to recognize under-represented languages. Moreover, our model training process is effective at adapting to new languages and data.
A sample of the languages that USM supports.
Challenges in current ASR
To achieve this ambitious goal, we must address two major challenges in ASR.
First, there is a lack of scalability with conventional supervised learning approaches. A fundamental challenge of scaling speech technologies to many languages is obtaining enough data to train high-quality models. With conventional approaches, audio data needs to be either manually labeled, which is time-consuming and costly, or collected from sources with pre-existing transcriptions, which are harder to find for languages that lack wide representation. In contrast, self-supervised learning can take advantage of audio-only data, which is available in much larger quantities across languages. This makes self-supervision a better approach to accomplish our goal of scaling across hundreds of languages.
Another challenge is that models must improve in a computationally efficient manner while we expand language coverage and quality. This requires the learning algorithm to be flexible, efficient, and generalizable. More specifically, such an algorithm should be able to use large amounts of data from a variety of sources, enable model updates without requiring complete retraining, and generalize to new languages and use cases.
Our approach: Self-supervised learning with fine-tuning
USM uses the standard encoder-decoder architecture, where the decoder can be CTC, RNN-T, or LAS. For the encoder, USM uses the Conformer, or convolution-augmented transformer. The key component of the Conformer is the Conformer block, which consists of attention, feed-forward, and convolutional modules. It takes as input the log-mel spectrogram of the speech signal and performs convolutional sub-sampling, after which a series of Conformer blocks and a projection layer are applied to obtain the final embeddings.
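To make the block structure concrete, below is a minimal PyTorch-style sketch of a Conformer-style block (two half-step feed-forward modules sandwiching self-attention and a convolution module). The hyperparameters, layer choices, and module names are illustrative assumptions rather than the exact USM configuration.

```python
# Minimal, illustrative Conformer-style block. Layer sizes and choices
# are assumptions for exposition, not the exact USM configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardModule(nn.Module):
    def __init__(self, dim, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * expansion),
            nn.SiLU(),                      # "swish" activation
            nn.Dropout(dropout),
            nn.Linear(dim * expansion, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConvolutionModule(nn.Module):
    def __init__(self, dim, kernel_size=15, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                   # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)    # -> (batch, dim, time)
        y = F.glu(self.pointwise1(y), dim=1)
        y = F.silu(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return self.dropout(y)


class ConformerBlock(nn.Module):
    """Feed-forward -> self-attention -> convolution -> feed-forward,
    each wrapped in a residual connection."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.ff1 = FeedForwardModule(dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvolutionModule(dim)
        self.ff2 = FeedForwardModule(dim)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                   # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)           # half-step feed-forward residual
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)


# Example: frames that have already gone through convolutional sub-sampling
# and projection to the model dimension pass through a stack of such blocks.
frames = torch.randn(2, 100, 512)
print(ConformerBlock()(frames).shape)       # torch.Size([2, 100, 512])
```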
Our training pipeline starts with a first step of self-supervised learning on speech audio covering hundreds of languages. In an optional second step, the model’s quality and language coverage can be improved through additional pre-training with text data; the decision to incorporate this step depends on whether text data is available. USM performs best with this optional second step. The last step of the training pipeline is fine-tuning on downstream tasks (e.g., ASR or automatic speech translation) with a small amount of supervised data.
For the first step, we use BEST-RQ, which has already demonstrated state-of-the-art results on multilingual tasks and has proven to be efficient when using very large amounts of unsupervised audio data.
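The core idea behind BEST-RQ is that a frozen random projection and a frozen random codebook turn speech frames into discrete targets, and the encoder is trained to predict those targets at masked positions. The NumPy sketch below illustrates only this target-generation step; the dimensions and normalization details are assumptions for illustration.

```python
# Hedged sketch of BEST-RQ-style target generation: a frozen random
# projection plus a frozen random codebook map each speech frame to a
# discrete label. Dimensions and normalization are assumptions.
import numpy as np

rng = np.random.default_rng(0)
feature_dim, proj_dim, codebook_size = 80, 16, 8192

# Both the projection and the codebook are randomly initialized and frozen.
projection = rng.normal(size=(feature_dim, proj_dim))
codebook = rng.normal(size=(codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)


def quantize(frames):
    """Map (time, feature_dim) speech frames to discrete target ids."""
    z = frames @ projection                        # random projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # l2-normalize
    return np.argmax(z @ codebook.T, axis=1)       # nearest codebook entry


# Toy usage: targets for 100 frames of 80-dim log-mel features.
frames = rng.normal(size=(100, feature_dim))
targets = quantize(frames)                         # ints in [0, codebook_size)

# During pre-training, a random subset of frames is masked in the input and
# the encoder is trained (e.g., with cross-entropy) to predict the quantized
# targets at the masked positions.
mask = rng.random(100) < 0.15
print(targets[mask][:10])
```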
In the second (optional) step, we use multi-objective supervised pre-training to incorporate knowledge from additional text data. The model introduces an additional encoder module to take text as input and additional layers to combine the outputs of the speech encoder and the text encoder, and we train the model jointly on unlabeled speech, labeled speech, and text data.
In the last stage, USM is fine-tuned on downstream tasks. The overall training pipeline is illustrated below. With the knowledge acquired during pre-training, USM models achieve good quality with only a small amount of supervised data from the downstream tasks.
USM’s overall training pipeline.
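As a purely schematic summary of the three stages, the Python sketch below outlines the flow; every function name and data placeholder here is hypothetical and stands in for the actual training infrastructure, not a real training API.

```python
# Schematic outline of the USM training pipeline. Function and variable
# names are hypothetical placeholders for the real training infrastructure.

def pretrain_best_rq(encoder, unlabeled_speech):
    """Step 1: self-supervised BEST-RQ pre-training on speech audio
    covering hundreds of languages."""
    # ... mask frames and predict random-projection-quantized targets ...
    return encoder


def pretrain_with_text(encoder, unlabeled_speech, labeled_speech, text):
    """Optional step 2: multi-objective pre-training that adds a text
    encoder and combination layers, trained jointly on unlabeled speech,
    labeled speech, and text data."""
    # ... joint training with multiple objectives ...
    return encoder


def finetune(encoder, downstream_data, task):
    """Step 3: fine-tune on a downstream task (e.g., ASR or AST) with a
    small amount of supervised data."""
    # ... attach a CTC / RNN-T / LAS decoder and fine-tune ...
    return encoder


def train_usm(encoder, data, use_text_injection=True):
    encoder = pretrain_best_rq(encoder, data["unlabeled_speech"])
    if use_text_injection:  # only when text data is available
        encoder = pretrain_with_text(encoder, data["unlabeled_speech"],
                                     data["labeled_speech"], data["text"])
    return finetune(encoder, data["downstream"], task="asr")
```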
Key results
Multilingual performance on YouTube Captions
Our encoder incorporates more than 300 languages through pre-training. We demonstrate the effectiveness of the pre-trained encoder through fine-tuning on YouTube Caption’s multilingual speech data. The supervised YouTube data includes 73 languages and averages less than three thousand hours of data per language. Despite the limited supervised data, the model achieves less than 30% word error rate (WER; lower is better) on average across the 73 languages, a milestone we have never achieved before. For en-US, USM has a 6% relatively lower WER compared to the current internal state-of-the-art model. Lastly, we compare with the recently released large model, Whisper (large-v2), which was trained with more than 400k hours of labeled data. For the comparison, we only use the 18 languages that Whisper can successfully decode with lower than 40% WER. Our model has, on average, a 32.7% relatively lower WER compared to Whisper for these 18 languages.
USM supports all 73 languages in the YouTube Captions test set and outperforms Whisper on the languages it can support with lower than 40% WER. Lower WER is better.
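The comparisons above are reported as relative WER reductions. The short snippet below shows the standard formula with hypothetical numbers; they are for illustration only, not measurements from the paper.

```python
# Relative WER reduction: how much lower the model's WER is, expressed as
# a fraction of the baseline's WER. The numbers below are hypothetical.

def relative_wer_reduction(wer_baseline, wer_model):
    return (wer_baseline - wer_model) / wer_baseline

# e.g., a baseline WER of 20.0% vs. a model WER of 13.46% corresponds to
# roughly a 32.7% relative reduction.
print(f"{relative_wer_reduction(20.0, 13.46):.1%}")   # -> 32.7%
```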
Generalization to downstream ASR tasks
On publicly available datasets, our model shows lower WER compared to Whisper on CORAAL (African American Vernacular English), SpeechStew (en-US), and FLEURS (102 languages). Our model achieves lower WER both with and without training on in-domain data. The comparison on FLEURS reports the subset of languages (62) that overlaps with the languages supported by the Whisper model. For FLEURS, USM without in-domain data has a 65.8% relatively lower WER compared to Whisper, and has a 67.8% relatively lower WER with in-domain data.
Comparison of USM (with or without in-domain data) and Whisper results across ASR benchmarks. Lower WER is better.
Automatic Speech Translation (AST) performance
For speech translation, we fine-tune USM on the CoVoST dataset. Our model, which includes text via the second stage of our pipeline, achieves state-of-the-art quality with limited supervised data. To assess the breadth of the model’s performance, we segment the languages from the CoVoST dataset into high, medium, and low based on resource availability and calculate the BLEU score (higher is better) for each segment. As shown below, USM outperforms Whisper for all segments.
BLEU score on CoVoST. Higher BLEU is better.
Towards 1,000 languages
The development of USM is a fundamental effort to realize Google’s mission to organize the world’s information and make it universally accessible. We believe that USM’s base model architecture and training process provide a foundation on which we can expand speech modeling to the next 1,000 languages.
Learn more
Check out our paper here. Researchers can request access to the USM API here.
Acknowledgements
We thank all co-authors for contributing to the project and paper, including Andrew Rosenberg, Ankur Bapna, Bhuvana Ramabhadran, Bo Li, Chung-Cheng Chiu, Daniel Park, Françoise Beaufays, Hagen Soltau, Gary Wang, Ginger Perng, James Qin, Jason Riesa, Johan Schalkwyk, Ke Hu, Nanxin Chen, Parisa Haghani, Pedro Moreno Mengibar, Rohit Prabhavalkar, Tara Sainath, Trevor Strohman, Vera Axelrod, Wei Han, Yonghui Wu, Yongqiang Wang, Yu Zhang, Zhehuai Chen, and Zhong Meng.
We also thank Alexis Conneau, Min Ma, Shikhar Bharadwaj, Sid Dalmia, Jiahui Yu, Jian Cheng, Paul Rubenstein, Ye Jia, Justin Snyder, Vincent Tsang, Yuanzhong Xu, and Tao Wang for helpful discussions.
We appreciate the valuable feedback and support from Eli Collins, Jeff Dean, Sissie Hsiao, and Zoubin Ghahramani. Special thanks to Austin Tarango, Lara Tumeh, Amna Latif, and Jason Porta for their guidance around responsible AI practices. We thank Elizabeth Adkison and James Cokerille for help naming the model, Tom Small for the animated graphic, Abhishek Bapna for editorial support, and Erica Moreira for resource management. We thank Anusha Ramesh for feedback, guidance, and assistance with the publication strategy, and Calum Barnes and Salem Haykal for their valuable input.