The world of AI has drastically transformed the daily lives of humans. Features like voice recognition have made tasks such as taking notes and writing documents considerably easier, and the speed of speech recognition is a large part of what makes it so efficient. With the development of AI, speech recognition applications have expanded rapidly. Virtual assistants like Google Assistant, Alexa, and Siri use voice recognition software to interact with users, and features like text-to-speech, speech-to-text, and text-to-text have likewise gained popularity across various apps.
Encouraged by the excellent performance of T5 (Text-To-Text Transfer Transformer) on pretrained natural language processing models, the researchers proposed SpeechT5, a unified-modal framework that explores encoder-decoder pretraining for self-supervised speech and text representation learning. SpeechT5 is available in Hugging Face Transformers, an open-source toolkit that provides easy implementations of state-of-the-art machine learning models.
SpeechT5 offers three different types of speech models in one architecture. Using a standard encoder-decoder framework, SpeechT5's unified model enables learning of joint contextual representations for speech and text data. The three speech models (see the sketch after this list) are:
- Text-to-speech: to create audio from scratch.
- Speech-to-text: to automatically recognize speech.
- Speech-to-speech: to perform speech enhancement or convert between voices.
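As a concrete illustration, Hugging Face Transformers exposes each of these tasks as a separate model class built on the same shared architecture. A minimal sketch follows; the checkpoint names are the ones Microsoft published on the Hugging Face Hub and are assumed to still be available under those names:

```python
# Sketch: loading the three SpeechT5 task heads from Hugging Face
# Transformers. Each class wraps the same shared encoder-decoder
# with different pre-nets and post-nets.
from transformers import (
    SpeechT5ForTextToSpeech,    # text -> spectrogram
    SpeechT5ForSpeechToText,    # speech -> text (ASR)
    SpeechT5ForSpeechToSpeech,  # speech -> speech (voice conversion)
)

tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
asr_model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")
vc_model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
```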
The fundamental principle of SpeechT5 is to pretrain a model using a combination of text-to-speech, speech-to-text, text-to-text, and speech-to-speech data. In this way, the model learns simultaneously from speech and text. This pretraining method produces a model with a single hidden representation space shared by text and audio.
SpeechT5 is built on a standard Transformer encoder-decoder model. Like any other Transformer, the encoder-decoder network models a sequence-to-sequence transformation over hidden representations, and all SpeechT5 tasks share this same Transformer backbone.
Adding pre-nets and post-nets allows the same Transformer to handle both text and speech data. The pre-nets translate input text or speech into the hidden representations the Transformer expects; the post-nets take the Transformer's outputs and convert them back into text or speech. To train the model on a diverse set of tasks, the team feeds it input in text or speech format and has it generate the corresponding output in text or speech format, as the sketch below illustrates.
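Here is a schematic sketch of that division of labor in Python. The class and attribute names are illustrative placeholders, not SpeechT5's actual implementation:

```python
# Schematic only: one shared Transformer encoder-decoder, with
# task-specific pre-nets and post-nets swapped in around it.
class SpeechT5Sketch:
    def __init__(self, shared_transformer, pre_net, post_net):
        self.transformer = shared_transformer  # same weights for every task
        self.pre_net = pre_net    # text or speech -> hidden representations
        self.post_net = post_net  # hidden representations -> text or speech

    def forward(self, inputs, decoder_inputs):
        enc_hidden = self.pre_net(inputs)            # task-specific entry point
        dec_hidden = self.transformer(enc_hidden, decoder_inputs)  # shared core
        return self.post_net(dec_hidden)             # task-specific output head
```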
Text-to-speech: The model uses the following pre-nets and post-nets for the TTS task (a usage sketch follows the list):
- Text encoder pre-net. A layer that translates text tokens into the hidden representations the encoder expects, comparable to the input embeddings in an NLP model like BERT.
- Speech decoder pre-net. It takes a log mel spectrogram as input and compresses it into hidden representations using linear layers.
- Speech decoder post-net. It predicts a residual to add to the output spectrogram, refining the result; this design comes from Tacotron 2.
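Putting these pieces together, a minimal text-to-speech sketch with the Hugging Face implementation looks like the following. The zero-filled speaker embedding is purely a placeholder; real usage would load an x-vector for an actual speaker (e.g. from the CMU ARCTIC x-vectors dataset):

```python
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")  # spectrogram -> waveform

inputs = processor(text="Hello, SpeechT5!", return_tensors="pt")
# Placeholder 512-dim speaker embedding; substitute a real x-vector.
speaker_embeddings = torch.zeros((1, 512))

# generate_speech runs the text encoder pre-net, the shared Transformer,
# and the speech decoder pre-/post-nets, then vocodes to audio.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
print(speech.shape)  # 1-D waveform tensor at 16 kHz
```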
Speech-to-text: The model employs the following pre-nets and post-nets for the speech-to-text task (a usage sketch follows the list):
- Speech encoder pre-net.
- Text decoder pre-net.
- Text decoder post-net.
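A corresponding speech-to-text sketch, assuming a 16 kHz mono waveform; a silent one-second array stands in for real audio here:

```python
import numpy as np
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

# Stand-in waveform: 1 second of 16 kHz silence. Replace with real audio.
waveform = np.zeros(16000, dtype=np.float32)
inputs = processor(audio=waveform, sampling_rate=16000, return_tensors="pt")

# The speech encoder pre-net and shared Transformer encode the audio;
# the text decoder pre-/post-nets produce token IDs.
predicted_ids = model.generate(**inputs, max_length=100)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```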
Speech-to-speech: The text-to-speech and speech-to-speech models of SpeechT5 are conceptually equivalent; simply replace the text encoder pre-net with the speech encoder pre-net. The rest of the model is identical, as the sketch below shows.
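A minimal voice-conversion sketch in the same style, again with zero-filled placeholders for the source audio and the target speaker's x-vector:

```python
import numpy as np
import torch
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

waveform = np.zeros(16000, dtype=np.float32)  # stand-in for real 16 kHz source speech
inputs = processor(audio=waveform, sampling_rate=16000, return_tensors="pt")
speaker_embeddings = torch.zeros((1, 512))  # placeholder target-speaker x-vector

# Same call as TTS, but the input is speech rather than text tokens.
converted = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)
```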
Unlike other models, SpeechT5 is unique in that it lets users perform numerous tasks with the same architecture; the only things that change are the pre-nets and post-nets. After pretraining on these combined tasks, the model can be fine-tuned to perform each individual task more skillfully. The proposed unified encoder-decoder approach supports generation tasks such as speech synthesis and voice conversion, and large-scale experiments show that SpeechT5 significantly outperforms all baselines on a variety of spoken language processing tasks. In the future, the research team plans to pretrain SpeechT5 with a larger model and more unlabeled data, and is also interested in extending the SpeechT5 framework to tasks involving multilingual spoken language processing.
Check out the Paper and related article. All credit for this research goes to the researchers of this project.
Niharika is a technical consulting intern at Marktechpost. She is a third-year student currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a very enthusiastic individual with a strong interest in machine learning, data science, and artificial intelligence, and an avid reader of the latest developments in these fields.