We have trained and are open-sourcing a neural network called Whisper that approaches human-level robustness and accuracy on English speech recognition.
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that using such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. In addition, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.
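As a rough sketch of what this looks like in practice, the released `whisper` Python package exposes a high-level `transcribe` API; the model size and audio file name below are placeholder choices:

```python
import whisper

# Load one of the released checkpoints; "base" is a placeholder choice,
# and larger checkpoints ("small", "medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe an audio file (hypothetical file name); the package handles
# chunking, language detection, and decoding internally.
result = model.transcribe("audio.mp3")
print(result["text"])
```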
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
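The lower-level API in the released package mirrors this pipeline, exposing the padding, spectrogram, language-identification, and decoding steps individually (a sketch; the audio file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad or trim it to a 30-second window.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the waveform to a log-Mel spectrogram for the encoder.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification, driven by the special tokens described above.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the spectrogram into text.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```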
Other existing approaches frequently use smaller, more closely matched audio-text training datasets, or use broad but unsupervised audio pretraining. Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets, we find that it is much more robust and makes 50% fewer errors than those models.
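Errors of this kind are typically reported as word error rate (WER). A minimal sketch of such a zero-shot evaluation loop, assuming the open-source `jiwer` library and a hypothetical `samples` iterable of (audio path, reference transcript) pairs, might look like:

```python
import whisper
from jiwer import wer  # jiwer computes word error rate (WER)

model = whisper.load_model("base")

def dataset_wer(samples):
    """Zero-shot WER over an iterable of (audio_path, reference) pairs."""
    references, hypotheses = [], []
    for audio_path, reference in samples:
        # Zero-shot: no fine-tuning on the target dataset, transcribe as-is.
        hypotheses.append(model.transcribe(audio_path)["text"])
        references.append(reference)
    return wer(references, hypotheses)
```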
About a third of Whisper's audio dataset is non-English, and the model is alternately tasked with transcribing in the original language or translating into English. We found this approach to be particularly effective for learning speech-to-text translation: Whisper outperforms the supervised state of the art on CoVoST2 to-English translation in a zero-shot setting.
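Switching between these two tasks comes down to the task token passed at decoding time; a sketch using the released package (the model size and French file name are placeholders):

```python
import whisper

model = whisper.load_model("medium")

# Transcribe in the original language (hypothetical French recording).
same_language = model.transcribe("french_audio.mp3", language="fr",
                                 task="transcribe")
print(same_language["text"])  # French text

# Same audio, translated into English by switching the task token.
to_english = model.transcribe("french_audio.mp3", language="fr",
                              task="translate")
print(to_english["text"])  # English text
```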
We hope that Whisper's high accuracy and ease of use will allow developers to add voice interfaces to a much broader set of applications. Check out the paper, model card, and code for more details and to try out Whisper.