We have trained and are open-sourcing a neural network called Whisper that approaches human-level robustness and accuracy on English speech recognition.
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that using such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. In addition, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.
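As a rough sketch of what this looks like in practice, the released `whisper` Python package exposes a high-level `transcribe` API; the model size and audio file name below are placeholder choices:

```python
import whisper

# Load one of the released checkpoints; "base" is a placeholder choice,
# and larger checkpoints ("small", "medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe an audio file (hypothetical file name); the package handles
# chunking, language detection, and decoding internally.
result = model.transcribe("audio.mp3")
print(result["text"])
```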
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
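The lower-level API in the released package mirrors this pipeline, exposing the padding, spectrogram, language-identification, and decoding steps individually (a sketch; the audio file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad or trim it to a 30-second window.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the waveform to a log-Mel spectrogram for the encoder.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification, driven by the special tokens described above.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the spectrogram into text.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```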
Other existing approaches frequently use smaller, more closely matched audio-text training datasets, or use broad but unsupervised audio pretraining. Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets, we find that it is much more robust and makes 50% fewer errors than those models.
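Errors of this kind are typically reported as word error rate (WER). A minimal sketch of such a zero-shot evaluation loop, assuming the open-source `jiwer` library and a hypothetical `samples` iterable of (audio path, reference transcript) pairs, might look like:

```python
import whisper
from jiwer import wer  # jiwer computes word error rate (WER)

model = whisper.load_model("base")

def dataset_wer(samples):
    """Zero-shot WER over an iterable of (audio_path, reference) pairs."""
    references, hypotheses = [], []
    for audio_path, reference in samples:
        # Zero-shot: no fine-tuning on the target dataset, transcribe as-is.
        hypotheses.append(model.transcribe(audio_path)["text"])
        references.append(reference)
    return wer(references, hypotheses)
```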
About a third of Whisper's audio dataset is non-English, and the model is alternately tasked with transcribing in the original language or translating into English. We found this approach to be particularly effective for learning speech-to-text translation: Whisper outperforms the supervised state of the art on CoVoST2 to-English translation in a zero-shot setting.
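Switching between these two tasks comes down to the task token passed at decoding time; a sketch using the released package (the model size and French file name are placeholders):

```python
import whisper

model = whisper.load_model("medium")

# Transcribe in the original language (hypothetical French recording).
same_language = model.transcribe("french_audio.mp3", language="fr",
                                 task="transcribe")
print(same_language["text"])  # French text

# Same audio, translated into English by switching the task token.
to_english = model.transcribe("french_audio.mp3", language="fr",
                              task="translate")
print(to_english["text"])  # English text
```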
We hope that Whisper's high accuracy and ease of use will allow developers to add voice interfaces to a much broader set of applications. Check out the paper, model card, and code for more details and to try out Whisper.