In the field of artificial intelligence and machine learning, speech recognition models are transforming the way people interact with technology. Built on advances in natural language processing, understanding, and generation, these models have paved the way for a wide range of applications across almost every industry. They are essential for smooth communication between humans and machines, as they are designed to convert spoken language into text.
In recent years, speech recognition has advanced rapidly, and OpenAI's Whisper series has set a high standard. OpenAI introduced Whisper, its series of audio transcription models, in late 2022, and the models have since gained widespread attention in the AI community, from students and academics to researchers and developers.
The pre-trained Whisper model, designed for automatic speech recognition (ASR) and speech translation, is a Transformer-based encoder-decoder, also known as a sequence-to-sequence model. It was trained on 680,000 hours of labeled speech data and exhibits an exceptional ability to generalize across many datasets and domains without the need for fine-tuning.
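To make the setup concrete, here is a minimal sketch of running a pre-trained Whisper checkpoint through the Hugging Face transformers library; the checkpoint name and audio file path are illustrative choices, not mandated by the article.

```python
# A minimal sketch of transcribing audio with a pre-trained Whisper
# checkpoint via Hugging Face transformers; "sample.wav" is illustrative.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
)

# The pipeline feeds the audio through the encoder and decodes text
# autoregressively with the sequence-to-sequence decoder.
result = asr("sample.wav")
print(result["text"])
```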
The Whisper model stands out for its adaptability, since checkpoints are available in both multilingual and English-only variants. English-only models predict transcriptions in the same language as the audio and focus purely on speech recognition. Multilingual models, on the other hand, are trained for both speech recognition and speech translation, where the model predicts a transcription in a language different from that of the audio. This dual capability allows the model to serve multiple purposes and increases its adaptability to different linguistic environments.
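The task switch can be seen directly in code. The following hedged sketch, which assumes a recent transformers version and an illustrative French audio file, asks the same multilingual checkpoint first for a transcription and then for an English translation:

```python
# A hedged sketch of the two multilingual tasks; "french_sample.wav"
# is an illustrative file name, not from the original article.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Speech recognition: transcribe French audio as French text.
transcription = asr(
    "french_sample.wav",
    generate_kwargs={"language": "french", "task": "transcribe"},
)

# Speech translation: render the same French audio as English text.
translation = asr(
    "french_sample.wav",
    generate_kwargs={"language": "french", "task": "translate"},
)

print(transcription["text"])
print(translation["text"])
```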
Important variants of the Whisper series include Whisper v2, Whisper v3, and Distil-Whisper. Distil-Whisper is a distilled, streamlined version of Whisper with faster speed and a smaller size. When examining the overall word error rate (WER) of each model, a seemingly paradoxical finding emerges: the larger models have noticeably higher WER than the smaller ones.
A thorough evaluation revealed that the cause of this mismatch is the multilingualism of the large models, which often leads them to misidentify the language based on the speaker's accent. Once these erroneous transcripts are removed, the picture becomes clearer: the large v2 and v3 models have the lowest WER, while the Distil models have the highest.
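For readers unfamiliar with the metric, WER counts the substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. Here is a small illustration using the jiwer library (an assumed dependency, not mentioned in the original comparison):

```python
# A small illustration of the WER arithmetic using the jiwer library
# (assumed installed via `pip install jiwer`).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# jiwer aligns the two strings and counts substitutions, deletions,
# and insertions relative to the nine reference words.
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # two substitutions / nine words = 22.22%
```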
English-only models, by design, avoid transcription errors in languages other than English. Thanks to its access to a larger audio dataset, the large v3 model has been shown to outperform its predecessors in terms of language misidentification rate. When the Distil models were evaluated, they performed well even across different speakers, but a few additional findings emerged:
- Distil models may fail to transcribe consecutive sentence segments, as evidenced by a poor length ratio between the output and the reference label (see the sketch after this list).
- Distil models sometimes outperform the base versions, particularly at inserting punctuation; the Distil medium model stands out in this respect.
- Base Whisper models tend to omit a speaker's verbal repetitions, whereas Distil models transcribe them.
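The first finding above can be checked mechanically. Below is a hypothetical helper (the function name, sample strings, and the 0.8 threshold are all illustrative) that flags transcripts whose word count falls well short of the reference label:

```python
# A hypothetical check for dropped segments: compare transcript length
# to the reference label, as in the first finding above.
def length_ratio(hypothesis: str, reference: str) -> float:
    """Ratio of transcribed words to reference words (1.0 is ideal)."""
    return len(hypothesis.split()) / max(len(reference.split()), 1)

ref = "we will meet tomorrow at noon and then drive to the airport together"
hyp = "we will meet tomorrow at noon"  # a truncated transcript

ratio = length_ratio(hyp, ref)
if ratio < 0.8:  # threshold is an illustrative choice
    print(f"Possible dropped segments: length ratio {ratio:.2f}")
```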
Following a recent Twitter thread by Omar Sanseviero, here is a comparison of the Whisper models and a discussion of which model should be used when.
- Whisper v3 (optimal for known languages): If the language is known and language identification is reliable, Whisper v3 is the better choice.
- Whisper v2 (robust for unknown languages): Whisper v2 is more reliable if the language is unknown or if Whisper v3's language identification proves unreliable.
- Whisper v3 Large (English excellence): Whisper v3 Large is a good default option if the audio is always in English and memory or inference performance is not a concern.
- Distilled Whisper (speed and efficiency): Distilled Whisper is the better choice if memory or inference performance matters and the audio is in English. It is six times faster, 49% smaller, and stays within 1% of Whisper v2's WER. Despite occasional shortcomings, it performs almost as well as its slower counterparts. A minimal selection sketch follows this list.
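To tie the guidance together, here is a hedged sketch of how such a selection might look in code; the checkpoint IDs are real Hugging Face model names, but the decision function is an illustrative simplification rather than an official recommendation:

```python
# A sketch of checkpoint selection following the guidance above;
# the branching logic is an illustrative simplification.
from transformers import pipeline

def pick_checkpoint(language_known: bool, english_only: bool,
                    speed_matters: bool) -> str:
    if english_only and speed_matters:
        return "distil-whisper/distil-large-v2"  # fast, English-only
    if english_only:
        return "openai/whisper-large-v3"  # strong English default
    if language_known:
        return "openai/whisper-large-v3"  # reliable when language is known
    return "openai/whisper-large-v2"      # more robust to misidentification

asr = pipeline(
    "automatic-speech-recognition",
    model=pick_checkpoint(language_known=False, english_only=False,
                          speed_matters=False),
)
print(asr.model.name_or_path)
```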
In conclusion, Whisper models have significantly advanced the field of audio transcription and are available for anyone to use. The choice between Whisper v2, Whisper v3, and Distilled Whisper depends entirely on the particular requirements of the application, so an informed decision requires careful consideration of factors such as language identification, speed, and model efficiency.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.