Weakly supervised and unsupervised training approaches have shown outstanding performance across audio processing tasks, including speech recognition, speaker recognition, speech separation, and keyword spotting, thanks to the availability of large-scale web data. OpenAI's Whisper is a speech recognition system that exploits such data at an even larger scale. Trained on 680,000 hours of noisy, weakly labeled speech, spanning 96 additional languages and including 125,000 hours of English translation data, it demonstrates that weakly supervised pretraining of a simple encoder-decoder transformer can achieve multilingual speech transcription that transfers zero-shot to standard benchmarks.
Most academic benchmarks consist of short utterances, but real-world applications such as meetings, podcasts, and videos typically require transcription of long-form audio that can last minutes or hours. Due to memory constraints, the transformer architectures used for Automatic Speech Recognition (ASR) cannot transcribe arbitrarily long inputs; the input is limited to a fixed length (30 seconds in the case of Whisper). Recent work relies on sliding-window-style heuristics, which are error-prone because of (i) overlapping audio, which can produce inconsistent transcriptions when the model processes the same speech twice, and (ii) incomplete audio, where words at the beginning or end of an input segment may be dropped or transcribed incorrectly.
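To make the failure mode concrete, here is a minimal, self-contained sketch (not from the paper) of naive sliding-window chunking. The window and stride values are illustrative, and the actual ASR call is left as a comment; the point is that any overlap means the same speech is decoded twice, while any gap risks cutting words at chunk boundaries.

```python
# Illustrative sketch of naive sliding-window chunking for long-form audio.
import numpy as np

SAMPLE_RATE = 16_000          # Whisper-style 16 kHz input
WINDOW_SEC = 30.0             # fixed input length the model accepts
STRIDE_SEC = 25.0             # stride < window -> 5 s of overlapping audio per step

def sliding_windows(audio: np.ndarray):
    """Yield (start_time_in_seconds, chunk) pairs covering the whole recording."""
    window = int(WINDOW_SEC * SAMPLE_RATE)
    stride = int(STRIDE_SEC * SAMPLE_RATE)
    for start in range(0, max(len(audio) - 1, 1), stride):
        chunk = audio[start : start + window]
        yield start / SAMPLE_RATE, chunk

# Example: a 2-minute recording of silence stands in for real audio.
audio = np.zeros(int(120 * SAMPLE_RATE), dtype=np.float32)
for t0, chunk in sliding_windows(audio):
    # transcribe(chunk) would run the ASR model here; overlapping regions are
    # transcribed twice and must be reconciled by fragile merging heuristics.
    print(f"chunk starting at {t0:6.1f}s, {len(chunk) / SAMPLE_RATE:.1f}s long")
```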
Whisper proposes a buffered transcription method that relies on accurate timestamp prediction to determine how far to shift the input window. Such a solution is prone to severe drift, since timestamp errors in one window accumulate across subsequent windows. The authors attempt to suppress these errors with a variety of handcrafted heuristics, but these are often unsuccessful. Whisper's joint decoding, in which a single encoder-decoder produces both transcripts and timestamps, suffers from the usual problems of autoregressive language generation, namely hallucination and repetition. This is disastrous for buffered transcription of long-form audio and for other timestamp-sensitive tasks such as speaker diarization, lip reading, and audiovisual learning.
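The sketch below illustrates the buffered-transcription loop under assumed interfaces; it is not Whisper's actual code. `transcribe_window` is a stand-in for the model, and the window offset advances by whatever timestamp the model last predicted, so a single mispredicted timestamp shifts every subsequent window.

```python
# Illustrative sketch of Whisper-style buffered transcription driven by
# predicted timestamps (simplified; real decoding emits timestamp tokens
# jointly with text).
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SEC = 30.0

def transcribe_window(chunk: np.ndarray):
    """Stand-in for the ASR model: returns (text, last_predicted_timestamp_s)."""
    return "...", min(len(chunk) / SAMPLE_RATE, 28.7)  # pretend the last token lands at 28.7 s

def buffered_transcribe(audio: np.ndarray):
    window = int(WINDOW_SEC * SAMPLE_RATE)
    offset, results = 0, []
    while offset < len(audio):
        text, last_ts = transcribe_window(audio[offset : offset + window])
        results.append((offset / SAMPLE_RATE, text))
        # The whole scheme hinges on last_ts being accurate: if the model
        # hallucinates or truncates timestamps, this shift is wrong and every
        # later window starts in the wrong place (errors compound).
        offset += int(last_ts * SAMPLE_RATE)
    return results

audio = np.zeros(int(300 * SAMPLE_RATE), dtype=np.float32)  # 5-minute dummy recording
print(buffered_transcribe(audio)[:3])
```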
According to the Whisper paper, a significant portion of the training corpus consists of incomplete data (audio-transcript pairs without timestamp information), indicated by the <|nottimestamps|> token. When scaling to such incomplete and noisy transcription data, gains in speech transcription performance are inadvertently traded off against less accurate timestamp prediction. This motivates external modules that precisely align the transcript with the speech. A great deal of effort has gone into "forced alignment", which aligns a speech transcript with the audio waveform at the word or phoneme level. Acoustic phone models are typically trained within the Hidden Markov Model (HMM) framework, and alignments are obtained as a byproduct of the most probable state sequence.
The timestamps of these words or phones are often refined by external boundary-correction models. With the rapid growth of deep learning-based methods, some recent studies apply deep learning to forced alignment, for example using a bidirectional attention matrix or CTC segmentation with an end-to-end trained model. Combining a state-of-the-art ASR model with a lightweight phoneme recognition model, both trained on large-scale datasets, could bring further improvement.
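For intuition, here is a minimal, self-contained dynamic-programming sketch of forced alignment. It is a simplification of CTC segmentation rather than the exact WhisperX implementation: it assumes frame-level log-probabilities from a phoneme recognition model and a known transcript, and recovers a monotonic frame-to-token assignment from which timestamps can be read off.

```python
# Toy forced alignment by dynamic programming (simplified CTC segmentation).
import numpy as np

def forced_align(log_probs: np.ndarray, tokens: list[int]) -> list[tuple[int, int]]:
    """log_probs: (num_frames, vocab_size) frame-level log-probabilities.
    tokens: the known transcript as a sequence of token ids.
    Returns a list of (frame_index, token_index) assignments."""
    T, _ = log_probs.shape
    N = len(tokens)
    assert T >= N, "need at least one frame per token"
    # trellis[t, n]: best log-prob of aligning the first t frames to the first
    # n tokens, with each token occupying at least one consecutive frame.
    trellis = np.full((T + 1, N + 1), -np.inf)
    trellis[0, 0] = 0.0
    for t in range(1, T + 1):
        for n in range(1, N + 1):
            emit = log_probs[t - 1, tokens[n - 1]]
            trellis[t, n] = emit + max(trellis[t - 1, n],      # stay on token n
                                       trellis[t - 1, n - 1])  # advance to token n
    # Backtrack from the end state to recover the frame-to-token assignment.
    path, t, n = [], T, N
    while t > 0:
        path.append((t - 1, n - 1))
        if trellis[t - 1, n - 1] >= trellis[t - 1, n]:
            n -= 1
        t -= 1
    return list(reversed(path))

# Toy example: 6 frames, a 3-symbol "phoneme" vocabulary, transcript [0, 2].
rng = np.random.default_rng(0)
frame_log_probs = np.log(rng.dirichlet(np.ones(3), size=6))
print(forced_align(frame_log_probs, [0, 2]))
```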
To overcome these difficulties, the authors propose WhisperX, a system for efficient speech transcription of long-form audio with accurate word-level timestamps. It adds three stages on top of Whisper transcription (a usage sketch follows the list):
- Pre-segmentation of input audio with an external voice activity detection (VAD) model.
- Cutting and merging the resulting VAD segments into approximately 30-second input chunks, with cut points placed at minimally active speech regions.
- Forced alignment with an external phoneme model to provide accurate word-level timestamps.
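The snippet below sketches this pipeline with the `whisperx` package, following the example in the project README (https://github.com/m-bain/whisperX). The model name, audio path, and output keys are illustrative, and exact function signatures may differ between releases, so treat this as a sketch rather than the canonical usage.

```python
# Sketch of the WhisperX pipeline: VAD-based chunking + Whisper transcription,
# then forced alignment with a phoneme model for word-level timestamps.
import whisperx

device = "cuda"
audio_file = "long_podcast.mp3"   # placeholder path

# 1. Load audio and transcribe with Whisper; WhisperX pre-segments the audio
#    with a VAD model and cuts/merges the segments into ~30 s chunks.
model = whisperx.load_model("large-v2", device)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio)

# 2. Forced alignment with an external phoneme recognition model (e.g. a
#    wav2vec2-style CTC model) to refine segments into word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    for word in segment.get("words", []):
        print(word["word"], word["start"], word["end"])
```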
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.