End-to-end (E2E) neural networks have emerged as flexible and accurate models for multilingual automatic speech recognition (ASR). However, as the number of supported languages increases, particularly those with large character sets such as Chinese, Japanese, and Korean (CJK), the size of the output layer increases substantially. This expansion negatively impacts computing resources, memory usage, and asset size. The challenge becomes more pronounced in multilingual systems, where the output often consists of unions of characters or subwords from multiple languages. Therefore, researchers are grappling with the need to maintain model efficiency and performance while accommodating a wide range of languages and their associated character sets in E2E ASR systems.
Previous attempts to address these challenges in multilingual ASR have focused on byte-level representations, in particular using UTF-8 codewords as base tokens. This approach allows for a fixed output vocabulary size of 256, providing compactness and universality across languages. However, byte-level representations often result in longer sequences, especially for CJK languages, potentially increasing error rates as multiple predictions are required for individual characters. Researchers proposed byte-level subwords using byte pair encoding (BPE) on UTF-8 codeword sequences to mitigate this. While this reduced the number of decoding steps, it did not guarantee valid UTF-8 outputs. A dynamic programming algorithm was later introduced to recover valid characters from potentially invalid byte sequences, although this method was optimized for character validity rather than ASR quality.
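The two drawbacks described above are easy to see directly in Python (our illustration, not from the paper): UTF-8 spends three bytes per CJK character, and cutting a multi-byte sequence mid-character yields invalid UTF-8.

```python
# UTF-8 encodes CJK characters as multiple bytes, so a byte-level model
# must emit several tokens per character, lengthening the output sequence.
for ch in ["a", "中", "한"]:
    b = ch.encode("utf-8")
    print(ch, list(b), len(b))  # "a" -> 1 byte, the CJK characters -> 3 bytes

# A truncated multi-byte sequence is not valid UTF-8 -- the failure mode
# that byte-level BPE outputs can produce:
try:
    bytes([0xE4, 0xB8]).decode("utf-8")  # first 2 of the 3 bytes of "中"
except UnicodeDecodeError as err:
    print("invalid byte sequence:", err.reason)
```

A model that predicts bytes one at a time can therefore stop (or err) mid-character, which is why a recovery step is needed at all.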
The state-of-the-art method proposed by Apple researchers uses a robust representation learning approach built on a vector quantized auto-encoder. This method aims to optimize the byte-level representation specifically for E2E ASR tasks, addressing the limitations of previous approaches. The framework is designed to be data-driven, incorporating both text and audio information to improve accuracy. It offers the flexibility to include additional information such as lexicons or phonemes, making it adaptable to various ASR scenarios. Importantly, the method includes an error correction mechanism to handle invalid sequences, with recovery optimized for accuracy rather than other metrics. This approach aligns with the researchers’ criteria for an ideal byte-level representation: task-specific optimization, comprehensive use of information, and efficient error correction.
The proposed method formulates the representation problem as an optimization task with latent variables, using a vector quantized autoencoder (VQ-AE) architecture. This autoencoder consists of four key components: a label encoder, an acoustic encoder, a label decoder, and a vector quantizer. The system uses vector quantization as its bottleneck, with the indices of the quantized embeddings serving as latent variables.
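The core of the bottleneck can be sketched as a nearest-neighbor lookup. This minimal NumPy sketch (names and dimensions are our assumptions, not the paper's) shows how a quantizer maps each continuous label embedding to the index of its closest codebook entry; those indices serve as the latent variables.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))  # 256 code embeddings of dimension 8

def quantize(embeddings):
    # Squared Euclidean distance from every embedding to every code entry.
    d = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)        # one latent index per label token
    return idx, codebook[idx]     # indices plus the quantized vectors

labels = rng.normal(size=(5, 8))  # embeddings for 5 label tokens
indices, quantized = quantize(labels)
print(indices)                    # 5 indices, each in [0, 255]
```

Because each codebook holds 256 entries, every index fits in a single byte, which is what makes the latent variables byte-level tokens.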
The autoencoder is optimized using a loss function comprising four terms: cross-entropy losses for acoustic and label encoders, a CTC loss for the acoustic encoder, and a quantization loss. The method employs a residual VQ-VAE (RVQ-VAE) with two or three codebooks, each containing 256 embeddings, allowing each label token to be represented with 2–3 bytes.
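The residual part of the RVQ scheme can be illustrated with a short sketch (a simplification under our own assumptions, not the paper's implementation): a second codebook quantizes the residual left over by the first, so each label token is encoded as two indices, i.e. two bytes, with two 256-entry codebooks.

```python
import numpy as np

rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(256, 8)) for _ in range(2)]  # two 256-entry codebooks

def rvq_encode(vec):
    indices, residual = [], vec
    for cb in codebooks:
        i = int(((residual - cb) ** 2).sum(-1).argmin())  # nearest code
        indices.append(i)
        residual = residual - cb[i]   # next stage quantizes the leftover
    return indices

def rvq_decode(indices):
    # Reconstruction is the sum of the selected codes from each stage.
    return sum(cb[i] for cb, i in zip(codebooks, indices))

token = rng.normal(size=8)
code = rvq_encode(token)   # two indices in [0, 255] -> a 2-byte token code
approx = rvq_decode(code)
```

Adding a third codebook extends each code to 3 bytes in the same way, trading a longer code for a finer reconstruction.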
To handle potential errors in byte sequences, the system incorporates an error correction mechanism via the label decoder. This decoder estimates the most likely label sequence, optimizing accuracy even when faced with invalid byte sequences. The proposed VQ-based representation offers advantages over UTF-8, including fixed-length encoding, task-specific optimization, and improved error recovery.
The researchers evaluated their proposed VQ-based representation approach on bilingual dictation tasks in English and Mandarin, comparing it to outputs based on UTF-8 characters and subwords. Using a CTC-AED model with approximately 120 million parameters, they tested various output representations on datasets comprising 10,000 hours of English training data and 14,000 hours of Mandarin training data.
The results showed that the VQ-based representation consistently outperformed the UTF-8 subword outputs across different subword sizes. At 8000 subwords, the VQ-based approach achieved a 5.8% relative reduction in word error rate (WER) for English and a 3.7% relative reduction in character error rate (CER) for Mandarin compared to UTF-8. Compared to the character-based output, both the VQ and UTF-8 representations performed better on English while maintaining similar accuracy for Mandarin. In particular, the VQ-based method at 8000 subwords demonstrated a 14.8% relative error rate reduction for English and a 2.3% reduction for Mandarin compared to the character-based output, highlighting its effectiveness and flexibility in multilingual ASR systems.
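For clarity, the relative reductions quoted above are computed from baseline and system error rates as follows (the absolute WER/CER values here are illustrative, not the paper's):

```python
def relative_reduction(baseline, system):
    """Relative error-rate reduction, in percent, of `system` vs. `baseline`."""
    return (baseline - system) / baseline * 100.0

# e.g. a hypothetical baseline WER of 10.0% reduced to 9.42%
# corresponds to a 5.8% relative reduction.
print(round(relative_reduction(10.0, 9.42), 1))
```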
This study presents a robust algorithm to optimize byte-level representation in ASR, offering an alternative to UTF-8 representation. This approach can be optimized using both audio and text data, with an error correction mechanism designed to improve accuracy. Testing on English and Mandarin dictation datasets demonstrated a 5% relative reduction in token error rate (TER) compared to UTF-8-based methods. While the current study focused on bilingual ASR, the researchers acknowledge challenges in developing a universal representation for all languages, such as the problem of index collapse.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Asjad is a consultant intern at Marktechpost. He is pursuing a Bachelor's degree in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine Learning and Deep Learning enthusiast who is always researching the applications of Machine Learning in the healthcare domain.