This paper was accepted to the IEEE Spoken Language Technology (SLT) 2024 Workshop.
In this paper, we propose an algorithm to optimize a byte-level representation for end-to-end (E2E) automatic speech recognition (ASR). Byte-level representations are often used in large-scale multilingual ASR systems when the combined character set of the supported languages is large. The compactness and universality of a byte-level representation allow ASR models to use a smaller output vocabulary, which provides more flexibility. UTF-8 is the most widely used byte-level representation and has been successfully applied to ASR. However, it was not designed for ASR or any other machine learning task. Using an autoencoder with vector quantization, we demonstrate that a byte-level representation can be optimized for ASR to achieve higher accuracy. Our proposed framework can incorporate information from different modalities and provides an error correction mechanism. On an English/Mandarin dictation task, we demonstrate that a bilingual ASR model built with this approach outperforms one using the UTF-8 representation by a 5% relative reduction in error rate.
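The abstract mentions optimizing the representation with an autoencoder and vector quantization. As an illustration only (not the paper's actual architecture or code), the core vector quantization step maps each continuous latent vector to its nearest entry in a learned codebook; a minimal NumPy sketch, with hypothetical names and a hand-picked codebook:

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (N, D) array of continuous latent vectors.
    codebook: (K, D) array of K discrete code vectors.
    Returns the quantized vectors and their codebook indices.
    """
    # Squared Euclidean distance from every latent to every code: (N, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

# Toy example: a 2-entry codebook in 2-D latent space
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2]])
quantized, ids = vector_quantize(latents, codebook)
# ids → [0, 1]: each latent snaps to its nearest code
```

In a trained system the codebook entries would be learned jointly with the autoencoder (e.g. with a straight-through gradient estimator), and the discrete indices would serve as the byte-level output units.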