This paper was accepted into the Industry Track at NAACL 2024.
With the increasingly powerful computing capabilities and resources of today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud onto devices to better protect user privacy. However, implementing ASR on resource-constrained devices such as smartphones, smart wearables, and other small home automation devices remains challenging. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations to fit an advanced Conformer-based end-to-end streaming ASR system onto resource-constrained devices without accuracy degradation. We achieve speech recognition more than 5.26 times faster than real time (0.19 RTF) on small wearable devices, while minimizing power consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based serverless AI applications. Furthermore, we provide a complete theory of optimal pre-normalizers that numerically stabilize layer normalization at any Lp norm in any floating-point precision.
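The abstract does not reproduce the paper's pre-normalizer theory, but the underlying principle can be sketched: layer normalization is invariant to positive rescaling of its input, so dividing by a pre-normalizer (here the L∞ norm, i.e. the maximum absolute value) before computing mean and variance keeps the intermediate squares inside the representable range of a low-precision format. The function names below are illustrative, not from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain layer normalization; mean and variance are computed in x's dtype,
    # so (x - mean)**2 can overflow in low precision (float16 max ~= 65504).
    mean = x.mean()
    var = ((x - mean) ** 2).mean()
    return (x - mean) / np.sqrt(var + eps)

def prenormalized_layer_norm(x, eps=1e-5):
    # Hypothetical sketch of a pre-normalizer: since layer norm is invariant
    # to positive rescaling, dividing by the L-inf norm (max absolute value)
    # first keeps the intermediate squares representable in float16.
    scale = np.max(np.abs(x))
    if scale == 0.0:
        return x
    return layer_norm(x / scale, eps)

# +-400 is representable in float16, but 400**2 = 160000 overflows to inf,
# so the naive variance becomes inf and the naive output collapses to zero.
x = np.array([400.0, -400.0] * 8, dtype=np.float16)
reference = layer_norm(x.astype(np.float64))   # correct answer, ~ +-1.0
naive = layer_norm(x)                          # degenerate: all zeros
stable = prenormalized_layer_norm(x)           # matches the reference
```

This sketch only demonstrates why a pre-scaling step stabilizes the computation; the paper goes further and characterizes *optimal* pre-normalizers for general Lp norms and floating-point precisions.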