This paper was accepted into the Industry Track at NAACL 2024.
With the increasingly powerful computing capabilities and resources of today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud onto devices to better protect user privacy. However, implementing ASR on resource-constrained devices such as smartphones, smart wearables, and other small home automation devices remains challenging. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations to fit an advanced Conformer-based end-to-end streaming ASR system onto resource-constrained devices without accuracy degradation. We achieve speech recognition more than 5.26 times faster than real time (0.19 RTF) on small wearable devices, while minimizing power consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based serverless AI applications. Furthermore, we provide a complete theory of optimal pre-normalizers that numerically stabilize layer normalization at any Lp norm in any floating-point precision.
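The abstract does not reproduce the paper's pre-normalizer theory, but the underlying principle can be sketched: layer normalization is invariant to positive rescaling of its input, so dividing by a pre-normalizer (here the L∞ norm, i.e. the maximum absolute value) before computing mean and variance keeps the intermediate squares inside the representable range of a low-precision format. The function names below are illustrative, not from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain layer normalization; mean and variance are computed in x's dtype,
    # so (x - mean)**2 can overflow in low precision (float16 max ~= 65504).
    mean = x.mean()
    var = ((x - mean) ** 2).mean()
    return (x - mean) / np.sqrt(var + eps)

def prenormalized_layer_norm(x, eps=1e-5):
    # Hypothetical sketch of a pre-normalizer: since layer norm is invariant
    # to positive rescaling, dividing by the L-inf norm (max absolute value)
    # first keeps the intermediate squares representable in float16.
    scale = np.max(np.abs(x))
    if scale == 0.0:
        return x
    return layer_norm(x / scale, eps)

# +-400 is representable in float16, but 400**2 = 160000 overflows to inf,
# so the naive variance becomes inf and the naive output collapses to zero.
x = np.array([400.0, -400.0] * 8, dtype=np.float16)
reference = layer_norm(x.astype(np.float64))   # correct answer, ~ +-1.0
naive = layer_norm(x)                          # degenerate: all zeros
stable = prenormalized_layer_norm(x)           # matches the reference
```

This sketch only demonstrates why a pre-scaling step stabilizes the computation; the paper goes further and characterizes *optimal* pre-normalizers for general Lp norms and floating-point precisions.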