On-device virtual assistants (VAs) powered by automated speech recognition (ASR) require effective knowledge integration to recognize challenging entity-rich queries. In this paper, we conduct an empirical study of modeling strategies for server-side rescoring of spoken information domain queries using several categories of language models (LMs): N-gram word LMs and subword neural LMs. We investigate the combination of on-device and server-side signals and demonstrate significant WER improvements of 23% to 35% on several subpopulations of entity-centric queries by integrating multiple server-side LMs, compared to performing ASR on the device alone.
We also compare LMs trained on domain data against a base GPT-3 variant offered by OpenAI.
Furthermore, we show that fusing multiple server-side LMs trained from scratch more effectively combines the complementary strengths of each model and integrates knowledge learned from domain-specific data into the VA ASR system.
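As a minimal sketch of the fusion step, assuming a standard log-linear N-best rescoring formulation (the score notation and interpolation weights below are illustrative assumptions, not necessarily the exact configuration used in our experiments), each first-pass hypothesis $h$ can be rescored as

\[
s(h) = s_{\mathrm{ASR}}(h) + \sum_{k=1}^{K} \lambda_k \log p_{\mathrm{LM}_k}(h),
\]

where $s_{\mathrm{ASR}}(h)$ is the on-device first-pass score, $p_{\mathrm{LM}_k}$ is the $k$-th server-side LM, and the weights $\lambda_k$ are tuned on held-out data; the N-best list is then re-ranked by $s(h)$.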