In acoustic-to-word (A2W) embedding ASR, each vocabulary word is represented by a fixed-dimensional embedding vector that can be added to or removed from the system independently of the rest of the model. The approach is potentially an elegant solution to the problem of dynamic out-of-vocabulary (OOV) words, where speaker- and context-dependent named entities, such as contact names, must be incorporated into the recognizer on the fly for each speech utterance at test time. However, challenges remain in improving the overall accuracy of embedding-and-matching A2W ASR. In this paper, we contribute two methods that improve the accuracy of A2W embedding matching. First, we propose to internally produce multiple embeddings, rather than a single embedding, at each time step, which allows the A2W model to propose a richer set of hypotheses over multiple time segments in the audio. Second, we propose using word pronunciation embeddings in place of word spelling embeddings to reduce the ambiguity introduced by words with more than one possible pronunciation. We show that the above ideas provide significant accuracy improvements, with the same training data and nearly identical model size, in scenarios where dynamic OOV words play a crucial role. On a voice-based digital assistant query dataset that includes many user-dependent contact names, we observed up to an 18% decrease in word error rate using the proposed enhancements.
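To make the multi-embedding idea concrete, below is a minimal PyTorch sketch of an A2W scoring head that emits several embeddings per time step and matches each against a per-utterance table of word pronunciation embeddings. The class name, dimensions, and the simple dot-product scorer are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiEmbeddingA2WHead(nn.Module):
    """Sketch of an A2W head producing K embeddings per time step and
    scoring them against a vocabulary of word pronunciation embeddings.
    All hyperparameters here are hypothetical placeholders."""

    def __init__(self, acoustic_dim=512, embed_dim=256, num_embeddings=4):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embed_dim = embed_dim
        # One projection yields all K embeddings per frame at once.
        self.proj = nn.Linear(acoustic_dim, num_embeddings * embed_dim)

    def forward(self, acoustic, pron_table):
        # acoustic:   (batch, time, acoustic_dim) encoder outputs
        # pron_table: (vocab, embed_dim) pronunciation embeddings; the
        #             table can be rebuilt per utterance to inject
        #             dynamic OOV words such as contact names
        b, t, _ = acoustic.shape
        emb = self.proj(acoustic).view(b, t, self.num_embeddings, self.embed_dim)
        # Dot-product match of every embedding against every word,
        # giving a richer hypothesis set than a single embedding per frame.
        scores = torch.einsum("bthe,ve->bthv", emb, pron_table)
        return scores  # (batch, time, num_embeddings, vocab)

# Toy usage: 10 acoustic frames, 100-word vocabulary.
head = MultiEmbeddingA2WHead()
audio = torch.randn(2, 10, 512)
vocab = torch.randn(100, 256)        # e.g. from a pronunciation encoder
print(head(audio, vocab).shape)      # torch.Size([2, 10, 4, 100])
```

Because the vocabulary enters only through `pron_table`, adding or removing a word requires no retraining, which is what makes the embedding-and-matching formulation attractive for dynamic OOV entities.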