Using a vision-inspired keyword detection framework, we propose an architecture with input-dependent dynamic depth capable of processing streaming audio. Specifically, we extend a Conformer encoder with trainable binary gates that allow the network to dynamically bypass modules depending on the input audio. Our approach improves detection and localization accuracy on continuous speech using the 1000 most frequent words from LibriSpeech, while maintaining a small memory footprint. The gates also reduce the average amount of processing without degrading overall performance. These benefits are even more pronounced on Google Speech Commands superimposed on background noise, where up to 97% of processing is skipped on non-speech inputs, making our method particularly attractive for an always-on keyword spotter.
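The gating mechanism described above can be sketched as follows. This is a minimal, hypothetical illustration in PyTorch, not the paper's exact architecture: the gate predictor (a single linear layer over a time-pooled representation), the 0.5 threshold, and the use of a straight-through estimator are all assumptions made for the sake of the example.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Wraps a sub-module with an input-dependent binary gate.

    When the gate evaluates to 0 for an input, the sub-module is bypassed
    via an identity skip; when it evaluates to 1, the sub-module runs.
    """

    def __init__(self, module: nn.Module, dim: int):
        super().__init__()
        self.module = module
        # Tiny gate predictor (hypothetical design choice).
        self.gate_proj = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); pool over time for a per-utterance decision.
        logit = self.gate_proj(x.mean(dim=1))   # (batch, 1)
        soft = torch.sigmoid(logit)
        hard = (soft > 0.5).float()
        # Straight-through estimator: hard 0/1 decision in the forward pass,
        # gradient of the soft sigmoid in the backward pass.
        gate = hard + soft - soft.detach()
        gate = gate.unsqueeze(-1)               # (batch, 1, 1)
        return gate * self.module(x) + (1 - gate) * x

# Example: gate a small feed-forward block inside a 64-dim encoder.
block = GatedBlock(
    nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)), dim=64
)
y = block(torch.randn(2, 10, 64))
print(y.shape)
```

At inference time, an actual implementation would branch on the hard gate and skip computing `self.module(x)` entirely when the gate is 0, which is where the compute savings on non-speech inputs come from; the dense formulation above is only needed during training so that gradients flow through both paths.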