This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real time whether or not a person is speaking in a series of video frames. Although previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap remains in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames used by the ASD model. By doing so, we alleviate the need to process the entire sequence of future frames before making a decision, significantly reducing latency. Second, we propose a stricter constraint that limits the total number of past frames the model can access during inference. This addresses the persistent memory issues associated with running streaming ASD systems. Beyond these theoretical frameworks, we conduct extensive experiments to validate our approach. Our results show that constrained transformer models can achieve performance comparable to, or even better than, state-of-the-art recurrent models, such as uni-directional GRUs, with a significantly reduced number of context frames. Moreover, we shed light on the temporal memory requirements of ASD systems, revealing that a larger past context has a more pronounced impact on accuracy than the future context. When profiling on a CPU, we find that our efficient architecture is bound by the amount of past context it can use, and that the compute cost is negligible compared to the memory cost.
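
To make the two context constraints concrete, the minimal sketch below (not from the paper; `make_context_mask`, `max_past`, and `max_future` are illustrative names) builds a banded attention mask for a transformer-style ASD model, so that each frame attends to at most `max_past` preceding frames and `max_future` following frames.

```python
import torch

def make_context_mask(seq_len: int, max_past: int, max_future: int) -> torch.Tensor:
    """Boolean attention mask: frame i may attend to frame j only when
    -max_past <= (j - i) <= max_future; True marks an allowed pair."""
    idx = torch.arange(seq_len)
    offset = idx.unsqueeze(0) - idx.unsqueeze(1)  # offset[i, j] = j - i
    return (offset >= -max_past) & (offset <= max_future)

# Example: each frame sees at most 8 past and 2 future frames, so a
# decision for frame t lags the live stream by only 2 frames.
mask = make_context_mask(seq_len=16, max_past=8, max_future=2)
```

A boolean mask of this form can be passed directly as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where `True` marks positions allowed to attend. Capping `max_future` bounds decision latency, while capping `max_past` bounds the history the model must hold in memory for a streaming input.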