In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a large language model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We use low-rank adaptation (LoRA) modules for training both the audio encoder and the LLM. In addition, we implement a feature pooling strategy that enables the system to recognize global patterns and improve accuracy on tasks that are less dependent on individual sequence elements. Experimental results on voice trigger (VT) detection, device-directed speech detection (DDSD), and automatic speech recognition (ASR) show that our approach both significantly simplifies the typical input processing pipeline of virtual assistants and improves performance compared to models dedicated to each individual task. SELMA yields relative equal-error-rate improvements of 64% on the VT detection task and 22% on DDSD, while reaching word error rates close to the baseline.
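To make the two training ingredients named above concrete, the following is a minimal sketch, not the paper's actual implementation: a frozen linear layer augmented with a trainable low-rank (LoRA) update, and a mean-pooling step that collapses per-frame audio features into one global vector for utterance-level tasks such as VT detection or DDSD. The class and function names, the rank and scaling values, and the use of plain PyTorch are all assumptions for illustration; the paper's actual LoRA placement and pooling strategy may differ.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (W x + scale * B A x)."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero, so output equals the base layer
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def pool_audio_features(frame_features: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-frame audio features into a single global vector.

    frame_features: (batch, time, dim) sequence from the audio encoder.
    Returns a (batch, dim) summary for tasks that depend on global
    patterns rather than individual sequence elements.
    """
    return frame_features.mean(dim=1)


if __name__ == "__main__":
    batch, time, dim = 2, 50, 256
    frames = torch.randn(batch, time, dim)         # stand-in for audio encoder output
    adapter = LoRALinear(dim, dim, rank=8)         # LoRA-adapted projection
    pooled = pool_audio_features(adapter(frames))  # (2, 256) global representation
    print(pooled.shape)
```

In this sketch only the two small LoRA matrices are trainable, which is what keeps adapter-based training of both the audio encoder and the LLM lightweight, while the pooled vector serves the utterance-level classification tasks and the unpooled sequence would still be available for token-level tasks such as ASR.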