Device-directed speech detection (DDSD) is the binary classification task of distinguishing queries directed at a voice assistant from parallel conversations or background speech. Next-generation DDSD systems use verbal cues (e.g., acoustic, text, and/or automatic speech recognition (ASR) features) to classify speech as device-directed or not, and often have to contend with one or more of these modalities being unavailable when deployed in real-world environments. In this paper, we investigate fusion schemes that make DDSD systems more robust to missing modalities. At the same time, we study the use of nonverbal cues, specifically prosodic features, in addition to verbal cues for DDSD. We present different approaches for combining prosody scores and embeddings with the corresponding verbal cues, and find that prosody improves DDSD performance by up to 8.5% in terms of false acceptance (FA) rate at a fixed operating point via nonlinear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities at inference time.
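A minimal sketch, assuming PyTorch, of how modality dropout could be combined with nonlinear intermediate fusion of verbal and prosodic embeddings; the module, embedding dimensions, and dropout probability are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch (not the paper's implementation): during training, each
# modality embedding is randomly zeroed so the fused classifier learns to cope
# with inputs that are missing at inference time.
import torch
import torch.nn as nn


class FusionWithModalityDropout(nn.Module):
    def __init__(self, dims=(128, 128, 64), hidden=256, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop  # probability of dropping each modality per example
        self.classifier = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),              # nonlinear intermediate fusion
            nn.Linear(hidden, 1),   # device-directed vs. not (logit)
        )

    def forward(self, acoustic, text, prosody):
        feats = [acoustic, text, prosody]
        if self.training:
            for i, f in enumerate(feats):
                # Zero out the entire modality for an example with prob. p_drop.
                keep = (torch.rand(f.shape[0], 1, device=f.device) > self.p_drop).float()
                feats[i] = f * keep
        return self.classifier(torch.cat(feats, dim=-1))


# Usage: batch of 4 utterances with hypothetical embedding sizes; at inference,
# a missing modality would be passed as a zero tensor of the expected shape.
model = FusionWithModalityDropout()
logits = model(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 64))
```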