Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden-Unit BERT (HuBERT) have enabled the generation of lexical and acoustic representations that benefit speech recognition applications. We investigated the use of pre-trained model representations to estimate dimensional emotions, such as arousal, valence, and dominance, from speech. We observed that while valence may depend heavily on lexical representations, arousal and dominance depend primarily on acoustic information. In this work, we use multimodal fusion of pre-trained model representations to generate state-of-the-art speech emotion estimates, showing 100% and 30% relative improvements in concordance correlation coefficient (CCC) on valence estimation compared to standard acoustic and lexical baselines, respectively. Finally, we investigate the robustness of the pre-trained model representations against noise and reverberation degradation and note that lexical and acoustic representations are affected differently. We found that lexical representations are more robust to distortions than acoustic representations, and we show that knowledge distillation from a multimodal model helps improve the noise robustness of acoustic-only models.
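
For reference, the concordance correlation coefficient used to score the estimates is the standard Lin's CCC between a predicted score \(x\) and a ground-truth label \(y\):

\[
\mathrm{CCC} = \frac{2\,\rho\,\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},
\]

where \(\rho\) is the Pearson correlation between \(x\) and \(y\), and \(\mu\), \(\sigma^2\) denote their means and variances. Unlike plain correlation, CCC also penalizes scale and offset mismatches between predictions and labels.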
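The multimodal fusion described above could be realized along the following lines: utterance-level acoustic (HuBERT-style) and lexical (BERT-style) embeddings are pooled, concatenated, and passed to a small regression head that predicts arousal, valence, and dominance. This is a minimal sketch under assumed dimensions and pooling; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FusionEmotionRegressor(nn.Module):
    """Illustrative late-fusion head: concatenate pooled acoustic (HuBERT)
    and lexical (BERT) embeddings and regress arousal/valence/dominance.
    Layer sizes are assumptions, not the paper's configuration."""

    def __init__(self, acoustic_dim=768, lexical_dim=768, hidden_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(acoustic_dim + lexical_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # arousal, valence, dominance
        )

    def forward(self, acoustic_emb, lexical_emb):
        # acoustic_emb: (batch, acoustic_dim) mean-pooled HuBERT features
        # lexical_emb:  (batch, lexical_dim) pooled BERT features
        fused = torch.cat([acoustic_emb, lexical_emb], dim=-1)
        return self.head(fused)
```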
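The knowledge-distillation setup can be sketched as an acoustic-only student trained on degraded speech to match the emotion predictions of a frozen multimodal teacher in addition to the ground-truth targets. The MSE objective and the 0.5 weighting below are illustrative assumptions, not the paper's stated loss.

```python
import torch.nn as nn

def distillation_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Blended objective for an acoustic-only student: match the frozen
    multimodal teacher's soft predictions as well as the ground-truth labels.
    alpha (assumed) trades off distillation against direct supervision."""
    mse = nn.MSELoss()
    return alpha * mse(student_pred, teacher_pred) + (1 - alpha) * mse(student_pred, target)
```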