We consider the task of animating 3D facial geometry from a speech signal. Existing works are mainly deterministic and focus on learning a one-to-one mapping from the speech signal to 3D facial meshes on small data sets with a limited number of speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they cannot capture the full and diverse distribution of 3D facial movements that accompany speech in the real world. Importantly, the relationship between speech and facial movement is one-to-many, contains variation both between and within speakers, and therefore requires a probabilistic approach. In this article, we identify and address the key challenges that have so far limited the development of probabilistic models: the lack of data sets and metrics suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal such as speech. We first propose large-scale reference data sets and metrics suitable for probabilistic modeling. We then demonstrate a probabilistic model that achieves both diversity and speech fidelity, outperforming other methods on the proposed benchmarks. Finally, we show useful applications of probabilistic models trained on these large-scale data sets: we can generate diverse speech-driven 3D facial movements that match unseen speaker styles extracted from reference clips, and our synthetic meshes can be used to improve the performance of downstream audio-visual models.