Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE [31] and SLIP [64] have suggested that these approaches can be effectively combined, but notably their results rely on small pre-training datasets (on the order of 100 million samples) of the kind commonly used for these approaches. Here we investigate whether a similar approach can be effective when training with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked autoencoders (MAE [38]) and contrastive language-image pre-training (CLIP [68]), provides a benefit over CLIP when trained on a corpus of 11.3 million image-text pairs, but little to no benefit (as assessed on a suite of common vision tasks) over CLIP alone when trained on a larger corpus of 1.4 billion images. Our work provides much-needed clarity on the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.
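
To make "combining MAE and CLIP" concrete, the sketch below shows one plausible formulation: a per-batch loss that sums a MAE masked-patch reconstruction term and a CLIP-style symmetric contrastive term. This is a minimal illustration, not the paper's exact recipe; the shared image encoder, the hypothetical forward_masked and patchify helpers, the 0.75 mask ratio, and the loss weight lam are all assumptions introduced here for clarity.

import torch
import torch.nn.functional as F

def combined_training_step(images, texts, image_encoder, text_encoder,
                           mae_decoder, lam=1.0, temperature=0.07):
    """One hypothetical training step joining MAE and CLIP objectives.

    Assumed (not taken from the paper): a shared image encoder feeds both a
    MAE decoder (reconstructing masked patches) and a CLIP-style contrastive
    head; the two losses are simply added with weight `lam`.
    """
    # MAE branch: encode only visible patches, then reconstruct the masked ones.
    latent, mask, ids_restore = image_encoder.forward_masked(images, mask_ratio=0.75)
    pred_patches = mae_decoder(latent, ids_restore)
    target_patches = image_encoder.patchify(images)
    mae_loss = ((pred_patches - target_patches) ** 2)[mask.bool()].mean()

    # CLIP branch: contrast pooled image embeddings against text embeddings.
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(images), device=logits.device)
    clip_loss = 0.5 * (F.cross_entropy(logits, targets)
                       + F.cross_entropy(logits.t(), targets))

    return clip_loss + lam * mae_loss

In practice, the relative weighting of the two objectives and whether the contrastive branch sees masked or unmasked images are important design choices; the simple unweighted sum above is only one option.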