In this paper, we present a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs a named entity recognition model to extract candidate entities from the text, and then uses a CLIP model to select the entities that actually match the paired image as labels. The approach is simple and scales easily to billions of image-text pairs mined from the web, through which we have successfully created a dataset with 2 million distinct entities. We study new training approaches on the dataset collected with large-scale entity labels, including supervised pre-training, contrastive pre-training, and multi-task learning. Experiments show that supervised pre-training with large-scale entity labels is highly effective for image retrieval tasks, and that multi-task training further improves performance. The final model, called \textbf{MOFI}, achieves 83.59% mAP on the challenging GPR1200 dataset, compared to 67.33% for OpenAI's CLIP model. Additional experiments on linear probing and zero-shot image classification tasks also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the new dataset for learning general-purpose image representations.
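The labeling pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `extract_entities` and the scoring function are hypothetical stand-ins for a real NER model and a CLIP image-text similarity model; the threshold value is likewise an assumption.

```python
def extract_entities(caption: str) -> list[str]:
    # Placeholder NER: treat capitalized tokens as candidate entities.
    # A real pipeline would use a trained named entity recognition model.
    return [tok for tok in caption.split() if tok and tok[0].isupper()]

def entity_labels(image, caption: str, score_fn, threshold: float = 0.3) -> list[str]:
    """Assign entity labels to an image from its noisy paired caption.

    score_fn(image, entity) stands in for CLIP image-text similarity;
    only entities whose score passes the threshold are kept as labels.
    """
    candidates = extract_entities(caption)
    return [e for e in candidates if score_fn(image, e) >= threshold]
```

In practice, `score_fn` would embed the image and each candidate entity string with a CLIP image and text encoder and return their cosine similarity, so that entities mentioned in the text but absent from the image are filtered out.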