Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs), connecting image inputs to language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled, noisy text annotations at the image level. However, such a criterion may be insufficient for downstream tasks that need fine-grained vision representations, especially when region-level understanding is demanded of MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and accompanying modules. We formulate a new concept, promptable embeddings, whereby the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually enriched and spatially localized captioning framework to effectively generate region-text pseudo-labels at scale. Scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image-region recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to improve MLLMs, especially on referring and grounding tasks.
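For intuition only, the sketch below gives one possible reading of the abstract in PyTorch-style code: patch-level image embeddings are pooled under a box prompt into a region embedding, and a symmetric region-text contrastive loss aligns region embeddings with embeddings of region captions. All function names, the box-pooling step, and the loss form here are illustrative assumptions, not the paper's actual prompting module or training objective.

import torch
import torch.nn.functional as F


def region_embed(image_tokens, boxes, grid_size):
    # image_tokens: (B, H*W, D) patch embeddings from the vision encoder
    # boxes: (B, 4) normalized [x1, y1, x2, y2] box prompts, one region per image
    # grid_size: (H, W) layout of the patch grid
    B, N, D = image_tokens.shape
    H, W = grid_size
    device = image_tokens.device
    ys = (torch.arange(H, device=device) + 0.5) / H  # patch-center y coordinates
    xs = (torch.arange(W, device=device) + 0.5) / W  # patch-center x coordinates
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    centers = torch.stack([xx.flatten(), yy.flatten()], dim=-1)  # (N, 2) as (x, y)

    pooled = []
    for b in range(B):
        x1, y1, x2, y2 = boxes[b]
        inside = (centers[:, 0] >= x1) & (centers[:, 0] <= x2) \
               & (centers[:, 1] >= y1) & (centers[:, 1] <= y2)
        feats = image_tokens[b][inside]
        # fall back to the global average if the box covers no patch centers
        pooled.append(feats.mean(0) if feats.shape[0] > 0 else image_tokens[b].mean(0))
    return torch.stack(pooled)  # (B, D) region embeddings


def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    # symmetric InfoNCE between region embeddings and region-caption embeddings
    region_embs = F.normalize(region_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = region_embs @ text_embs.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

The masked average pooling above is only a stand-in for the lightweight prompting module described in the method section; the key point is that region representations are derived from the same image embeddings given spatial hints, rather than re-encoding each crop.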