Nomic AI has recently introduced two major releases in multimodal embedding models: Nomic Embed Vision v1 and Nomic Embed Vision v1.5. These models are designed to provide high-quality, fully replicable vision embeddings that integrate seamlessly with the existing Nomic Embed Text v1 and v1.5 models. This integration creates a unified embedding space that improves performance on text and multimodal tasks, outperforming competitors such as OpenAI CLIP and OpenAI Text Embedding 3 Small.
Nomic Embed Vision aims to address the limitations of existing multimodal models such as CLIP, which, while impressive in zero-shot multimodal capabilities, underperform on tasks outside of image retrieval. By aligning a vision encoder with the existing Nomic Embed Text latent space, Nomic has created a unified multimodal latent space that excels at both image and text tasks. This unified space has shown superior performance on benchmarks such as ImageNet zero-shot, MTEB, and Datacomp, making it the first open-weights model to achieve such results.
Nomic Embed Vision models can embed both image and text data, perform unimodal semantic search within datasets, and perform multimodal semantic search across datasets. With only 92M parameters, the vision encoder is ideal for high-volume production use cases and complements the 137M-parameter Nomic Embed Text encoder. Nomic has open-sourced the training code and replication instructions, allowing researchers to reproduce and improve the models.
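As an illustration of cross-modal search in this shared embedding space, the sketch below embeds a text query with Nomic Embed Text and a handful of images with Nomic Embed Vision, then ranks the images by cosine similarity. It is a minimal sketch that assumes the Hugging Face model IDs nomic-ai/nomic-embed-vision-v1.5 and nomic-ai/nomic-embed-text-v1.5 and the loading pattern described on their model cards (trust_remote_code, a CLS-token image embedding, mean-pooled text embeddings, and a "search_query: " prefix); details may differ from the official examples.

```python
# Minimal sketch of multimodal semantic search with Nomic Embed Vision + Text.
# Model IDs, prefixes, and pooling follow the Hugging Face model cards as
# understood here; treat them as assumptions, not official Nomic code.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
text_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True
)

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images, return_tensors="pt")
    with torch.no_grad():
        out = vision_model(**inputs).last_hidden_state
    return F.normalize(out[:, 0], p=2, dim=1)  # CLS token as the image embedding

def embed_query(query):
    enc = tokenizer(
        ["search_query: " + query], padding=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        out = text_model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (out * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling over tokens
    return F.normalize(pooled, p=2, dim=1)

image_embs = embed_images(["cat.jpg", "plush_toy.jpg", "street.jpg"])  # placeholder files
query_emb = embed_query("a stuffed animal on a bed")
scores = query_emb @ image_embs.T        # cosine similarity (embeddings are unit-norm)
print(scores.argsort(descending=True))   # image indices ranked by relevance
```

Because both encoders write into the same latent space, the same image embeddings can be reused for text-to-image, image-to-image, or image-to-text search without re-encoding.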
The performance of these models is compared against established baselines, and Nomic Embed Vision demonstrates superior performance across a range of tasks. For example, Nomic Embed v1 achieved 70.70 on ImageNet zero-shot, 56.7 on the Datacomp average, and 62.39 on the MTEB average. Nomic Embed v1.5 performed slightly better, indicating the robustness of these models.
Nomic Embed Vision powers multimodal search in Atlas, showcasing its ability to understand textual queries and image content. An example query demonstrated the model's semantic understanding by retrieving images of stuffed animals from a dataset of 100,000 images and captions.
Training Nomic Embed Vision involved several approaches to aligning the vision encoder with the text encoder, including training on image-text pairs together with text-only data, a Three Towers training method, and Locked-Image Text tuning (LiT). The most effective approach was to freeze the text encoder and train the vision encoder on image-text pairs, ensuring backward compatibility with existing Nomic Embed Text embeddings.
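Conceptually, this frozen-text alignment can be pictured as a CLIP-style contrastive objective in which only the vision tower receives gradients. The sketch below is schematic and not Nomic's training code: the encoder modules, embedding dimension, and temperature initialization are illustrative placeholders.

```python
# Schematic of aligning a trainable vision encoder to a frozen text encoder
# with a symmetric contrastive (InfoNCE) loss. Module names and values are
# illustrative placeholders, not Nomic's actual architecture or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenTextAlignment(nn.Module):
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder      # trainable
        self.text_encoder = text_encoder          # frozen: defines the target latent space
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature

    def forward(self, images: torch.Tensor, texts: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.vision_encoder(images), dim=-1)
        with torch.no_grad():                      # text embeddings stay fixed
            txt = F.normalize(self.text_encoder(texts), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.T  # image-to-caption similarities
        targets = torch.arange(images.shape[0], device=logits.device)
        # Symmetric InfoNCE: matched image-caption pairs sit on the diagonal.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

Because the text tower never moves, images are pulled into the latent space that existing Nomic Embed Text embeddings already occupy, which is what makes the backward compatibility described above possible.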
The vision encoder was trained on a subset of 1.5 billion image-text pairs using 16 H100 GPUs, achieving impressive results on the Datacomp benchmark, which includes 38 image classification and retrieval tasks.
Nomic has released two versions of Nomic Embed Vision, v1 and v1.5, each compatible with the corresponding version of Nomic Embed Text, allowing seamless multimodal workflows with either generation of models. The models are released under a CC-BY-NC-4.0 license, encouraging experimentation and research, with plans to re-license under Apache-2.0 for commercial use.
In conclusion, Nomic Embed Vision v1 and v1.5 transform multimodal embeddings, providing a unified latent space that excels at image and text tasks. With open-source training code and a commitment to continued innovation, Nomic AI sets a new standard in multimodal embedding models and offers powerful tools for diverse applications.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform receives more than 2 million monthly visits, illustrating its popularity among readers.