Nomic AI has launched Nomic Embed (nomic-embed-text-v1), an open-source, auditable, high-performance text embedding model built with a multi-stage training pipeline. Its extended context length supports tasks such as retrieval-augmented generation (RAG) and semantic search. Existing popular models, including OpenAI's text-embedding-ada-002, lack openness and auditability; Nomic Embed addresses the challenge of developing a text embedding model that outperforms these closed-source alternatives.
Current state-of-the-art models handle long-context text embedding tasks well, but their closed-source nature and unavailable training data limit auditability. Nomic Embed answers this with an open-source, auditable, high-performance design. Its key features include a context length of 8192 tokens, reproducibility, and transparency.
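To make the usage concrete, here is a minimal retrieval sketch via the sentence-transformers library. The checkpoint name and the task prefixes are taken from the public Hugging Face model card for nomic-embed-text-v1 and should be verified there; everything else is an illustrative choice:

```python
# Minimal sketch: embedding documents and a query with Nomic Embed.
# Assumes the Hugging Face checkpoint "nomic-ai/nomic-embed-text-v1";
# its model card specifies task prefixes such as "search_document:" and
# "search_query:" -- check it for current usage details.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = ["search_document: Nomic Embed is an open-source, long-context text embedding model."]
query = ["search_query: open source embedding models"]

doc_emb = model.encode(docs)      # shape: (1, 768)
query_emb = model.encode(query)
print(doc_emb @ query_emb.T)      # unnormalized dot-product similarity
```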
Nomic Embed is built through a multi-stage contrastive learning pipeline. It starts by training nomic-bert-2048, a BERT model with a context length of 2048 tokens, incorporating modifications inspired by MosaicBERT (illustrative sketches of two of these components follow the list). The training involves:
- Rotary position embeddings,
- SwiGLU activations,
- DeepSpeed and FlashAttention,
- BF16 precision.
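The sketches below are not the nomic-bert-2048 source; they are minimal PyTorch versions of rotary position embeddings and a SwiGLU feed-forward block, with all shapes and names chosen here for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embeddings: rotate each (even, odd) feature pair of x
    by a position-dependent angle. x has shape (batch, seq, heads, head_dim)."""
    _, seq, _, dim = x.shape
    pos = torch.arange(seq, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=x.dtype, device=x.device) / dim)
    angles = torch.outer(pos, inv_freq)          # (seq, dim/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class SwiGLU(nn.Module):
    """Gated feed-forward block: down-project SiLU(x W_gate) * (x W_up)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Unlike the fixed sinusoidal or learned absolute positions of the original BERT, rotary embeddings encode relative position directly in attention, which is part of what lets the context window be extended.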
The pretraining run also uses a larger vocabulary and a batch size of 4096. The model is then trained contrastively on ~235 million text pairs, with careful data curation and hard-example mining to keep the pairs high quality; a sketch of the standard contrastive objective follows below. Nomic Embed outperforms existing models on benchmarks such as the Massive Text Embedding Benchmark (MTEB), the LoCo benchmark, and the Jina Long Context Benchmark.
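Contrastive training over text pairs is commonly implemented as an in-batch-negatives InfoNCE objective. The sketch below shows that standard formulation, not Nomic's exact training code, and the temperature value is an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch negatives: row i of query_emb pairs with row i of doc_emb;
    every other document in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

Large batches matter for this stage because every other document in the batch acts as a free negative, so bigger batches make the objective harder and the resulting embeddings sharper.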
Nomic Embed not only outperforms closed-source models like OpenAI's text-embedding-ada-002 but also beats other open-source models on several benchmarks, and because the release is open, those scores can be independently re-run, as sketched below. The emphasis on transparency and reproducibility, with published model weights, training code, and curated data, shows a commitment to openness in AI development. Nomic Embed's performance on long-context tasks, together with the accompanying call for improved evaluation paradigms, underscores its importance in advancing the field of text embeddings.
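A hedged sketch of re-running one MTEB task against the released checkpoint, using the mteb package's classic interface (the exact API may differ across mteb versions, and the task and output path here are illustrative choices):

```python
# Sketch: independently re-running a single MTEB task.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
evaluation = MTEB(tasks=["STSBenchmark"])
results = evaluation.run(model, output_folder="results/nomic-embed-text-v1")
print(results)
```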
Pragati Jhunjhunwala is a Consulting Intern at MarktechPost. She is currently pursuing a B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast with a keen interest in data science software and applications, and is always reading about advancements in different fields of AI and ML.