In the evolving landscape of natural language processing (NLP), the ability to capture and process long textual contexts is paramount. Recent advances, as highlighted by Lewis et al. (2021), Izacard et al. (2022), and Ram et al. (2023), have significantly boosted the capabilities of language models, particularly through the development of text embeddings. These embeddings serve as the backbone for a host of applications, including retrieval-augmented generation (RAG) for large language models (LLMs) and semantic search. They transform sentences or documents into low-dimensional vectors that capture their semantic content, which in turn facilitates tasks such as clustering, classification, and information retrieval.
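To make the role of embeddings concrete, here is a minimal sketch of semantic search: documents are ranked against a query by the cosine similarity of their vectors. The vectors below are made up purely for illustration; in practice they would come from an embedding model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in practice these come from an embedding model.
query_vec = np.array([0.12, -0.48, 0.33, 0.71])
doc_vecs = {
    "doc_about_embeddings": np.array([0.10, -0.50, 0.30, 0.75]),
    "doc_about_cooking":    np.array([-0.60, 0.22, 0.05, -0.11]),
}

# Rank documents by similarity to the query vector.
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
print(ranked[0][0])  # most semantically similar document
```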
However, one notable limitation has been the context length these models can handle. Most of the widely recognized open-source models on the MTEB benchmark, such as E5 by Wang et al. (2022), GTE by Li et al. (2023), and BGE by Xiao et al. (2023), are limited to a context length of 512 tokens. This restriction undermines their usefulness in scenarios where understanding the broader context of a document is crucial. In contrast, models capable of handling context lengths beyond 2048 tokens, such as Voyage-lite-01-instruct by Voyage (2023) and text-embedding-ada-002 by Neelakantan et al. (2022), remain behind closed doors.
Against this backdrop, the introduction of nomic-embed-text-v1 marks a significant milestone. The model is not only open source but also supports a sequence length of 8192 tokens, outperforming its predecessors in both short- and long-context evaluations. What sets it apart is its comprehensive approach, combining open weights, open data, and a 137-million-parameter design under an Apache-2.0 license, ensuring accessibility and transparency.
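Because the weights are openly released, the model can be used much like any other embedding model. The sketch below assumes the weights are published on the Hugging Face Hub under the identifier "nomic-ai/nomic-embed-text-v1" and load via sentence-transformers; consult the official model card for the exact identifier and any recommended task prefixes.

```python
# Sketch: encoding documents with the released model via sentence-transformers.
# Assumes the weights are available on the Hugging Face Hub as
# "nomic-ai/nomic-embed-text-v1"; check the official model card for the exact
# identifier and any required task-specific prefixes.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

documents = [
    "search_document: Text embeddings map documents to dense vectors.",
    "search_document: Long-context models can encode up to 8192 tokens.",
]
embeddings = model.encode(documents)  # shape: (2, embedding_dim)
print(embeddings.shape)
```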
The path to achieving such a feat involved meticulous stages of data preparation and model training. Initially, a masked language modeling (MLM) pretraining phase used resources such as BooksCorpus and a 2023 Wikipedia dump, employing the bert-base-uncased tokenizer to create data chunks suitable for long-context training. This was followed by unsupervised contrastive pretraining, which leveraged a vast collection of 470 million pairs across diverse datasets to refine the model's understanding through consistency filtering and selective embedding.
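The snippet below is a simplified sketch of the chunking idea: splitting a long document into fixed-size token chunks with the bert-base-uncased tokenizer. The chunk length and other details are illustrative and may differ from the actual preprocessing pipeline.

```python
# Simplified sketch of long-context chunking with the bert-base-uncased
# tokenizer. The chunk length and handling of special tokens are illustrative;
# the actual preprocessing pipeline for nomic-embed-text-v1 may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 2048  # example chunk length for long-context MLM pretraining

def chunk_document(text: str, max_len: int = MAX_LEN):
    """Split a document's token ids into fixed-size chunks."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]

chunks = chunk_document("A long Wikipedia article about language models. " * 1000)
print(len(chunks), len(chunks[0]))
```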
The architecture of nomic-embed-text-v1 reflects a careful adaptation of BERT to accommodate the extended sequence length. Innovations such as rotary positional embeddings, SwiGLU activation, and Flash Attention integration mark a strategic overhaul to improve performance and efficiency. The training regime, characterized by a 30% masking rate and carefully tuned settings, further underlines the rigorous effort to achieve optimal results.
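As a rough illustration of one of these components, the sketch below implements a SwiGLU feed-forward block in PyTorch. The dimensions and layer names are assumptions for illustration and may not match the nomic-embed-text-v1 implementation exactly.

```python
# Minimal SwiGLU feed-forward block, used in place of the standard GELU MLP in
# many modern transformers. Dimensions and naming are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gating projection
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value projection
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate) * value, then project back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

block = SwiGLU(dim=768, hidden_dim=2048)
out = block(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```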
When subjected to the rigors of benchmarks such as GLUE, MTEB, and specialized long-context assessments, nomic-embed-text-v1 demonstrated exceptional prowess. In particular, its performance on the JinaAI long-context benchmark and the LoCo benchmark underlines its strength in handling long texts, an area where many predecessors fell short.
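For readers who want to reproduce this kind of evaluation, the mteb package can run an embedding model against MTEB tasks. The sketch below follows the classic interface with a single illustrative task; API details vary across mteb versions, and the model identifier is the same assumption as in the earlier snippet.

```python
# Sketch: evaluating an embedding model on one MTEB task with the mteb package.
# API details vary across mteb versions; this follows the classic interface.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

evaluation = MTEB(tasks=["Banking77Classification"])  # one illustrative task
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```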
However, the journey of nomic-embed-text-v1 extends beyond mere performance metrics. Its development process, which emphasizes end-to-end auditability and the potential for replication, sets a new standard for transparency and openness in the AI community. By publishing the model weights, the codebase, and a curated training dataset, the team behind nomic-embed-text-v1 invites continued innovation and scrutiny.
In conclusion, nomic-embed-text-v1 emerges not only as a technological advance but as a beacon for the open-source movement in AI. It dismantles barriers to entry in the domain of long-context text embeddings, promising a future in which the depth of understanding matches the breadth of human discourse.
Check out the Paper and the GitHub repository. All credit for this research goes to the researchers of this project.
Vineet Kumar is a Consulting Intern at MarktechPost. He is currently pursuing his bachelor's degree at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast, passionate about research and the latest advances in deep learning, computer vision, and related fields.