BAAI, together with researchers from the University of Science and Technology of China, introduces BGE M3-Embedding. The M3 refers to three novel properties of the text embeddings: multi-lingual, multi-functional, and multi-granular. The work identifies major challenges in existing embedding models, such as the inability to support multiple languages, restricted retrieval functionality, and difficulty handling inputs of varying granularity.
Existing embedding models such as Contriever, GTR, and E5 have brought notable advances to the field, but each lacks broad language support, multiple retrieval functionalities, or long-input handling. These models are primarily trained for English and support only a single retrieval functionality. The proposed BGE M3-Embedding supports more than 100 languages, offers three retrieval functionalities (dense, sparse, and multi-vector retrieval), and processes inputs ranging from short sentences to long documents of up to 8,192 tokens.
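To make the three retrieval functionalities concrete, here is a minimal sketch of how each one scores a query-passage pair. The function names and toy inputs are illustrative only (they are not the FlagEmbedding library API): dense retrieval compares single sentence embeddings, sparse retrieval sums weights of overlapping terms, and multi-vector retrieval uses ColBERT-style late interaction over token embeddings.

```python
import math

def dense_score(q_vec, p_vec):
    # Dense retrieval: inner product between single query/passage embeddings.
    return sum(q * p for q, p in zip(q_vec, p_vec))

def sparse_score(q_weights, p_weights):
    # Sparse (lexical) retrieval: sum the products of learned term weights
    # over the terms shared by query and passage.
    return sum(w * p_weights[t] for t, w in q_weights.items() if t in p_weights)

def multi_vector_score(q_vecs, p_vecs):
    # Multi-vector retrieval (late interaction): match each query token
    # embedding to its best passage token embedding, then average the maxima.
    return sum(max(dense_score(q, p) for p in p_vecs) for q in q_vecs) / len(q_vecs)
```

For example, `dense_score([1, 0], [0.5, 0.5])` gives 0.5, and with identical token embeddings on both sides `multi_vector_score` returns its maximum.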
M3-Embedding involves a novel self-knowledge distillation approach and optimized batching strategies for long input lengths, for which the researchers used diverse, large-scale multilingual datasets from sources such as Wikipedia and S2ORC. It facilitates three common retrieval functionalities: dense retrieval, lexical (sparse) retrieval, and multi-vector retrieval. The distillation process combines the relevance scores from the different retrieval functionalities into an integrated teacher signal, which allows the model to learn all of the retrieval tasks efficiently.
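The distillation idea can be sketched as follows. This is a simplified illustration, not the paper's exact training code: the three per-candidate relevance scores are summed (here unweighted) into a teacher score, and each individual retrieval head is pushed toward the teacher distribution with a cross-entropy loss over the candidate passages.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of candidate scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_signal(dense, sparse, multi):
    # Integrate the three relevance scores per candidate into one
    # teacher score (an unweighted sum, as a simplification).
    return [d + s + m for d, s, m in zip(dense, sparse, multi)]

def distill_loss(student_scores, teacher_scores):
    # Cross-entropy between the teacher and student softmax
    # distributions over the candidate passages.
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is minimized when a head's ranking distribution matches the integrated one, which is how a single model learns to serve all three retrieval modes at once.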
The model is evaluated on multilingual retrieval, long-document retrieval (MLDR), varied sequence lengths, and narrative question answering. The evaluation metric was nDCG@10 (normalized discounted cumulative gain). Experiments showed that M3-Embedding outperformed existing models in more than 10 languages while remaining on par with them in English. Its performance was similar to other models on shorter inputs but clearly stronger on longer texts.
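For readers unfamiliar with the metric, nDCG@10 rewards rankings that place highly relevant documents near the top of the first ten results. A minimal implementation (using linear gain; some variants use 2^rel - 1 instead):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking,
    # so a perfect ordering scores 1.0.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

For instance, a perfectly ordered result list like `[3, 2, 1]` scores 1.0, while burying the only relevant document at rank three, as in `[0, 0, 3]`, scores 0.5.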
In conclusion, M3-Embedding is a significant advancement in text embedding models. It is a versatile solution that supports multiple languages, varied retrieval functionalities, and different input granularities. The proposed model addresses crucial limitations of existing methods, marking a substantial step forward in information retrieval. It outperforms baseline methods such as BM25, mDPR, and E5, demonstrating its effectiveness in addressing the identified challenges.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Pragati Jhunjhunwala is a Consulting Intern at MarktechPost. She is currently pursuing a B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast with a keen interest in data science software and applications, and is always reading about advancements in different fields of AI and ML.