LLMs such as GPT, Gemini, and Claude have achieved notable performance but remain proprietary, with only limited training details disclosed. Open-source models like LLaMA-3 provide weights but offer little transparency about their data and training methods. Efforts to create fully transparent LLMs, such as Pythia, Amber, and OLMo, aim to advance scientific research by sharing more details, including pre-training data and training code. Despite these efforts, open-source LLMs still lag behind state-of-the-art models in tasks such as reasoning, cognition, and coding. Greater transparency is crucial to democratizing LLM development and promoting academic research.
Researchers from M-A-P, the University of Waterloo, Wuhan AI Research, and 01.AI have released MAP-Neo, a transparent and highly capable bilingual language model with 7 billion parameters, trained on 4.5 trillion high-quality tokens. This fully open-source model matches the performance of leading closed-source LLMs. The release includes the cleaned pre-training corpus, the data-cleaning pipeline, intermediate checkpoints, and an optimized training and evaluation framework. Comprehensive documentation covers data curation, model architecture, training processes, evaluation code, and insights into LLM creation, aiming to support and inspire the global research community, especially in non-English-speaking regions.
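Because the weights and checkpoints are public, the base model can be loaded with standard tooling. Below is a minimal sketch using Hugging Face transformers; the repository ID m-a-p/neo_7b is an assumption based on the release organization, so verify the exact name on the project page.

```python
# Minimal sketch: loading the MAP-Neo base model with Hugging Face transformers.
# NOTE: the repo id "m-a-p/neo_7b" is assumed from the release org; check the project page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/neo_7b"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # a 7B model fits on a single modern GPU in bf16
    device_map="auto",
)

prompt = "The key ingredients of a transparent LLM release are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```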
The advancement of open-source LLMs is crucial for AI research and applications, and recent efforts focus on improving both performance and transparency. MAP-Neo-7B stands out by providing intermediate checkpoints, a comprehensive data-cleaning process, an accessible pre-training corpus, and reproduction code, unlike the Mistral, LLaMA-3, Pythia, Amber, and OLMo models. MAP-Neo-7B excels on benchmarks of Chinese and English understanding (C-EVAL, MMLU), mathematical ability (GSM8K), and coding (HumanEval). It achieves high scores across all of these tests, setting a new standard for transparency and performance and promoting trustworthiness and collaboration in the research community.
The tokenizer is trained using byte-pair encoding (BPE) via SentencePiece on 50 billion samples, with sentence length capped at 64,000. Priority is given to code, math, and academic data. The vocabulary size is 64,000, with a maximum sentence-piece length of 16 to improve Chinese performance. Numbers are tokenized as individual digits, and unknown UTF-8 characters fall back to byte-level granularity. No normalization or dummy prefixes are applied, and character coverage is kept at 99.99%. Removal of extra whitespace is disabled to preserve code formatting, a choice made after addressing initial training issues. The tokenizer's efficiency varies across languages and data sources.
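These settings map directly onto SentencePiece training flags. Below is a minimal sketch of what such a configuration could look like; the input file and model prefix are placeholders, and this is an illustration of the described settings rather than the authors' actual training script.

```python
# Illustrative SentencePiece BPE configuration matching the settings described
# above; "corpus.txt" and "neo_tokenizer" are placeholder names.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # placeholder for the sampled training data
    model_prefix="neo_tokenizer",
    model_type="bpe",                    # byte-pair encoding
    vocab_size=64000,                    # vocabulary size of 64,000
    max_sentence_length=64000,           # cap on input sentence length
    max_sentencepiece_length=16,         # longer pieces help Chinese text
    character_coverage=0.9999,           # 99.99% character coverage
    split_digits=True,                   # numbers tokenized as single digits
    byte_fallback=True,                  # unknown UTF-8 falls back to bytes
    normalization_rule_name="identity",  # no normalization
    add_dummy_prefix=False,              # no dummy prefix
    remove_extra_whitespaces=False,      # preserve code formatting
)
```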
The MAP-Neo model family performs impressively across benchmarks for both the base and chat models, and it particularly excels at coding, math, and instruction following. MAP-Neo outperforms other models on standard benchmarks, demonstrating its academic and practical value, and the base model's high-quality data contributes to its strong results on complex reasoning tasks. Compared to other transparent LLMs, MAP-Neo shows significant progress. The effectiveness of Iterative DPO is evident, with substantial improvements on chat-related benchmarks; however, the limited capabilities of certain base models restrict their performance on instruction-based chat benchmarks.
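For readers unfamiliar with DPO, the sketch below shows the standard Direct Preference Optimization objective in PyTorch. It is a generic illustration of the technique, not MAP-Neo's actual training code, and the function and variable names are hypothetical.

```python
# Generic Direct Preference Optimization (DPO) loss (Rafailov et al., 2023);
# a sketch of the standard objective, not MAP-Neo's training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of summed per-sequence log-probabilities."""
    # Implicit reward: log-ratio of the policy to the frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred response's reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In the iterative variant, this objective is applied in rounds: the updated policy generates fresh response pairs, which are re-ranked into new preference data for the next round of training.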
In conclusion, data colonialism is a concern as companies exploit algorithms, leading to the manipulation of human behavior and market dominance. The concentration of AI capabilities in large technology companies and elite universities highlights the need to democratize access to AI to counter data colonialism. While open-source models offer an alternative, they often lack full transparency in their development processes, making trust and reproducibility difficult. MAP-Neo addresses these issues as an open-source bilingual LLM that documents all key processes in detail. This transparency can reduce deployment costs, particularly for Chinese LLMs, fostering inclusive innovation and mitigating the dominance of English-centric LLMs.
Review the Paper and Project. All credit for this research goes to the researchers of this project.