The language model landscape is evolving rapidly, driven by the empirical success of scaling models to larger parameter counts and computational budgets. In this era of large language models, Mixture-of-Experts (MoE) has emerged as a key approach, offering a way to manage computational costs while scaling model parameters. However, challenges remain in ensuring expert specialization in conventional MoE architectures such as GShard, which activate the top-K out of N experts. Recent applications of MoE architectures in Transformers have successfully scaled language models to substantial sizes with remarkable performance, underscoring the vast potential of MoE language models.
A conventional MoE architecture replaces the feed-forward networks (FFNs) in a Transformer with MoE layers, where each layer comprises multiple experts structurally identical to a standard FFN. Each token is assigned to one or two experts, which creates two main challenges: knowledge hybridity and knowledge redundancy. These problems arise because the number of experts is limited, so the tokens assigned to a given expert tend to cover diverse kinds of knowledge, which in turn compromises that expert's ability to utilize all of it simultaneously.
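To make the conventional setup concrete, here is a minimal PyTorch sketch of a GShard-style top-K MoE layer. The module names, sizes, and the per-expert routing loop are illustrative choices for readability, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A standard FFN expert: two linear layers with a nonlinearity."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.w2(F.gelu(self.w1(x)))

class TopKMoELayer(nn.Module):
    """Conventional MoE layer: each token is routed to its top-k experts."""
    def __init__(self, d_model, d_hidden, num_experts=16, k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x):                                   # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # (tokens, num_experts)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            slot_mask = topk_idx == e                        # tokens that picked expert e
            token_mask = slot_mask.any(dim=-1)
            if token_mask.any():
                weight = (topk_probs * slot_mask).sum(dim=-1)[token_mask]
                out[token_mask] += weight.unsqueeze(-1) * expert(x[token_mask])
        return out
```

With only 16 experts and two active per token, each expert must absorb very heterogeneous knowledge, which is precisely the hybridity problem the paper targets.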
In response to these challenges, a team of researchers from DeepSeek-AI proposed DeepSeekMoE, an innovative MoE architecture designed to achieve maximum expert specialization. As illustrated in Figure 2, this architecture employs two main strategies: fine-grained expert segmentation and shared expert isolation.
Fine-grained expert segmentation addresses the limitation of a fixed number of experts by splitting the intermediate hidden dimension of each FFN expert. This allows a larger number of finer-grained experts to be activated while keeping the total parameter count and computational cost constant. The result is a far more flexible combination of activated experts, substantially improving combinatorial flexibility and enabling more precise and targeted knowledge acquisition, and thus higher levels of specialization.
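The arithmetic behind this claim can be checked directly. The sketch below uses assumed example sizes (d_model, d_ffn, 16 experts, top-2 routing, segmentation factor 4) to show that splitting experts leaves total and activated parameters unchanged while the number of possible expert combinations explodes.

```python
from math import comb

d_model, d_ffn = 1024, 4096   # illustrative hidden sizes
N, K = 16, 2                  # conventional setup: 16 experts, top-2 routing
m = 4                         # segmentation factor: each expert split into m pieces

# Each fine-grained expert's intermediate dimension shrinks by m, so its
# parameter count drops by m, while expert count and activated count grow by m.
coarse_params_per_expert = 2 * d_model * d_ffn
fine_params_per_expert   = 2 * d_model * (d_ffn // m)

total_coarse  = N * coarse_params_per_expert
total_fine    = (N * m) * fine_params_per_expert
active_coarse = K * coarse_params_per_expert
active_fine   = (K * m) * fine_params_per_expert

assert total_coarse == total_fine and active_coarse == active_fine

print("routing combinations, coarse:", comb(N, K))          # C(16, 2)  = 120
print("routing combinations, fine:  ", comb(N * m, K * m))  # C(64, 8) ≈ 4.4e9
```

The parameter and compute budgets stay fixed, but the router now has billions of ways to compose activated experts instead of 120, which is what makes more targeted specialization possible.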
Shared expert isolation complements fine-grained segmentation by designating certain experts as shared experts that are always activated, regardless of the routing module. These shared experts aim to capture and consolidate common knowledge across diverse contexts, mitigating redundancy among the remaining routed experts. This isolation improves parameter efficiency by ensuring that each routed expert retains its specialization, focusing on distinctive aspects of the input. Notably, the shared expert isolation strategy is inspired by Rajbhandari et al. (2022), but it is approached from an algorithmic rather than a systems point of view.
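Continuing the earlier sketch (and reusing its `Expert` and `TopKMoELayer` classes), the following shows how shared expert isolation can be wired in: a few experts bypass the router and process every token, while the rest are chosen per token. The expert counts and k below are placeholders, not the paper's exact configuration.

```python
class SharedExpertMoELayer(nn.Module):
    """Sketch of shared expert isolation: always-on shared experts plus
    top-k routed experts, combined with a residual connection."""
    def __init__(self, d_model, d_hidden, num_routed=63, num_shared=1, k=7):
        super().__init__()
        self.shared = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_shared)])
        self.routed = TopKMoELayer(d_model, d_hidden, num_experts=num_routed, k=k)

    def forward(self, x):                # x: (tokens, d_model)
        out = self.routed(x)             # token-dependent, specialized routed experts
        for expert in self.shared:
            out = out + expert(x)        # always-active experts absorb common knowledge
        return x + out                   # residual connection around the MoE block
```

Because the shared experts see every token, common patterns need not be duplicated across routed experts, which is exactly the redundancy the isolation strategy is meant to remove.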
The article also delves into the load-imbalance problem that learned routing strategies can encounter, which risks routing collapse and computation bottlenecks. To mitigate these risks, the authors introduce balance losses at both the expert level and the device level, emphasizing the importance of balanced computation across devices.
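For intuition, here is an expert-level auxiliary balance loss in the style of Switch Transformer/GShard. The coefficient and exact formulation are illustrative rather than the paper's own losses, which additionally include a device-level term not sketched here.

```python
import torch
import torch.nn.functional as F

def expert_balance_loss(router_probs, topk_idx, num_experts, alpha=0.01):
    """Illustrative expert-level balance loss (Switch/GShard style).

    router_probs: (tokens, num_experts) softmax outputs of the router.
    topk_idx:     (tokens, k) indices of the experts each token was dispatched to.
    The loss is minimized when tokens are spread uniformly across experts.
    """
    # f_i: fraction of routing slots dispatched to expert i
    dispatch = F.one_hot(topk_idx, num_experts).float()   # (tokens, k, num_experts)
    token_fraction = dispatch.sum(dim=1).mean(dim=0)      # (num_experts,)
    # P_i: mean routing probability assigned to expert i
    mean_prob = router_probs.mean(dim=0)                  # (num_experts,)
    return alpha * num_experts * torch.sum(token_fraction * mean_prob)
```

A term like this is added to the training loss so the router is penalized whenever a handful of experts receive a disproportionate share of tokens.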
The training data, drawn from DeepSeek-AI's large-scale multilingual corpus, focuses primarily on English and Chinese but includes other languages as well. For the validation experiments, a subset of the corpus containing 100 billion tokens is sampled to train the models.
The evaluation spans benchmarks for language modeling, language understanding, reasoning, reading comprehension, code generation, and closed-book question answering. DeepSeekMoE is rigorously benchmarked against baselines including Hash Layer, Switch Transformer, and GShard, consistently demonstrating superiority within the MoE architecture landscape.
The evaluation results, detailed in Table 1 and Table 2, highlight the strengths of DeepSeekMoE over other models. Noteworthy observations include the significant performance advantage of DeepSeekMoE over GShard at comparable total parameters and sparsity. The paper also presents comparisons with larger GShard models and with dense models, demonstrating the scalability and efficiency of DeepSeekMoE.
Previous research on MoE models has often suggested limited gains from fine-tuning. However, the authors cite the findings of Shen et al. (2023) indicating that MoE models can benefit from supervised fine-tuning, and DeepSeekMoE 16B in particular does. Experimental results demonstrate the adaptability of the model and the competitive performance of DeepSeekMoE Chat 16B on alignment tasks.
Encouraged by the success of DeepSeekMoE 16B, the authors embark on a preliminary exploration to extend DeepSeekMoE to 145B. In this initial study, DeepSeekMoE 145B, trained on 245B tokens, demonstrates consistent advantages over GShard and promises to match or exceed the performance of DeepSeek 67B (Dense). The authors plan to make the final version of DeepSeekMoE 145B public.
In conclusion, the article presents DeepSeekMoE as an innovative MoE language model architecture aimed at maximum expert specialization. Through its two strategies, fine-grained expert segmentation and shared expert isolation, DeepSeekMoE achieves significantly higher expert specialization and better performance than existing MoE architectures. Its scalability is demonstrated through experiments, and the authors offer a preliminary glimpse of its potential at the unprecedented scale of 145 billion parameters. With the public release of the DeepSeekMoE 16B model checkpoint on GitHub, the authors aim to provide valuable knowledge to both academia and industry, promoting the advancement of large-scale language models.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Vineet Kumar is a Consulting Intern at MarktechPost. He is currently pursuing his bachelor's degree at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast, passionate about research and the latest advances in deep learning, computer vision, and related fields.