Introduction
In natural language processing (NLP), the development of large language models (LLMs) has proven to be a transformative endeavor. These models, equipped with massive numbers of parameters and trained on extensive datasets, have demonstrated unprecedented proficiency across many NLP tasks. However, the exorbitant cost of training these models from scratch has led researchers to explore alternative strategies. One pioneering strategy that has emerged to enhance the capabilities of LLMs is knowledge fusion, a concept explored in depth in the research paper "Knowledge Fusion of Large Language Models" by Wan, Huang, Cai, Quan, et al.
Recognizing the need to address redundancy in the functionalities of newly developed LLMs, this innovative approach offers a compelling solution. The article delves into the intricate process of fusing knowledge from multiple LLMs, presenting a promising avenue to refine and amplify the performance of these linguistic models.
The fundamental idea is to combine the strengths and capabilities of existing LLMs, transcending the limitations of individual models. By fusing existing pre-trained LLMs, we can create a more powerful model that surpasses each of the individual source models.
Understanding the fusion of knowledge from LLMs
The paper begins by highlighting the challenges and costs of training LLMs from scratch. The authors propose knowledge fusion as an efficient and cost-effective alternative. Instead of merging weights directly, the approach focuses on externalizing the collective knowledge of the source LLMs and transferring it to a target model. The research presents FUSELLM, a method that leverages the generative distributions of source LLMs, with the goal of improving the capabilities of the target model beyond those of any individual source LLM.
The primary objective of LLM fusion is to externalize the inherent knowledge embedded in multiple source LLMs and integrate their capabilities into a target LLM. The article emphasizes prompting the LLMs to manifest their knowledge by predicting the next token in a given text. The probabilistic distributions generated by different source LLMs for the same text are then merged into a single representation, creating a unified probabilistic understanding of the text.
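To make the idea concrete, here is a minimal sketch (my own illustration, not the authors' implementation) of how next-token distributions from several source models could be combined into one fused distribution. It assumes Hugging Face-style models that, for simplicity, share a single tokenizer; real source LLMs generally do not, which is exactly why the paper needs token alignment.

```python
import torch
import torch.nn.functional as F

def fused_next_token_distribution(models, tokenizer, text, weights=None):
    """Illustrative only: combine the next-token distributions of several
    source LLMs into a single probabilistic representation of the text."""
    inputs = tokenizer(text, return_tensors="pt")
    distributions = []
    for model in models:
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]    # next-token logits
        distributions.append(F.softmax(logits, dim=-1))  # turn logits into probabilities
    stacked = torch.stack(distributions)                 # (num_models, 1, vocab_size)
    if weights is None:                                  # equal weighting by default
        weights = torch.full((len(models),), 1.0 / len(models))
    # Weighted average of the source distributions -> one unified view of the text
    return (weights.view(-1, 1, 1) * stacked).sum(dim=0)
```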
Implementation Details: Token Alignment and Merger Strategies
The paper presents two crucial implementation details to ensure effective knowledge fusion: token alignment and fusion strategies.
Token alignment is achieved using a Minimum Edit Distance (MinED) strategy, which improves the success rate of aligning tokens produced by the different LLMs' tokenizers.
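The paper's alignment procedure is more involved than this, but the core idea of MinED, mapping a source token to the target-vocabulary token with the smallest edit distance, can be sketched as follows (the function names are mine, not the paper's):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,           # deletion
                dp[j - 1] + 1,       # insertion
                prev + (ca != cb),   # substitution (free if characters match)
            )
    return dp[-1]

def align_token_min_ed(source_token: str, target_vocab: list) -> str:
    """Map a token from a source model's vocabulary to the closest
    token in the target model's vocabulary by minimum edit distance."""
    return min(target_vocab, key=lambda t: edit_distance(source_token, t))
```

By contrast, an exact-match criterion would simply discard any source token that does not appear verbatim in the target vocabulary, which is why MinED aligns more tokens successfully.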
The fusion strategies, namely MinCE and AvgCE, evaluate the quality of different LLMs and assign different levels of importance to their distribution matrices based on cross-entropy scores.
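Conceptually, MinCE keeps only the distribution matrix of the source model that predicts the text best (lowest cross-entropy), while AvgCE averages the matrices with weights derived from those cross-entropy scores. Below is a hedged sketch, assuming the distribution matrices have already been aligned to the target tokenizer and that AvgCE's weighting is a softmax over negative cross-entropy (the paper's exact weighting may differ).

```python
import torch

def fuse_distribution_matrices(dist_matrices, ce_scores, strategy="MinCE"):
    """dist_matrices: list of (seq_len, vocab_size) tensors, one per source LLM,
    already aligned to the target tokenizer.
    ce_scores: per-model cross-entropy on the gold text (lower = better)."""
    stacked = torch.stack(dist_matrices)     # (num_models, seq_len, vocab_size)
    ce = torch.tensor(ce_scores)
    if strategy == "MinCE":
        # Keep the distribution of the single best-scoring source model.
        return stacked[ce.argmin()]
    if strategy == "AvgCE":
        # Weight each model by its cross-entropy quality, then average.
        weights = torch.softmax(-ce, dim=0)  # lower CE -> larger weight
        return (weights.view(-1, 1, 1) * stacked).sum(dim=0)
    raise ValueError(f"Unknown fusion strategy: {strategy}")
```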
Experiments and evaluation
The research conducts experiments in a challenging LLM fusion scenario, where the source models exhibit minimal commonalities. Three representative open-source models, Llama-2, OpenLLaMA, and MPT, are selected as the source LLMs for fusion, with another Llama-2 serving as the target LLM. The experiments cover benchmarks that evaluate reasoning, common sense, and code generation capabilities.
Performance on different benchmarks
Comprehensive evaluation of FUSELLM's performance on various benchmarks provides valuable insights into its effectiveness. Table 1 shows the overall results of FUSELLM compared to the baseline methods on Big-Bench Hard (BBH). Notably, FUSELLM demonstrates an average relative performance gain of 5.16% over the original Llama-2 across all 27 tasks. Specific tasks, such as Hyperbaton, show substantial improvements, underscoring FUSELLM's ability to leverage collective knowledge to improve performance.
Moving to the Common Sense (CS) benchmark in Table 2, FUSELLM consistently outperforms the baselines across all tasks, achieving a 1.25% relative performance improvement over Llama-2. This trend holds even in challenging tasks like ARC-Challenge and OpenBookQA, where FUSELLM exhibits significant improvements, highlighting its effectiveness on complex problems.
In the context of code generation, Table 3 illustrates FUSELLM's zero-shot performance on the MultiPL-E (ME) benchmark. Outperforming Llama-2 on 9 out of 10 tasks, FUSELLM shows a notable improvement in pass@1 score, particularly for specific programming languages such as R. Although a performance gap remains relative to OpenLLaMA and MPT, FUSELLM still achieves a notable average performance gain of 6.36%, well above the 1.37% improvement seen in Llama-2 CLM.
Fused Probabilistic Distributions: Accelerating Optimization
A crucial aspect of FUSELLM's success lies in its use of fused probabilistic distributions from multiple LLMs. Figure 2 compares the few-shot Chain-of-Thought (CoT) performance of Llama-2 CLM and FUSELLM with different scales of training data on BBH. FUSELLM improves exact match (EM) accuracy by 2.5% and reaches the best performance of Llama-2 CLM within 520 million tokens, a 3.9x reduction in token requirements compared to Llama-2 CLM. This indicates that the probabilistic distributions derived from the source LLMs contain more readily learnable knowledge than the original text sequences, which accelerates the optimization process.
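In training terms, this means the target model is optimized against two signals at once: the usual next-token prediction loss on the original text and a divergence term pulling its predictions toward the fused distributions. Here is a minimal sketch of such a combined objective, with KL divergence and the mixing weight `lam` chosen purely for illustration rather than taken from the paper.

```python
import torch.nn.functional as F

def fusellm_style_loss(target_logits, labels, fused_dist, lam=0.9):
    """target_logits: (seq_len, vocab_size) logits from the target model.
    labels: (seq_len,) gold next-token ids for the original text.
    fused_dist: (seq_len, vocab_size) fused source distributions.
    lam: illustrative mixing weight between the two loss terms."""
    # Standard causal language modeling loss on the original text.
    clm_loss = F.cross_entropy(target_logits, labels)
    # Divergence between the target model's predictions and the fused distributions.
    log_probs = F.log_softmax(target_logits, dim=-1)
    fusion_loss = F.kl_div(log_probs, fused_dist, reduction="batchmean")
    return lam * clm_loss + (1 - lam) * fusion_loss
```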
Analysis of the implementation process
Drilling down into the implementation details of FUSELLM reveals several considerations critical to its success: the number of source LLMs, the token alignment criteria, and the choice of fusion function all shape FUSELLM's performance.
- Number of source LLMs: Table 4 shows how FUSELLM performs as the number of source models varies. Performance improves noticeably as the number of models increases from 1 to 3, with consistent gains observed on BBH.
- Criteria for token alignment: Proper token alignment is crucial during LLM fusion. The proposed MinED method consistently outperforms the exact-match (EM) method, demonstrating the effectiveness of MinED in aligning tokens from multiple models.
- Fusion function: The choice of fusion function is critical, and FUSELLM with MinCE consistently outperforms AvgCE on all benchmarks. This emphasizes the importance of the fusion function in preserving the distinct advantages of the individual LLMs.
FUSELLM versus knowledge distillation and ensemble/weight merging
Comparative analyses with traditional techniques such as knowledge distillation and ensemble/weight merging shed light on the unique strengths of FUSELLM.
- FUSELLM versus knowledge distillation: FUSELLM outperforms knowledge distillation, especially on BBH, where the improvement achieved by FUSELLM (5.16%) exceeds the more modest gain from knowledge distillation (2.97%). This highlights FUSELLM's ability to leverage the collective knowledge of multiple LLMs more effectively.
- FUSELLM versus ensemble/weight merging: In scenarios where multiple LLMs originate from the same base model but were trained on different corpora, FUSELLM consistently achieves the lowest average perplexity across three domains compared to ensemble and weight-merging methods. This reinforces FUSELLM's potential to leverage collective knowledge more effectively than traditional fusion methods.
Also Read: Knowledge Distillation: Theory and Case Study from Start to Finish
You can find the code, model weights, and public data on GitHub: FUSELLM.
Conclusion: Revealing future possibilities
The paper concludes with compelling results, showing the effectiveness of FUSELLM over individual source LLMs and established baselines. The study opens a promising avenue for future exploration in LLM fusion. The findings emphasize the potential of combining the various capabilities and strengths of structurally different LLMs, shedding light on a cost-effective and powerful approach to developing large language models.
Knowledge fusion of large language models is an innovative solution in a world where the demand for advanced natural language processing capabilities continues to grow. This research paves the way for future efforts in creating unified models that leverage the collective intelligence of diverse LLMs, pushing the boundaries of what can be achieved in natural language understanding and generation.
I'm looking forward to your thoughts on large language model (LLM) knowledge fusion. Feel free to share your thoughts on any other informative and noteworthy articles you have come across in the comments section.
Also Read: A Complete Guide to Tuning Large Language Models