Language model adaptation is a crucial area of artificial intelligence that focuses on extending large pre-trained language models so they perform effectively across multiple languages. This research matters because global AI applications need models that can understand and generate text in many languages. Despite the impressive performance of English LLMs, their capabilities drop significantly in less common languages, making additional adaptation techniques necessary.
One of the most significant challenges when adapting language models to new languages is catastrophic forgetting, which occurs when a model loses its proficiency in the original language while learning a new one, severely limiting its usefulness. Retaining the capabilities of the base model is essential for solving tasks in the new language, as skills such as math and coding learned in English are invaluable for problem solving and reasoning in other languages.
Current methods to address catastrophic forgetting include continued pretraining and instruction fine-tuning with experience replay, which mixes data from the original language into training on the new one. However, this approach falls short of fully mitigating forgetting, especially when the exact source data is unknown and the replay mixture must be approximated, which reduces its effectiveness and calls for further regularization to maintain model performance in the base language.
Researchers from INSAIT, LogicStar.ai, ETH Zurich, the University of Chicago, and Together AI presented a new approach called Branch-and-Merge (BAM). This method iteratively merges multiple models, each fine-tuned on a different subset of the training data, to achieve smaller-magnitude but higher-quality weight changes. By combining these models, BAM reduces forgetting while maintaining learning efficiency. BAM splits the training data into multiple chunks and fine-tunes the base model on these chunks in parallel; the resulting models are then merged to form the new base model for the next iteration. This iterative process minimizes the total weight change, reducing the risk of catastrophic forgetting, while the use of multiple training chunks helps retain essential base-language skills.
In detail, BAM splits the training data into N chunks and, in each iteration, fine-tunes the base model on K (typically two) of these chunks in parallel before merging the resulting models, as sketched below. This significantly reduces the overall weight shift while preserving most of the learning from the parallel training steps. The research team applied BAM to adapt models such as MISTRAL-7B and LLAMA-3-8B from predominantly English data to Bulgarian and German. They found that BAM consistently improved performance in both the target and source languages compared to standard training methods. For example, LLAMA-3-8B trained with BAM improved Bulgarian task performance by 10.9% and English task performance by 1.3%, demonstrating the effectiveness of the method.
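The following is a minimal sketch of this branch-and-merge loop, assuming simple parameter averaging as the merge operator. The toy model, data, fine_tune routine, and hyperparameters are illustrative placeholders, not the paper's actual setup, and the paper's merge and training details may differ.

```python
# Minimal sketch of a Branch-and-Merge (BAM)-style training loop.
# Assumes plain weight averaging as the merge operator; model, data,
# and hyperparameters below are toy placeholders for illustration.
import copy
import torch
import torch.nn as nn

def fine_tune(model: nn.Module, chunk: torch.Tensor, steps: int = 10) -> nn.Module:
    """Fine-tune a copy of the base model on one data chunk (toy objective)."""
    branch = copy.deepcopy(model)
    opt = torch.optim.SGD(branch.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = (branch(chunk) - chunk.sum(dim=-1, keepdim=True)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return branch

def merge(models: list[nn.Module]) -> nn.Module:
    """Merge branches by averaging their parameters (one possible merge operator)."""
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return merged

# Toy data split into N chunks; K branches are trained per iteration before merging.
N, K = 8, 2
data_chunks = list(torch.randn(N, 64, 16).unbind(0))
base = nn.Linear(16, 1)

for start in range(0, N, K):
    # In practice the K branches would be trained in parallel on separate workers.
    branches = [fine_tune(base, chunk) for chunk in data_chunks[start:start + K]]
    base = merge(branches)  # merged model becomes the base for the next iteration
```

In this sketch, each iteration only ever moves the base model by an average of the branch updates, which is one way to realize the smaller-magnitude weight changes described above.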
To better understand BAM's behavior, the researchers conducted an extensive empirical study, adapting the MISTRAL-7B and LLAMA-3-8B models from predominantly English data to Bulgarian and German. The results showed that BAM significantly reduced forgetting while matching or improving target-domain performance compared to standard continued pretraining and instruction fine-tuning. Specifically, BAM enabled LLAMA-3-8B to outperform its standard counterpart by 10.9% on Bulgarian tasks and 1.3% on English tasks, an improvement attributed to the smaller-magnitude but more efficient weight shifts induced by BAM.
BAM was evaluated using both approximate and minimal experience replay. Approximate experience replay combined 15.1 billion unique tokens from sources such as OpenWebText, English Wikipedia, and GitHub repositories, while minimal experience replay used only 5 billion OpenWebText tokens for German and 10 billion tokens for Bulgarian. The study found that approximate experience replay led to a larger gain in target-domain performance and less source-domain forgetting than minimal experience replay; a sketch of how such a replay mixture can be assembled follows.
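To illustrate the general idea of experience replay, the sketch below assembles a training stream that mixes target-language documents with a fraction of source-language documents. The build_replay_mix helper, the replay_ratio value, and the toy document lists are assumptions for illustration, not the paper's exact mixture or token counts.

```python
# Hedged sketch of building a training mix with experience replay: target-language
# documents are interleaved with a share of source-language (e.g. English) documents.
import random

def build_replay_mix(target_docs, source_docs, replay_ratio=0.2, seed=0):
    """Return a shuffled training stream in which roughly replay_ratio of the
    documents come from the original (source-language) corpus."""
    rng = random.Random(seed)
    n_replay = int(len(target_docs) * replay_ratio / (1 - replay_ratio))
    replay = rng.sample(source_docs, min(n_replay, len(source_docs)))
    mix = list(target_docs) + replay
    rng.shuffle(mix)
    return mix

# Toy usage: Bulgarian/German target documents plus English replay documents.
target = [f"target_doc_{i}" for i in range(80)]
source = [f"english_doc_{i}" for i in range(1000)]
train_stream = build_replay_mix(target, source, replay_ratio=0.2)
```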
The effectiveness of BAM was also demonstrated in instruction fine-tuning. Using 928,000 English fine-tuning samples combined with German or Bulgarian data, BAM slightly improved learning in both target languages while significantly reducing forgetting. For example, in Bulgarian instruction fine-tuning, models trained with BAM outperformed standard instruction-fine-tuned models by 10.8% on Bulgarian tasks and 1.3% on English tasks.
In conclusion, the Branch-and-Merge (BAM) method offers a robust solution to catastrophic forgetting in language model adaptation. By inducing minimal but effective weight shifts, it preserves the model's capabilities in the original language while improving its performance in the target language. This approach can significantly benefit practitioners working on multilingual AI applications, as it provides a more efficient way to adapt large language models to diverse language environments. The research showed that BAM effectively balances learning and forgetting, making it a valuable method for continued pre-training and instruction fine-tuning in both shared-alphabet and non-shared-alphabet languages.
Review the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter (twitter.com/Marktechpost).
Join our Telegram Channel and LinkedIn Group.
If you like our work, you will love our Newsletter.
Don't forget to join our SubReddit with over 46k members.
Nikhil is a Consultant Intern at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI and Machine Learning enthusiast who is always researching applications in fields like Biomaterials and Biomedical Science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.