The rapid advancement of large language models has ushered in a new era of natural language processing capabilities. However, a major challenge remains: most of these models are primarily trained on a limited set of widely spoken languages, leaving vast linguistic diversity unexplored. This limitation not only restricts accessibility to cutting-edge language technologies, but also perpetuates a technological gap between language communities.
In this study, the researchers address this challenge with SambaLingo, a new AI method for adapting existing high-performance language models to new languages. The approach leverages the strengths of pre-trained models while tailoring them to the unique characteristics of the target language.
Previous efforts to address this problem have primarily focused on training multilingual or language-specific monolingual models from scratch. However, these approaches face significant obstacles, including the curse of multilingualism, data scarcity, and the substantial computational resources required. Adapting English-centric models to new languages has emerged as a promising alternative, demonstrating the potential to outperform language-specific models pre-trained from scratch.
The SambaLingo methodology begins with the selection of a suitable base model that has already shown strong performance in its original language. In this study, the researchers chose the open-source Llama 2 7B model, known for its English-language capabilities, as the starting point.
To effectively capture the linguistic nuances of the target language, the researchers expanded the model's vocabulary by adding non-overlapping tokens from the target language and initializing their embeddings from the sub-word embeddings of the original tokenizer. This step ensures that the model can tokenize and represent the new language efficiently, paving the way for a smooth adaptation.
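A minimal sketch of what this vocabulary-expansion step might look like with the Hugging Face transformers library is shown below. The target-language tokenizer path is a placeholder, and initializing each new token's embedding as the mean of its sub-word embeddings under the original tokenizer is one reasonable reading of the strategy described above, not the authors' exact code.

```python
# Hedged sketch: add target-language tokens to an English-centric model's vocabulary
# and initialize the new embeddings from sub-word embeddings of the original tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"
base_tokenizer = AutoTokenizer.from_pretrained(BASE)
orig_tokenizer = AutoTokenizer.from_pretrained(BASE)  # frozen copy used for sub-word lookup
target_tokenizer = AutoTokenizer.from_pretrained("path/to/target-language-tokenizer")  # placeholder
model = AutoModelForCausalLM.from_pretrained(BASE)

# Target-language tokens that the base vocabulary does not already contain.
new_tokens = sorted(set(target_tokenizer.get_vocab()) - set(base_tokenizer.get_vocab()))
base_tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(base_tokenizer))

embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for token in new_tokens:
        # Recover the token's surface text, split it into sub-words of the original
        # vocabulary, and average their embeddings as the new token's initialization.
        surface = target_tokenizer.convert_tokens_to_string([token])
        sub_ids = orig_tokenizer(surface, add_special_tokens=False)["input_ids"]
        if sub_ids:
            new_id = base_tokenizer.convert_tokens_to_ids(token)
            embeddings[new_id] = embeddings[torch.tensor(sub_ids)].mean(dim=0)
```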
The researchers then applied continual pre-training, feeding the model a carefully curated mix of English and target-language web data drawn from CulturaX. The mixture followed a 1:3 ratio of English to target-language data, biased towards the target language, to strike a balance between preserving the model's existing knowledge and adapting it to the new linguistic landscape.
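As an illustration of that 1:3 mix, here is a hedged sketch using the Hugging Face datasets library to sample from two CulturaX language subsets with the corresponding probabilities. The language codes and streaming setup are assumptions for the example, not the authors' exact data pipeline, and access to CulturaX on the Hub may require accepting the dataset terms.

```python
# Hedged sketch: build a continual pre-training mixture biased 1:3 towards the target language.
from datasets import load_dataset, interleave_datasets

# CulturaX exposes per-language subsets; "hu" (Hungarian) is used purely as an example.
english = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)
target = load_dataset("uonlp/CulturaX", "hu", split="train", streaming=True)

# 1 part English to 3 parts target language -> sampling probabilities 0.25 / 0.75.
mixed = interleave_datasets(
    [english, target],
    probabilities=[0.25, 0.75],
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixed.take(5):
    print(example["text"][:80])
```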
To further improve the model's alignment with human preferences, the researchers implemented a two-stage process: supervised fine-tuning (SFT) followed by direct preference optimization (DPO). For SFT, they used the ultrachat-200k dataset together with a machine-translated version of it. For DPO, they employed the ultrafeedback and cai-conversation-harmless datasets, combining English and machine-translated data in a 10:1 ratio.
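The DPO objective itself is compact enough to write out. The following is a minimal PyTorch sketch of the standard DPO loss on a batch of chosen/rejected log-probabilities; the β value is a typical default, not necessarily the one used in the paper.

```python
# Hedged sketch: the standard DPO loss, given summed log-probabilities of the chosen and
# rejected responses under the policy being trained and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more the policy prefers each response than the reference does.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the margin between chosen and rejected rewards up, scaled by beta.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Toy usage on a batch of 4 preference pairs with random log-probabilities.
batch = 4
print(dpo_loss(torch.randn(batch), torch.randn(batch),
               torch.randn(batch), torch.randn(batch)))
```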
The researchers rigorously evaluated the SambaLingo models on a range of tasks, including language modeling, translation, text classification, open-book and closed-book question answering, and various natural language understanding benchmarks (Table 1 in the paper). The models were tested on nine typologically diverse languages: Arabic, Thai, Turkish, Japanese, Hungarian, Russian, Bulgarian, Serbian, and Slovenian.
Across multiple benchmarks, the SambaLingo models consistently outperformed existing state-of-the-art models in these languages. For example, on the perplexity benchmark, which measures language modeling performance, they achieved lower perplexity than all existing baselines on a held-out set of the training data (Figure 1 in the paper). Furthermore, when scaled up to the Llama 2 70B parameter scale, the SambaLingo models performed even better, outperforming their 7B counterparts on multiple benchmarks despite being trained on fewer tokens.
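For context on the perplexity comparison, here is a hedged sketch of how perplexity is typically computed for a causal language model; the checkpoint path and evaluation text are placeholders, and this is not the paper's exact evaluation harness.

```python
# Hedged sketch: compute perplexity of a causal LM on a held-out target-language passage.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/adapted-checkpoint"  # placeholder for an adapted model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "..."  # a held-out passage in the target language
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels provided, the model returns the mean cross-entropy over predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"perplexity = {math.exp(loss.item()):.2f}")
```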
To validate the quality of the model's output and its alignment with human preferences, the researchers employed GPT-4 as an impartial judge, evaluating the models' responses to real user prompts. The results were promising: SambaLingo consistently outperformed other models in the same languages, as measured by GPT-4's preference votes and the explanations it gave for them.
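To make the LLM-as-judge setup concrete, below is a hedged sketch of a pairwise comparison using the OpenAI Python client. The prompt template and rubric are assumptions for illustration, not the authors' exact evaluation protocol.

```python
# Hedged sketch: pairwise GPT-4-as-judge comparison of two model responses to the same prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = (
    "You are an impartial judge. Given a user prompt and two responses, decide which "
    "response is more helpful, correct, and fluent in the language of the prompt.\n\n"
    "Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}\n\n"
    "Answer with 'A' or 'B', followed by a brief explanation."
)

def judge(prompt: str, response_a: str, response_b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(prompt=prompt, a=response_a, b=response_b),
        }],
    )
    return completion.choices[0].message.content
```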
In summary, the SambaLingo methodology represents a significant step towards democratizing artificial intelligence across linguistic diversity. By leveraging the strengths of existing high-performance models and adapting them to new linguistic landscapes, this approach offers a scalable and efficient answer to the challenge of language barriers. With its state-of-the-art performance and alignment with human preferences, SambaLingo paves the way for a future where the benefits of AI transcend linguistic boundaries, fostering inclusivity and accessibility for all.
Check out the paper. All credit for this research goes to the researchers of this project.
Vineet Kumar is a Consulting Intern at MarktechPost. He is currently pursuing his bachelor's degree at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast and is passionate about research and the latest advances in Deep Learning, Computer Vision, and related fields.