Large language models (LLMs) such as ChatGPT have achieved remarkable results on complex language-processing tasks. However, most mainstream LLMs, such as LLaMA, are pre-trained on English-dominated corpora; Google's LaMDA, for example, is pre-trained on text that is more than 90% English. This limits the performance of LLMs in languages other than English, which is a real concern for non-English-speaking users.
Recent LLMs such as ChatGPT, PaLM, and LLaMA show advanced reasoning, planning, and learning-from-experience capabilities. While many LLMs understand multiple languages, imbalanced linguistic resources pose challenges: BLOOM's pre-training covers only 46 languages, leaving coverage far from comprehensive, and LLaMA struggles with languages other than English. Research on vocabulary extension and transfer pipelines suggests that linguistic capabilities can be transferred efficiently at minimal cost.
Researchers at Fudan University's School of Computer Science have focused on effectively transferring language-generation and instruction-following capabilities to languages other than English. To this end, they analyze the impact of key factors such as vocabulary extension, further pre-training, and instruction tuning on transfer. The assessment involves four widely used standardized benchmarks.
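To make the vocabulary-extension factor concrete, here is a minimal, hypothetical sketch of extending a LLaMA tokenizer with target-language tokens and resizing the embedding matrix using Hugging Face transformers; the checkpoint and token list are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: extending a LLaMA tokenizer with target-language tokens
# and resizing the embedding matrix. The checkpoint and token list are
# illustrative assumptions, not the paper's exact recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# In practice the added vocabulary would be tens of thousands of subwords
# learned from a Chinese corpus (e.g., with SentencePiece); this toy list
# only shows the mechanics.
num_added = tokenizer.add_tokens(["你好", "世界"])

# New embedding rows are randomly initialized and must be trained during
# further pre-training; this is the extra cost the study weighs.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```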
The research explores transferring language-generation and instruction-following capabilities to non-English languages using LLaMA. The study starts with Chinese because of its rich linguistic resources, then extends the findings to more than ten low-resource languages. The models examined include LLaMA, LLaMA2, Chinese LLaMA, Chinese LLaMA2, and Open Chinese LLaMA, each with a different pre-training scale. The assessment includes benchmarks such as LLM-Eval, C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench, and response quality is judged on accuracy, fluency, informativeness, logical coherence, and harmlessness. The study achieves state-of-the-art performance with minimal pre-training data, providing insights for non-English LLM development.
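As a toy illustration of how the five quality dimensions can be combined into a single score (the AVG. column of Table 1 below), consider the following sketch; the scores shown are made up, not the paper's results.

```python
# Toy sketch: averaging the five LLM-Eval quality dimensions into one
# score, as in the AVG. column of Table 1. All scores here are made up.
scores = {
    "accuracy": 3.2,
    "fluency": 4.1,
    "informativeness": 3.5,
    "logical_coherence": 3.8,
    "harmlessness": 4.6,
}
avg = sum(scores.values()) / len(scores)
print(f"AVG. = {avg:.2f}")
```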
The study investigates language transfer to non-English languages with LLaMA, focusing on vocabulary extension, the impact of further pre-training scale, and multilingual proficiency. Surprisingly, extending the vocabulary decreases performance on Chinese. Although a larger pre-training scale initially improves response quality, the gains soon plateau, suggesting that further pre-training mainly improves language generation rather than knowledge acquisition. English proficiency degrades when training exclusively on Chinese. Evaluations on 13 low-resource languages show that SFT data improves response quality, with Arabic, Indonesian, and Vietnamese performing best. Code-switching experiments suggest that LLaMA learns cross-lingual semantic alignment during pre-training, which improves transferability. The study emphasizes the nuances of developing effective LLMs for languages other than English.
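To illustrate the kind of code-switching probe mentioned above, here is a hedged sketch that swaps some Chinese words for their English translations; the bilingual lexicon and swap ratio are illustrative assumptions, not the authors' construction.

```python
# Hedged sketch of building a code-switched sample: randomly replace a
# fraction of Chinese words with English translations. The bilingual
# lexicon and swap ratio are illustrative, not from the paper.
import random

ZH_EN = {"我": "I", "喜欢": "like", "苹果": "apples"}  # toy lexicon

def code_switch(tokens, ratio=0.5, seed=0):
    """Return tokens with ~ratio of translatable words switched to English."""
    rng = random.Random(seed)
    return [ZH_EN[t] if t in ZH_EN and rng.random() < ratio else t
            for t in tokens]

print(code_switch(["我", "喜欢", "苹果"]))  # e.g. ['I', '喜欢', 'apples']
```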
Table 1: Model response quality evaluation results for 13 low-resource languages on LLM-Eval. ACC., F., LC., H., INFO., and AVG. denote accuracy, fluency, logical coherence, harmlessness, informativeness, and average, respectively.
In summary, the researchers focus on effectively transferring language-generation and instruction-following capabilities to languages other than English. Specifically, they conduct an extensive empirical study of how much vocabulary extension and what training scale are actually needed for effective transfer. They find that vocabulary extension is unnecessary and that transfer performance comparable to state-of-the-art models can be achieved with less than 1% of the additional pre-training data. Similar results hold in the extension experiments on the 13 low-resource languages.
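For readers curious what such low-budget further pre-training might look like in practice, here is a minimal sketch using the Hugging Face Trainer; the checkpoint, corpus file, and hyperparameters are assumptions for illustration, not the authors' recipe.

```python
# Minimal sketch of further (continued) pre-training on a small
# target-language corpus. Checkpoint, corpus file, and hyperparameters
# are assumptions, not the paper's exact settings.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA defines no pad token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# zh_corpus.txt stands in for the (small) target-language corpus.
ds = load_dataset("text", data_files={"train": "zh_corpus.txt"})["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-zh",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```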
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our 35k+ ML SubReddit, 41k+ Facebook community, Discord Channel, and LinkedIn Group.
If you like our work, you'll love our newsletter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.