Large language models (LLMs) require large datasets of prompts paired with user requests and correct responses in order to learn to understand and generate human-like text. Immense efforts have been made to develop such datasets in English, while other languages, notably Arabic, have received far less attention. This imbalance in data availability severely restricts the applicability of LLMs in non-English-speaking regions and represents a critical gap in the natural language domain.
The research challenge this article addresses is the lack of high-quality Arabic prompt datasets for training LLMs to perform well in Arabic. Without such data, LLMs cannot effectively understand and generate Arabic text, and they offer far less utility to Arabic-speaking users. This matters because Arabic is one of the most widely spoken languages in the world, yet it remains under-resourced, meaning current AI technologies under-serve a large fraction of humanity. The complexity of Arabic, with its rich morphology and many dialects, adds further difficulty: considerable work is needed to develop prompt templates that represent the language appropriately. Creating a large, high-quality dataset for Arabic is therefore important for extending the utility of LLMs to a wider audience.
Current approaches to building prompt datasets are geared primarily toward English and involve either writing prompts manually or generating them with tools applied to existing datasets. For example, PromptSource and Super-NaturalInstructions have made millions of prompts available for English LLMs. However, these methods have not been adapted at scale for other languages, so resources for training LLMs in languages other than English remain very scarce. The limited availability of prompt datasets in languages such as Arabic has hampered LLMs' ability to excel in those languages, underscoring the need for more focused dataset-creation efforts.
Researchers at aiXplain Inc. have introduced two methods for creating large-scale Arabic prompt datasets to address this problem. The first translates existing English prompt datasets into Arabic using a machine translation system, followed by a rigorous quality assessment step. It relies on state-of-the-art machine translation and quality estimation tools to ensure the translated prompts remain accurate; applying these filters, the researchers retained approximately 20% of the translated prompts, yielding a dataset of around 20 million high-quality Arabic prompts. The second method creates new prompts directly from existing Arabic NLP datasets, using a prompt-harvesting tool to generate prompts for 78 publicly available Arabic datasets covering tasks such as question answering, summarization, and hate speech detection. More than 67.4 million prompts were created through this process, significantly expanding the resources available for training LLMs in Arabic.
The translation-based approach follows an end-to-end data processing pipeline: English prompts are split into sentences, which are then translated into Arabic with a neural machine translation model. A reference-free machine translation quality estimation model then assigns each sentence a quality score, and prompts are retained only if they meet a set quality threshold, keeping the final dataset highly accurate. Manual verification of a random sample of prompts further raises the dataset's quality. In the second approach, prompts are generated directly: PromptSource is used to create multiple templates for each task in the Arabic datasets, enabling diverse and contextually relevant prompts that are desirable for training effective language models.
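As a rough illustration, the translate-score-filter loop described above might look like the following sketch. The `translate` and `estimate_quality` callables are hypothetical stand-ins for the neural MT model and the reference-free QE model, and the 0.8 threshold is an assumed value, not the paper's actual cutoff.

```python
# Sketch of the translate-then-filter pipeline. The translation and
# quality-estimation callables are injected so that real models (an
# English->Arabic NMT system and a reference-free QE scorer) can be
# plugged in; the defaults here are illustrative only.

def build_filtered_dataset(prompts, translate, estimate_quality, threshold=0.8):
    """Translate each prompt and keep only translations scoring >= threshold.

    threshold=0.8 is an assumed default; the paper tunes its own cutoff.
    """
    kept = []
    for prompt in prompts:
        translation = translate(prompt)
        # Reference-free QE: score the (source, translation) pair directly,
        # with no gold Arabic reference available.
        score = estimate_quality(prompt, translation)
        if score >= threshold:
            kept.append((prompt, translation, score))
    return kept
```

In the article's pipeline, a filter of this shape retained roughly 20% of the translated prompts, trading volume for accuracy.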
The researchers then used these new prompts to fine-tune a 7-billion-parameter open LLM, the Qwen2 7B model. The fine-tuned model was tested on several benchmarks and significantly improved its handling of Arabic prompts, outperforming a state-of-the-art 70-billion-parameter instruction-tuned model, Llama3 70B. Specifically, the Qwen2 7B model fine-tuned on just 800,000 prompts achieved a ROUGE-L score of 0.184, while the model fine-tuned on 8 million prompts achieved 0.224. These results highlight the effectiveness of the newly developed prompt datasets and show that fine-tuning on larger datasets improves model performance.
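For readers unfamiliar with the metric above, ROUGE-L scores a candidate text against a reference by their longest common subsequence (LCS) of tokens. The compact F1 variant below is a hand-rolled sketch for intuition; published evaluations typically rely on an established implementation rather than code like this.

```python
# Minimal ROUGE-L (F1) sketch: precision and recall are both derived from
# the length of the longest common subsequence of the two token lists.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (a simplification; real Arabic
    evaluation would need proper tokenization)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

On this scale, the reported jump from 0.184 to 0.224 reflects model outputs whose longest common subsequences with the references grew meaningfully as the fine-tuning set expanded.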
In a nutshell, this research addresses a serious problem: the scarcity of Arabic prompt datasets for training large language models. By introducing two new ways of creating such datasets, it has opened up resources for training LLMs in Arabic. Fine-tuning the Qwen2 7B model on the newly generated prompts produces a model that outperforms a far larger instruction-tuned model, setting a strong baseline for Arabic LLMs. This points to the value of developing robust, scalable methods for creating datasets in languages other than English.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among the public.