The fields of artificial intelligence and machine learning depend heavily on data. Data pours in from sources such as social media, healthcare, and finance, and much of it is useful for natural language processing (NLP) applications. Yet even with so much data, readily usable data for training an NLP model on a particular task is scarce. Finding high-quality, useful data and filtering it well is difficult. For NLP models in particular, the lack of data for most languages is a limitation that hinders progress on NLP for under-represented languages (ULs).
Emerging tasks such as news summarization, sentiment analysis, question answering, and building virtual assistants depend heavily on the availability of data in high-resource languages. They also rely on technologies such as language identification, automatic speech recognition (ASR), and optical character recognition (OCR), which are mostly unavailable for under-represented languages. To overcome this, it is important to build datasets and evaluate models on tasks that would benefit UL speakers.
Recently, a team of researchers from Google AI proposed a benchmark called XTREME-UP (Under-Represented and User-Centric with Paucal Data) that evaluates multilingual models on user-centric tasks in a few-shot learning setting. It focuses on activities that technology users perform in their daily lives, such as information access and the input/output activities that enable other technologies. Three features set XTREME-UP apart: its use of scarce data, its user-centric design, and its focus on under-represented languages.
With XTREME-UP, the researchers introduce a standardized multilingual in-language fine-tuning setting in place of the conventional cross-lingual zero-shot setting. This approach considers the amount of data that can be generated or annotated in an 8-hour period for a particular language, with the aim of giving ULs a more useful evaluation setup.
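The 8-hour budget can be read as simple arithmetic: the number of examples available for a language is the time budget divided by the time it takes to produce one example. The sketch below illustrates that idea only. The function name and the per-example annotation times are hypothetical and are not taken from the benchmark.

```python
def examples_within_budget(seconds_per_example: float, budget_hours: float = 8.0) -> int:
    """Return how many examples fit into an annotation time budget."""
    if seconds_per_example <= 0:
        raise ValueError("seconds_per_example must be positive")
    budget_seconds = budget_hours * 3600
    # Floor division: a partially annotated example does not count.
    return int(budget_seconds // seconds_per_example)


# Hypothetical per-example annotation times in seconds (illustrative only).
hypothetical_costs = {"transliteration": 30.0, "question_answering": 120.0}

budgets = {
    task: examples_within_budget(cost)
    for task, cost in hypothetical_costs.items()
}

for task, n_examples in budgets.items():
    print(f"{task}: up to {n_examples} examples in 8 hours")
```

Under these assumed rates, a task that is four times slower to annotate gets a quarter of the training examples, which is why a fixed time budget yields different dataset sizes per task.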
XTREME-UP assesses the performance of language models across 88 under-represented languages on 9 key user-centric technologies, including automatic speech recognition (ASR), optical character recognition (OCR), machine translation (MT), and information access tasks that have general utility. The researchers created new datasets for tasks such as OCR, autocomplete, semantic parsing, and transliteration to assess the capabilities of language models. They also improved and refined existing datasets for other tasks in the same benchmark.
One of XTREME-UP's key capabilities is evaluating various modeling scenarios, including text-only and multi-modal settings with visual, audio, and text inputs. It also supports both supervised parameter tuning and in-context learning, allowing a comprehensive evaluation of different modeling approaches. The tasks in XTREME-UP involve enabling access to language technology, enabling access to information as part of a larger system (such as question answering, information retrieval, and virtual assistants), and then making that information accessible in the speaker's language.
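To make the in-context learning setting concrete, here is a minimal sketch of how a handful of labeled examples can be packed into a prompt instead of being used for parameter updates. The `Input:`/`Output:` template, the function name, and the placeholder strings are assumptions made for illustration; they are not the prompt format used by XTREME-UP.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    instruction: str = "",
) -> str:
    """Assemble a few-shot prompt from (input, output) pairs and a new query."""
    parts = []
    if instruction:
        parts.append(instruction)
    # Each demonstration shows the model one solved instance of the task.
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final block is left open for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Placeholder demonstrations; real ones would come from a task's training split.
demo_examples = [
    ("source text 1", "target text 1"),
    ("source text 2", "target text 2"),
]

prompt = build_few_shot_prompt(
    demo_examples,
    query="new source text",
    instruction="Transliterate the input.",
)
print(prompt)
```

In the fine-tuning setting the same pairs would instead be used to update model weights, so a benchmark that supports both settings lets the two approaches be compared on the same small set of examples.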
In summary, XTREME-UP is a benchmark that addresses the challenge of data scarcity in highly multilingual NLP systems. It provides a standardized evaluation framework for under-represented languages and looks useful for future NLP research and development.
Check out the Paper and GitHub.
Tanya Malhotra is a final-year student at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.