Google AI Introduces FRMT: A New Dataset and Evaluation Benchmark for Few-Shot Region-Aware Machine Translation

In recent years, machine translation (MT) has made great strides, with outstanding results for many language pairs, particularly those with large amounts of parallel data available. However, MT is typically framed at the coarse level of a language (such as Spanish or Hindi), although some earlier work has addressed finer distinctions, such as regional varieties of Arabic or specific levels of politeness in German. Unfortunately, most existing methods for style-targeted translation rely on large, labeled training corpora, which are often unavailable or too expensive to create.

Newly published research by Google introduces Few-shot Region-aware Machine Translation (FRMT), a few-shot translation benchmark that assesses an MT model's ability to translate into regional varieties using no more than 100 labeled examples of each language variety.

MT models must use the linguistic patterns highlighted in this small number of labeled examples ("exemplars") to identify similar patterns in their unlabeled training examples. This allows the models to generalize, correctly translating phenomena that are not present in the exemplars.
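
As a rough illustration of the few-shot setup, the sketch below assembles a prompt from a handful of exemplars and one new source sentence. The prompt layout, the `build_prompt` helper, and the example sentence are assumptions made for illustration; they are not taken from the FRMT paper or its code.

```python
from typing import List, Tuple

# The benchmark allows no more than 100 labeled examples per language variety.
MAX_EXEMPLARS = 100


def build_prompt(region: str,
                 exemplars: List[Tuple[str, str]],
                 source_sentence: str) -> str:
    """Assemble a few-shot prompt from (English, regional translation) pairs.

    The "English: ... / <region>: ..." layout is a hypothetical format.
    """
    if len(exemplars) > MAX_EXEMPLARS:
        raise ValueError("too many exemplars for the few-shot setting")
    lines = []
    for english, translation in exemplars:
        lines.append(f"English: {english}")
        lines.append(f"{region}: {translation}")
    # The final entry leaves the translation blank for the model to complete.
    lines.append(f"English: {source_sentence}")
    lines.append(f"{region}:")
    return "\n".join(lines)


# Illustrative exemplar only; "ônibus" is the Brazilian Portuguese word for bus.
exemplars = [("The bus is late.", "O ônibus está atrasado.")]
prompt = build_prompt("Brazilian Portuguese", exemplars, "Where is the bus stop?")
print(prompt)
```

A model that generalizes well would be expected to pick up the regional word choice from the exemplar and apply it to the new sentence.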

The FRMT dataset consists of partial English Wikipedia articles, taken from the Wiki40b dataset, that have been translated into regional varieties of Portuguese and Mandarin. The team built the dataset from three content buckets to highlight the most significant region-aware translation challenges:

Lexical: The lexical bucket focuses on word choices that vary by region. The team manually gathered between 20 and 30 terms that have regionally distinctive translations, then filtered and verified the translations with input from volunteer native speakers from each region. Starting from the final list of English terms, they extracted texts of up to 100 sentences each from the corresponding English Wikipedia articles (e.g., the article on "bus"). The same procedure was carried out independently for Mandarin.
Entity: The entity bucket is filled with people, places, or other entities strongly connected to one of the two regions in question for a particular language.
Random: The random bucket contains text from 100 randomly selected articles from Wikipedia's "Featured" and "Good" collections. It is used to check that a model correctly handles a diverse range of other phenomena.

To ensure that the translations in the FRMT dataset accurately capture region-specific phenomena, the researchers carried out a human evaluation of their quality. Expert annotators from each region used the Multidimensional Quality Metrics (MQM) framework to find and classify translation errors. The framework includes a category-wise weighting scheme that combines the identified errors into a single score, which roughly represents the number of major errors per sentence, so a lower score indicates a better translation.
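
A minimal sketch of this kind of weighted scoring is shown below. The severity labels and weights are hypothetical, chosen only so that one major error counts as 1.0; the actual MQM categories and weights used by the researchers are not given in this article.

```python
from typing import Dict, List

# Hypothetical weights (an assumption, not the paper's scheme):
# one major error counts as 1.0, a minor error as a fraction of that.
SEVERITY_WEIGHTS: Dict[str, float] = {"major": 1.0, "minor": 0.2}


def mqm_score(errors_per_sentence: List[List[str]]) -> float:
    """Average weighted error count per sentence (lower is better)."""
    if not errors_per_sentence:
        raise ValueError("need at least one sentence")
    total = sum(SEVERITY_WEIGHTS[severity]
                for sentence in errors_per_sentence
                for severity in sentence)
    return total / len(errors_per_sentence)


# Made-up annotations: one list of error severities per translated sentence.
annotations = [["major"], [], ["minor", "minor"], ["major", "minor"]]
score = mqm_score(annotations)
print(f"{score:.2f}")
```

With weights scaled this way, the score can be read approximately as "major errors per sentence", which matches the interpretation described above.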

The researchers asked the MQM raters to evaluate both the translations for their own region and the translations for the other region of their language. In both Portuguese and Mandarin, raters noticed, on average, about two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that the proposed dataset captures region-specific phenomena.
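
The matched-versus-mismatched comparison can be sketched as a simple difference of means. The field names, the region labels "A" and "B", and the scores below are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]

# Made-up per-sentence MQM scores (lower is better); not data from the paper.
ratings: List[Row] = [
    {"rater_region": "A", "target_region": "A", "score": 1.0},
    {"rater_region": "A", "target_region": "B", "score": 3.0},
    {"rater_region": "B", "target_region": "B", "score": 0.5},
    {"rater_region": "B", "target_region": "A", "score": 2.5},
]


def mismatch_gap(rows: List[Row]) -> float:
    """Mean score of mismatched translations minus mean score of matched ones."""
    matched = [r["score"] for r in rows
               if r["rater_region"] == r["target_region"]]
    mismatched = [r["score"] for r in rows
                  if r["rater_region"] != r["target_region"]]
    return mean(mismatched) - mean(matched)


gap = mismatch_gap(ratings)
print(gap)
```

A clearly positive gap means raters penalize translations written for the other region, which is the signal the researchers looked for.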

Human evaluation is the best way to assess model quality, but it is often time-consuming and expensive. The researchers therefore examined chrF, BLEU, and BLEURT to identify an existing automatic metric that others can use to evaluate their models on the proposed benchmark. Using translations from a few baseline models that were also rated by the MQM raters, they found that BLEURT correlates best with human judgments and that the strength of that correlation is comparable to inter-annotator consistency.
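
One way to compare candidate metrics against human judgments is to compute a correlation coefficient for each and keep the strongest. The sketch below uses Pearson correlation with placeholder metric names and made-up scores; it does not reproduce the paper's numbers or its exact procedure.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for four system outputs. MQM is lower-is-better, while the
# placeholder automatic metrics are higher-is-better, so a useful metric
# shows a strongly negative correlation with MQM.
human_mqm = [0.5, 1.0, 2.0, 3.5]
metric_scores: Dict[str, List[float]] = {
    "metric_a": [0.9, 0.8, 0.5, 0.3],
    "metric_b": [0.7, 0.4, 0.8, 0.5],
}

correlations = {name: pearson(scores, human_mqm)
                for name, scores in metric_scores.items()}
# Rank by absolute value so the sign convention does not matter.
best = max(correlations, key=lambda name: abs(correlations[name]))
print(best)
```

Comparing absolute values keeps the selection independent of whether a metric counts errors or rewards quality.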

The team hopes that their work will help the research community create new MT models that better serve underrepresented language varieties and all speaker communities, ultimately leading to more inclusive natural language technology.

Check out the Paper, GitHub, and Reference article. All credit for this research goes to the researchers on this project.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.