Many languages spoken around the world encompass numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese, or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, important differences still exist. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This can lead to confusion if the system generates the “wrong” variety or mixes varieties unnaturally. In addition, region-agnostic MT systems tend to favor whichever variety has the most data available online, which disproportionately affects speakers of less-represented language varieties.
In “FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation”, accepted for publication in Transactions of the Association for Computational Linguistics, we present an evaluation dataset for measuring the ability of MT systems to support regional varieties, through case studies on Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin Chinese. With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the many regional varieties of languages spoken around the world.
The challenge: few-shot generalization
Most modern MT systems are trained on millions or billions of example translations, such as an English input sentence and its corresponding Portuguese translation. However, the vast majority of available training data does not specify which regional variety the translation is in. In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. MT models need to use the linguistic patterns showcased in the small number of labeled examples (called “exemplars”) to identify similar patterns in their unlabeled training examples. In this way, models can generalize, producing correct translations of phenomena not explicitly shown in the exemplars.
An illustration of a few-shot MT system translating the English sentence, “The bus has arrived,” into two regional varieties of Portuguese: Brazilian (🇧🇷; left) and European (🇵🇹; right).
Few-shot MT approaches are attractive because they make it much easier to add support for additional regional varieties to an existing system. Although our work is specific to regional varieties of two languages, we anticipate that methods that work well will be easily applicable to other languages and regional varieties. In principle, those methods should also work for other linguistic distinctions, such as formality and style.
Data collection
The FRMT dataset consists of partial English Wikipedia articles, sourced from the Wiki40B dataset, that have been translated by paid, professional translators into different regional varieties of Portuguese and Mandarin. To highlight key region-aware translation challenges, we designed the dataset using three content buckets: (1) Lexical, (2) Entity, and (3) Random, described below (with a data-loading sketch after the list).
- The Lexical bucket focuses on regional differences in word choice, such as the “ônibus” vs. “autocarro” distinction when translating a sentence containing the word “bus” into Brazilian vs. European Portuguese, respectively. We manually collected 20–30 terms that have regionally distinctive translations based on blogs and educational websites, and filtered and vetted the translations with feedback from volunteer native speakers of each region. Given the resulting list of English terms, we extracted texts of up to 100 sentences each from the associated English Wikipedia articles (e.g., the article for bus). The same process was carried out independently for Mandarin.
- The Entity bucket is populated in a similar way and concerns people, locations, or other entities strongly associated with one of the two regions in question for a given language. Consider an illustrative sentence like, “In Lisbon, I often took the bus.” In order to translate this correctly into Brazilian Portuguese, a model must overcome two potential pitfalls:
- The strong geographical association between Lisbon and Portugal could influence a model to generate a European Portuguese translation instead, for example, by selecting “autocarro” rather than “ônibus”.
- Replacing “Lisbon” with “Brasília” might be a naive way for a model to localize its output toward Brazilian Portuguese, but it would be semantically inaccurate, even in an otherwise fluent translation.
- The Random bucket is used to check that a model correctly handles other diverse phenomena, and consists of text from 100 articles randomly sampled from Wikipedia’s “featured” and “good” collections.
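For concreteness, below is a minimal sketch of how benchmark data organized this way might be loaded for evaluation. The directory layout, file names, and TSV format here are illustrative assumptions, not the actual release format; see the released data and evaluation code for the real structure.

```python
import csv
from pathlib import Path

# Hypothetical layout: one tab-separated file per (bucket, target variety),
# each row holding an English source sentence and its reference translation.
BUCKETS = ["lexical", "entity", "random"]
VARIETIES = ["pt-BR", "pt-PT", "zh-CN", "zh-TW"]

def load_bucket(data_dir: str, bucket: str, variety: str):
    """Return a list of (source, reference) pairs for one bucket and variety."""
    path = Path(data_dir) / f"{bucket}_{variety}.tsv"  # assumed naming scheme
    with path.open(encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

# Example: gather all Brazilian Portuguese references across buckets.
# data = {b: load_bucket("frmt_data", b, "pt-BR") for b in BUCKETS}
```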
Evaluation methodology
To verify that the translations collected for the FRMT dataset capture region-specific phenomena, we conducted a human evaluation of their quality. Expert annotators from each region used the Multidimensional Quality Metrics (MQM) framework to identify and categorize errors in the translations. The framework includes a category-wise weighting scheme to convert the identified errors into a single score that roughly represents the number of major errors per sentence; thus a lower number indicates a better translation. For each region, we asked MQM raters to score both translations from their region and translations from their language’s other region. For example, Brazilian Portuguese raters scored both the Brazilian and the European Portuguese translations. The difference between these two scores indicates the prevalence of linguistic phenomena that are acceptable in one variety but not in the other. We found that, in both Portuguese and Chinese, raters identified, on average, approximately two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that our dataset truly does capture region-specific phenomena.
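To illustrate the kind of category weighting MQM relies on, here is a minimal sketch of converting annotated errors into a per-sentence score. The severity labels and weights shown are illustrative assumptions; the exact scheme used by our raters follows the MQM framework cited above.

```python
from collections import Counter

# Illustrative severity weights; real MQM setups assign their own values,
# with major errors penalized much more heavily than minor ones.
ERROR_WEIGHTS = {
    "major": 5.0,
    "minor": 1.0,
    "minor/punctuation": 0.1,
}

def mqm_score(errors_per_sentence):
    """Average weighted error score over sentences (lower is better).

    `errors_per_sentence` is a list where each element is the list of
    severity labels annotators assigned to that sentence.
    """
    totals = []
    for errors in errors_per_sentence:
        counts = Counter(errors)
        totals.append(sum(ERROR_WEIGHTS.get(sev, 0.0) * n for sev, n in counts.items()))
    return sum(totals) / len(totals) if totals else 0.0

# Example: two sentences, one error-free and one with a major and a minor error.
# print(mqm_score([[], ["major", "minor"]]))  # -> 3.0
```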
While human evaluation is the gold standard for ensuring model quality, it is often slow and expensive. We therefore wanted to find an existing automatic metric that researchers could use to evaluate their models on our benchmark, and we considered chrF, BLEU, and BLEURT. Using translations from a few baseline models that were also evaluated by our MQM raters, we found that BLEURT has the best correlation with human judgments, and that the strength of that correlation (0.65 Pearson’s correlation coefficient, ρ) is comparable to the inter-annotator consistency (0.70 intraclass correlation).
Metric | Pearson’s ρ
chrF | 0.48
BLEU | 0.58
BLEURT | 0.65
Correlation of different automatic metrics with human judgments of translation quality on a subset of FRMT. Values are between -1 and 1; higher is better.
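As a rough sketch of how such a comparison can be reproduced, the snippet below scores candidate translations with the open-source BLEURT package and correlates the segment-level scores with human MQM scores. The checkpoint path and the example lists are placeholders, and this is not the exact evaluation script used for the benchmark.

```python
from bleurt import score as bleurt_score  # pip install from google-research/bleurt
from scipy.stats import pearsonr

# Placeholder checkpoint and data; substitute a real BLEURT checkpoint and real segments.
scorer = bleurt_score.BleurtScorer("path/to/bleurt_checkpoint")

references = ["reference translation 1", "reference translation 2", "reference translation 3"]
candidates = ["system output 1", "system output 2", "system output 3"]
mqm_scores = [0.0, 1.0, 5.0]  # human MQM scores for the same segments (lower = better)

bleurt_scores = scorer.score(references=references, candidates=candidates)

# MQM counts errors (lower = better) while BLEURT is higher = better,
# so negate the MQM scores before correlating the two.
rho, _ = pearsonr(bleurt_scores, [-s for s in mqm_scores])
print(f"Pearson correlation with human judgments: {rho:.2f}")
```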
System performance
We evaluated a handful of recent models capable of few-shot control. Based on human evaluation with MQM, all of the baseline methods exhibited some ability to localize their output for Portuguese, but for Mandarin, they mostly failed to use knowledge of the target region to produce better Mainland Mandarin or Taiwan Mandarin translations.
Google’s recent language model, PaLM, scored best overall among the baselines we evaluated. In order to produce region-targeted translations with PaLM, we feed an instructional prompt into the model and then generate text from it to fill in the blank (see the example below).
Translate the following texts from English to European Portuguese.
English: [English example 1].
European Portuguese: [correct translation 1].
...
English: [input].
European Portuguese: _____
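A minimal sketch of assembling a prompt of this form from a handful of exemplars is shown below. The helper name and exemplar data are hypothetical, and the exact prompt formatting we used may differ slightly.

```python
def build_fewshot_prompt(exemplars, source, variety="European Portuguese"):
    """Assemble a fill-in-the-blank prompt from (English, translation) exemplars.

    `exemplars` is a small list of labeled pairs in the target variety
    (the FRMT setting allows at most 100 per variety).
    """
    lines = [f"Translate the following texts from English to {variety}."]
    for english, translation in exemplars:
        lines.append(f"English: {english}")
        lines.append(f"{variety}: {translation}")
    lines.append(f"English: {source}")
    lines.append(f"{variety}: ")  # the model continues generating from here
    return "\n".join(lines)

# Hypothetical usage with a single exemplar:
# prompt = build_fewshot_prompt(
#     [("The bus has arrived.", "O autocarro chegou.")],
#     "Where is the bus stop?",
# )
```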
PaLM performed well with just a single exemplar, and showed marginal quality gains in Portuguese when this was increased to ten exemplars. This performance is impressive considering that PaLM was trained in an unsupervised way. Our results also suggest that language models like PaLM may be particularly adept at memorizing the region-specific word choices required for fluent translation. However, there is still a significant performance gap between PaLM and human performance. See our paper for more details.
MQM performance across dataset buckets for human and PaLM translations. Thick bars represent the region-matched case, where raters from each region evaluate translations targeted at their own region. Thin, inset bars represent the region-mismatched case, where raters from each region evaluate translations targeted at the other region. Human translations exhibit regional phenomena in all cases. PaLM translations do so only for the Portuguese buckets and the Mandarin Lexical bucket.
Conclusion
In the near future, we hope to see a world where language generation systems, and machine translation in particular, can support all speaker communities. We want to meet users where they are, generating language that is fluent and appropriate for their locale or region. To that end, we have released the FRMT dataset and benchmark, enabling researchers to easily compare the performance of region-aware MT models. Validated via our thorough human evaluation studies, the language varieties in FRMT have significant differences that the outputs of region-aware MT models should reflect. We are excited to see how researchers use this benchmark to develop new MT models that better support underrepresented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.
Acknowledgements
We thank our paper co-authors for all their contributions to this project: Timothy Dozat, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant. For helpful discussion and comments on the paper, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes, and Mingfei Lau. For essential feedback about specific regional language differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues, and Linting Xue. For logistical support in collecting the human translations and ratings, we thank the Google Translate team. We thank the professional translators and MQM raters for their role in producing the dataset. We also thank Tom Small for providing the animation in this post.