100+ new metrics since 2010
An evaluation with automatic metrics has the advantage of being faster, more reproducible, and cheaper than an evaluation conducted by humans.
This is especially true for the evaluation of machine translation. For a human evaluation, we would ideally need expert translators.
For many language pairs, such experts are extremely rare and difficult to hire.
A large-scale and fast manual evaluation, as required by the very dynamic research area of machine translation to evaluate new systems, is often impractical.
Consequently, automatic evaluation for machine translation has been a very active, and productive, research area for more than 20 years.
While BLEU remains by far the most used evaluation metric, there are countless better alternatives.
Since 2010, 100+ automatic metrics have been proposed to improve machine translation evaluation.
In this article, I present the most popular metrics that are used as alternatives, or in addition, to BLEU. I group them into two categories, traditional and neural metrics, each with different advantages.
Most automatic metrics for machine translation only require:
- The translation hypothesis generated by the machine translation system to evaluate
- At least one reference translation produced by humans
- (Rarely) the source text translated by the machine translation system
Here is an example of a French-to-English translation:
- Source text:
Le chat dort dans la cuisine donc tu devrais cuisiner ailleurs.
- Translation hypothesis (generated by machine translation):
The cat sleeps in the kitchen so cook somewhere else.
- Reference translation (produced by a human translator):
The cat is sleeping in the kitchen, so you should cook somewhere else.
The translation hypothesis and the reference translation are both translations of the same source text.
The objective of the automatic metric is to yield a score that can be interpreted as a distance between the translation hypothesis and the reference translation. The smaller the distance, the closer the system is to generating a translation of human quality.
The absolute score returned by a metric is usually not interpretable alone. It is almost always used to rank machine translation systems. A system with a better score is a better system.
In one of my studies (Marie et al., 2021), I showed that almost 99% of the research papers in machine translation rely on the automatic metric BLEU to evaluate translation quality and rank systems, while more than 100 other metrics have been proposed during the last 12 years. Note: I looked only at research papers published from 2010 by the ACL. Potentially many more metrics have been proposed to evaluate machine translation.
Here is a non-exhaustive list of 106 metrics proposed from 2010 to 2020:
Noun-phrase chunking, SemPOS refinement, mNCD, RIBES, extended METEOR, Badger 2.0, ATEC 2.1, DCU-LFG, LRKB4, LRHB4, I-letter-BLEU, I-letter-recall, SVM-RANK, TERp, IQmt-DR, BEwT-E, Bkars, SEPIA, MEANT, AM-FM, AMBER, F15, MTeRater, MP4IBM1, ParseConf, ROSE, TINE, TESLA-CELAB, PORT, lexical cohesion, pFSM, pPDA, HyTER, SAGAN-STS, SIMPBLEU, SPEDE, TerrorCAT, BLOCKERRCATS, XENERRCATS, PosF, TESLA, LEPOR, ACTa, DEPREF, UMEANT, LogRefSS, discourse-based, XMEANT, BEER, SKL, AL-BLEU, LBLEU, APAC, RED-*, DiscoTK-*, ELEXR, LAYERED, Parmesan, tBLEU, UPC-IPA, UPC-STOUT, VERTa-*, pairwise neural, neural representation-based, ReVal, BS, LeBLEU, chrF, DPMF, Dreem, Ratatouille, UoW-LSTM, UPF-Colbat, USAAR-ZWICKEL, CharacTER, DepCheck, MPEDA, DTED, meaning features, BLEU2VEC_Sep, Ngram2vec, MEANT 2.0, UHH_TSKM, AutoDA, TreeAggreg, BLEND, HyTERA, RUSE, ITER, YiSi, BERTr, EED, WMDO, PReP, cross-lingual similarity+target language model, XLM+TLM, Prism, COMET, PARBLEU, PARCHRF, MEE, BLEURT, BAQ-*, OPEN-KIWI-*, BERT, mBERT, EQ-*
Most of these metrics have been shown to be better than BLEU, but are almost never used. In fact, only 2 (1.8%) of these metrics, RIBES and chrF, have been used in more than two research publications (among the 700+ publications that I checked). Since 2010, the most used metrics have remained metrics proposed before 2010: BLEU, TER, and METEOR.
Most of the metrics created after 2016 are neural metrics. They rely on neural networks and the most recent ones even rely on the very popular pre-trained language models.
In contrast, traditional metrics published earlier are simpler and cheaper to run. They remain extremely popular for various reasons, and this popularity doesn’t seem to be declining, at least in research.
In the following sections, I introduce several metrics selected according to their popularity, their originality, or their correlation with human evaluation.
Traditional metrics for machine translation evaluation can be seen as metrics that evaluate the distance between two strings simply based on the characters they contain.
These two strings are the translation hypothesis and the reference translation. Note: Typically, traditional metrics don’t exploit the source text translated by the system.
WER (Word Error Rate), the ancestor of BLEU, was one of the most used of these metrics before BLEU took over in the early 2000s.
Advantages:
- Low computational cost: Most traditional metrics rely on the efficiency of string matching algorithms run at character and/or token levels. Some metrics do need to perform some shifting of tokens which can be more costly, particularly for long translations. Nonetheless, their computation is easily parallelizable and doesn’t require a GPU.
- Explainable: Scores are usually easy to compute by hand for small segments, which facilitates analysis. Note: “Explainable” doesn’t mean “interpretable”: we can explain exactly how a metric score is computed, but the score alone can’t be interpreted since it usually tells us nothing about the translation quality on its own.
- Language independent: Except for a few particular metrics, the same metric algorithms can be applied independently of the language of the translation.
Disadvantages:
- Poor correlation with human judgments: This is their main disadvantage against neural metrics. To get the best estimation of the quality of a translation, traditional metrics shouldn’t be used.
- Require particular preprocessing: Except for one metric (chrF), all the traditional metrics I present in this article require the evaluated segments, and their reference translations, to be tokenized. Tokenization isn’t embedded in the metric, i.e., it has to be performed by the user with external tools. The scores obtained are then dependent on a particular tokenization that may not be reproducible.
BLEU
This is the most popular metric. It is used by almost 99% of the machine translation research publications.
I already presented BLEU in one of my previous articles.
BLEU is a metric with many well-identified flaws.
What I didn’t discuss in my two articles about BLEU is the many variants of BLEU.
When reading research papers, you may find metrics denoted BLEU-1, BLEU-2, BLEU-3, and so on. The number after the hyphen is usually the maximum length of the n-grams of tokens used to compute the score.
For instance, BLEU-4 is a BLEU computed by taking {1,2,3,4}-grams of tokens into account. In other words, BLEU-4 is the typical BLEU computed in most machine translation papers, as originally proposed by Papineni et al. (2002).
BLEU is a metric that requires a lot of statistics to be accurate. It doesn’t work well on short texts, and may even yield an error if computed on a translation that doesn’t match any 4-gram of the reference translation.
Since evaluating translation quality at the sentence level may be necessary in some applications or for analysis, a variant denoted sentence BLEU, sBLEU, or sometimes BLEU+1 can be used. It avoids such errors. There are many variants of BLEU+1. The most popular ones are described by Chen and Cherry (2014).
As we will see with neural metrics, BLEU+1 has many better alternatives and shouldn’t be used.
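For illustration, here is a minimal sketch of computing corpus-level BLEU and sentence-level BLEU (sBLEU) with the SacreBLEU Python package, using the example segments from the introduction. Consider it a sketch of SacreBLEU’s API rather than a definitive reference:

```python
# Minimal sketch: corpus-level BLEU and sentence-level BLEU with SacreBLEU.
# pip install sacrebleu
import sacrebleu

hypotheses = ["The cat sleeps in the kitchen so cook somewhere else."]
references = [["The cat is sleeping in the kitchen, so you should cook somewhere else."]]

# Corpus-level BLEU (the "BLEU-4" reported in most papers).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)

# Sentence-level BLEU: applies smoothing so that a short segment without any
# 4-gram match against the reference still gets a usable score.
sbleu = sacrebleu.sentence_bleu(hypotheses[0], references[0])
print(sbleu.score)
```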
chrF(++)
chrF (Popović, 2015) is the second most popular metric for machine translation evaluation.
It has been around since 2015 and has since been increasingly used in machine translation publications.
It has been shown to better correlate with human judgment than BLEU.
In addition, chrF is tokenization independent. This is the only metric with this feature that I know of. Since it doesn’t require any prior custom tokenization by some external tool, it is one of the best metrics to ensure the reproducibility of an evaluation.
chrF relies exclusively on characters. Spaces are ignored by default.
chrF++ (Popović, 2017) is a variant of chrF that better correlates with human evaluation but at the cost of its tokenization independence. Indeed, chrF++ exploits spaces to take into account word order, hence its better correlation with human evaluation.
I do strongly recommend the use of chrF when I review machine translation papers for conferences and journals to make an evaluation more reproducible, but not chrF++ due to its tokenization dependency.
Note: Be wary when you read a research work using chrF. Authors often confuse chrF and chrF++. They may also cite the chrF paper when using chrF++, and vice versa.
The original implementation of chrF by Maja Popović is available on github.
You can also find an implementation in SacreBLEU (Apache 2.0 license).
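For illustration, here is a minimal sketch of computing chrF and chrF++ with SacreBLEU’s Python API; the word_order parameter is what switches from chrF to chrF++, as far as I understand the implementation:

```python
# Minimal sketch: chrF and chrF++ with SacreBLEU.
# pip install sacrebleu
from sacrebleu.metrics import CHRF

hypotheses = ["The cat sleeps in the kitchen so cook somewhere else."]
references = [["The cat is sleeping in the kitchen, so you should cook somewhere else."]]

chrf = CHRF()                 # chrF: character n-grams only, tokenization independent
chrf_pp = CHRF(word_order=2)  # chrF++: additionally uses word 1-grams and 2-grams

print(chrf.corpus_score(hypotheses, references).score)
print(chrf_pp.corpus_score(hypotheses, references).score)
```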
RIBES
RIBES (Isozaki et al., 2010) is regularly used by the research community.
This metric was designed for “distant language pairs” with very different sentence structures.
For instance, translating English into Japanese requires a significant word reordering since the verb in Japanese is located at the end of the sentence while in English it is usually placed before the complement.
The authors of RIBES found that the metrics available at that time, in 2010, were not sufficiently penalizing incorrect word order and thus proposed this new metric instead.
An implementation of RIBES is available on Github (GNU General Public License V2.0).
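To make the word-order intuition concrete, here is a toy sketch (not the actual RIBES metric, which also includes unigram precision penalties) that measures how well the hypothesis preserves the word order of the reference with a Kendall’s-tau-like rank correlation:

```python
# Toy sketch (NOT the actual RIBES metric): a Kendall's-tau-like rank
# correlation over the reference positions of the hypothesis words that also
# appear in the reference. RIBES itself combines such a rank correlation with
# unigram precision penalties (Isozaki et al., 2010).

def word_order_correlation(hypothesis: str, reference: str) -> float:
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    # Simplification: map each matched hypothesis word to its first position
    # in the reference (repeated words are handled crudely here).
    ranks = [ref_tokens.index(tok) for tok in hyp_tokens if tok in ref_tokens]
    n = len(ranks)
    if n < 2:
        return 0.0
    pairs = n * (n - 1) / 2
    concordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if ranks[i] < ranks[j]
    )
    # Rescale the fraction of concordant pairs to [-1, 1].
    return 2 * concordant / pairs - 1

print(word_order_correlation(
    "The cat sleeps in the kitchen so cook somewhere else.",
    "The cat is sleeping in the kitchen, so you should cook somewhere else.",
))
```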
METEOR
METEOR (Banerjee and Lavie, 2005) was first proposed in 2005 with the objective of correcting several flaws of traditional metrics available at that time.
For instance, BLEU only counts exact token matches. This is too strict, since BLEU doesn’t reward a word that isn’t exactly the same as the word in the reference translation, even if the two have a similar meaning. As such, BLEU is blind to many valid translations.
METEOR partly corrects this flaw by introducing more flexibility in the matching. Synonyms, word stems, and even paraphrases are accepted as valid matches, effectively improving the recall of the metric. The metric also implements a weighting mechanism to give more importance, for instance, to an exact match over a stem match.
The metric is computed as a harmonic mean of precision and recall, with the particularity that recall is weighted higher than precision.
METEOR better correlates with human evaluation than BLEU, and has been improved multiple times until 2015. It is still regularly used nowadays.
METEOR has an official webpage maintained by CMU which provides the original implementation of the metric (unknown license).
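If you prefer a Python implementation, NLTK ships a METEOR reimplementation. Here is a minimal sketch assuming a recent NLTK version; note that it requires WordNet data and pre-tokenized inputs, and it may not exactly reproduce the scores of the original CMU tool:

```python
# Minimal sketch: METEOR via NLTK's reimplementation (may differ slightly from
# the original CMU tool).
# pip install nltk
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # synonym matching relies on WordNet
                          # (some NLTK versions also need "omw-1.4")

hypothesis = "The cat sleeps in the kitchen so cook somewhere else.".split()
reference = "The cat is sleeping in the kitchen, so you should cook somewhere else.".split()

# meteor_score takes a list of pre-tokenized references and one pre-tokenized hypothesis.
print(meteor_score([reference], hypothesis))
```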
TER
TER (Snover et al., 2006) is mainly used to evaluate the effort it would take for a human translator to post-edit a translation.
Definition
Post-editing in machine translation is the action of correcting a machine translation output into an acceptable translation. Machine translation followed by post-editing is a standard pipeline used in the translation industry to reduce translation cost.
There are two well-known variants: TERp (Snover et al., 2009) and HTER (Snover et al., 2009, Specia and Farzindar, 2010).
TERp is TER augmented with a paraphrase database to improve the recall of the metric and its correlation with human evaluation. A match between the hypothesis and the reference is counted if a token, or one of its paraphrases, from the translation hypothesis is in the reference translation.
HTER, standing for “Human TER”, is a standard TER computed between the machine translation hypothesis and its post-edited version produced by a human. It can be used to evaluate the cost, a posteriori, of post-editing a particular translation.
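TER is also available in SacreBLEU. Here is a minimal sketch, assuming SacreBLEU’s metrics API (remember that, unlike most metrics, a lower TER is better since it counts edits):

```python
# Minimal sketch: TER with SacreBLEU (lower is better).
# pip install sacrebleu
from sacrebleu.metrics import TER

hypotheses = ["The cat sleeps in the kitchen so cook somewhere else."]
references = [["The cat is sleeping in the kitchen, so you should cook somewhere else."]]

ter = TER()
print(ter.corpus_score(hypotheses, references).score)
```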
CharacTER
The name of the metric already gives some hints on how it works: This is the TER metric applied at character level. Shift operations are performed at word level.
The edit distance obtained is also normalized by the length of the translation hypothesis.
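To make this concrete, here is a toy sketch of a characTER-like score: a character-level edit distance normalized by the length of the hypothesis. This is not the authors’ implementation (linked below), and it omits the word-level shift operations:

```python
# Toy sketch of a characTER-like score: character-level edit distance
# normalized by the length of the translation hypothesis. This omits the
# word-level shift operations of the real characTER metric.

def char_edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance computed at the character level."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_like_score(hyp: str, ref: str) -> float:
    """Edit distance normalized by hypothesis length (lower is better)."""
    return char_edit_distance(hyp, ref) / max(len(hyp), 1)

print(character_like_score(
    "The cat sleeps in the kitchen so cook somewhere else.",
    "The cat is sleeping in the kitchen, so you should cook somewhere else.",
))
```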
CharacTER (Wang et al., 2016) has one of the highest correlations with human evaluation among the traditional metrics.
Nonetheless, it remains less used than other metrics. I couldn’t find any papers that used it recently.
The implementation of characTER by its authors is available on Github (unknown license).
Neural metrics take a very different approach from the traditional metrics.
They estimate a translation quality score using neural networks.
To the best of my knowledge, ReVal, proposed in 2015, was the first neural metric with the objective of computing a translation quality score.
Since ReVal, new neural metrics are regularly proposed for evaluating machine translation.
The research effort in machine translation evaluation is now almost exclusively focusing on neural metrics.
Yet, as we will see, despite their superiority, neural metrics are far from popular. While neural metrics have been around for almost 8 years, traditional metrics are still overwhelmingly preferred, at least by the research community (the situation is probably different in the machine translation industry).
Advantages:
- Good correlation with human evaluation: Neural metrics are state-of-the-art for machine translation evaluation.
- No preprocessing required: This is mainly true for recent neural metrics such as COMET and BLEURT. The preprocessing, such as tokenization, is done internally and transparently by the metric, i.e., the users don’t need to care about it.
- Better recall: Thanks to the exploitation of embeddings, neural metrics can reward translations even when they don’t exactly match the reference. For instance, a word that has a meaning similar to a word in the reference will likely be rewarded by the metric, in contrast to traditional metrics that can only reward exact matches.
- Trainable: This may be an advantage as well as a disadvantage. Most neural metrics must be trained. It is an advantage if you have training data for your specific use case. You can fine-tune the metric to best correlate with human judgments. However, if you don’t have the specific training data, the correlation with human evaluation will be far from optimal.
Disadvantages:
- High computational cost: Neural metrics don’t require a GPU but are much faster if you have one. Yet, even with a GPU, they are significantly slower than traditional metrics. Some metrics relying on large language models such as BLEURT and COMET also require a significant amount of memory. Their high computational cost also makes statistical significance testing extremely costly.
- Unexplainable: Understanding why a neural metric yields a particular score is nearly impossible since the neural model behind it often leverages millions or billions of parameters. Improving the explainability of neural models is a very active research area.
- Difficult to maintain: Older implementations of neural metrics don’t work anymore if they were not properly maintained. This is mainly due to the changes in nVidia CUDA and/or frameworks such as (py)Torch and Tensorflow. Potentially, the current version of the neural metrics we use today won’t work in 10 years.
- Not reproducible: Neural metrics usually come with many more hyperparameters than traditional metrics. Those are largely underspecified in the scientific publications using them. Therefore, reproducing a particular score for a particular dataset is often impossible.
ReVal
To the best of my knowledge, ReVal (Gupta et al., 2015) is the first neural metric proposed to evaluate machine translation quality.
ReVal was a significant improvement over traditional metrics, with a much better correlation with human evaluation.
The metric is based on an LSTM and is very simple, but has never been used in machine translation research as far as I know.
It is now outperformed by more recent metrics.
If you are interested to understand how it works, you can still find ReVal’s original implementation on Github (GNU General Public License V2.0).
YiSi
YiSi (Chi-kiu Lo, 2019) is a very versatile metric. It mainly exploits an embedding model but can be augmented with various resources such as a semantic parser, a large language model (BERT), and even features from the source text and source language.
Using all these options can make it fairly complex and reduces its scope to a few language pairs. Moreover, the gains in terms of correlation with human judgments when using all these options are not obvious.
Nonetheless, the metric itself, using just the original embedding model, shows a very good correlation with human evaluation.
The author showed that for evaluating English translations YiSi significantly outperforms traditional metrics.
The original implementation of YiSi is publicly available on Github (MIT license).
BERTScore
BERTScore (Zhang et al., 2020) exploits the contextual embeddings of BERT for each token in the evaluated sentence and compares them with the token embeddings of the reference.
It is one of the first metrics to adopt a large language model for evaluation. It wasn’t proposed specifically for machine translation but rather for any language generation task.
BERTScore is the most used neural metric in machine translation evaluation.
A BERTScore implementation is available on Github (MIT license).
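For illustration, here is a minimal sketch assuming the bert_score package’s high-level API (it downloads a pre-trained model on first use):

```python
# Minimal sketch: BERTScore with the bert_score package.
# pip install bert-score
from bert_score import score

hypotheses = ["The cat sleeps in the kitchen so cook somewhere else."]
references = ["The cat is sleeping in the kitchen, so you should cook somewhere else."]

# Returns precision, recall, and F1 tensors (one value per segment).
P, R, F1 = score(hypotheses, references, lang="en")
print(F1.mean().item())
```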
BLEURT
BLEURT (Sellam et al., 2020) is another metric relying on BERT but that can be specifically trained for machine translation evaluation.
More precisely, it is a BERT model fine-tuned on synthetic data: sentences from Wikipedia paired with random perturbations of different kinds. Note: This step is confusingly denoted “pre-training” by the authors (see note 3 in the paper), but it actually comes after the original pre-training of BERT. The perturbations are:
- Masked word (as in the original BERT)
- Dropped word
- Backtranslation (i.e., sentences generated by a machine translation system)
Each sentence pair is evaluated during training with several losses. Some of these losses are computed with existing evaluation metrics, such as BLEU, ROUGE, and BERTScore.
Finally, in a second phase, BLEURT is fine-tuned on translations and their rating provided by humans.
Intuitively, thanks to the use of synthetic data that may resemble machine translation errors or outputs, BLEURT is much more robust to quality and domain drifts than BERTScore.
Moreover, since BLEURT exploits a combination of metrics as “pre-training signals”, it is intuitively better than each one of these metrics, including BERTScore.
However, BLEURT is very costly to train. I am only aware of BLEURT checkpoints released by Google. Note: If you are aware of other models, please let me know in the comments.
The first version was only trained for English, but the newer version, denoted BLEURT-20, now includes 19 more languages. Both BLEURT versions are available in the same repository.
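For illustration, here is a minimal sketch assuming Google’s BLEURT package and the BLEURT-20 checkpoint; the checkpoint has to be downloaded and unzipped beforehand, and the path below is only illustrative:

```python
# Minimal sketch: BLEURT with Google's implementation and the BLEURT-20 checkpoint.
# pip install git+https://github.com/google-research/bleurt.git
# The checkpoint must be downloaded and unzipped separately; the path is illustrative.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")

scores = scorer.score(
    references=["The cat is sleeping in the kitchen, so you should cook somewhere else."],
    candidates=["The cat sleeps in the kitchen so cook somewhere else."],
)
print(scores)  # a list of floats, one per segment (higher is better)
```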
Prism
In their work proposing Prism, Thompson and Post (2019) intuitively argue that machine translation and paraphrasing evaluation are very similar tasks. Their only difference is that the source language is not the same.
Indeed, with paraphrasing, the objective is to generate a new sentence A’, given a sentence A, with A and A’ having the same meaning. Assessing how close A and A’ are is identical to assessing how close a translation hypothesis is to a given reference translation. In other words: is the translation hypothesis a good paraphrase of the reference translation?
Prism is a neural metric trained on a large multilingual parallel dataset through a multilingual neural machine translation framework.
Then, at inference time, the trained model is used as a zero-shot paraphraser to score the similarity between a source text (the translation hypothesis) and the target text (the reference translation) that are both in the same language.
The main advantage of this approach is that Prism doesn’t need any human evaluation training data nor any paraphrasing training data. The only requirement is to have parallel data for the languages you plan to evaluate.
While Prism is original, convenient to train, and seems to outperform most other metrics (including BLEURT), I couldn’t find any machine translation research publication using it.
The original implementation of Prism is publicly available on Github (MIT license).
COMET
COMET (Rei et al., 2020) is a more supervised approach also based on a large language model. The authors selected XLM-RoBERTa but mention that other models such as BERT could also work with their approach.
In contrast to most other metrics, COMET exploits the source sentence. The large language model is thus fine-tuned on triplets of {source sentence, translation hypothesis, reference translation}.
The metric is trained using human ratings (the same ones used by BLEURT).
COMET is much simpler to train than BLEURT since it doesn’t require the generation and scoring of synthetic data.
COMET is available in many versions, including distilled models (COMETINHO) that have a much smaller memory footprint.
The released implementation of COMET (Apache license 2.0) also includes a tool to efficiently perform statistical significance testing.
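For illustration, here is a minimal sketch assuming the unbabel-comet package; the checkpoint name is only an example, and note that COMET needs the source sentence in addition to the hypothesis and the reference:

```python
# Minimal sketch: COMET with the unbabel-comet package.
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# The checkpoint name is illustrative; other COMET models are available.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Le chat dort dans la cuisine donc tu devrais cuisiner ailleurs.",
    "mt":  "The cat sleeps in the kitchen so cook somewhere else.",
    "ref": "The cat is sleeping in the kitchen, so you should cook somewhere else.",
}]

# gpus=0 runs on CPU; set gpus=1 if a GPU is available.
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```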
Machine translation evaluation is a very active research area. Neural metrics are getting better and more efficient every year.
Yet, traditional metrics such as BLEU remain the favorites of machine translation practitioners, mainly out of habit.
In 2022, the Conference on Machine Translation (WMT22) published a ranking of evaluation metrics according to their correlation with human evaluation, including metrics I presented in this article.
COMET and BLEURT rank at the top, while BLEU appears at the bottom. Interestingly, the ranking also includes some metrics that I didn’t write about in this article. Some of them, such as MetricX XXL, are undocumented.
Despite having countless better alternatives, BLEU remains by far the most used metric, at least in machine translation research.
Personal recommendations:
When I review scientific papers for conferences and journals, I always recommend the following to the authors who only use BLEU for machine translation evaluation:
- Add the results for at least one neural metric such as COMET or BLEURT, if the language pair is covered by these metrics.
- Add the results for chrF (not chrF++). While chrF is not state-of-the-art, it is significantly better than BLEU, yields scores that are easily reproducible, and can be used for diagnostic purposes.