Language translation has become an essential tool in our increasingly globalized world. Whether you are a developer, a researcher, or a traveler, you will often need to communicate with people from different cultures, so the ability to translate text quickly and accurately can be very useful. A powerful resource for achieving this is the MarianMT model, part of the Hugging Face Transformers library.
In this guide, we will walk you through the process of using MarianMT to translate text between multiple languages, making it accessible even to those with minimal technical knowledge.
What is MarianMT?
MarianMT is a machine translation framework based on the Transformer architecture, widely recognized for its effectiveness in natural language processing tasks. Developed using the Marian C++ library, MarianMT models have the enormous advantage of being fast. Hugging Face has incorporated MarianMT into its Transformers library, making it easy to access and use via Python.
Step-by-step guide to using MarianMT
1. Installation
To get started, you need to install the required libraries. Make sure you have Python installed on your system and then run the following command to install the Hugging Face Transformers library:
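pip install transformers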
You will also need the torch library to handle the model calculations:
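pip install torch

Depending on your environment, you may also need the sentencepiece package, which the Marian tokenizer relies on:

pip install sentencepiece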
2. Choosing a model
MarianMT models are pre-trained on a large number of language pairs. On the Hugging Face Hub, the models follow the naming convention Helsinki-NLP/opus-mt-{src}-{tgt}, where {src} and {tgt} are the source and target language codes, respectively. For example, the model Helsinki-NLP/opus-mt-en-fr translates from English to French.
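If it helps, here is a quick sketch of how the naming convention maps to concrete model IDs. The en-de pair below is just an illustration; check the Hugging Face Hub to confirm that the pair you need is available:

# Build a MarianMT model ID from source and target language codes
src, tgt = "en", "de"
model_name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
print(model_name)  # Helsinki-NLP/opus-mt-en-de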
3. Loading the model and tokenizer
Let's say you decide to translate from English into a specific language, for example French. In that case, you will need to load the correct model and its corresponding tokenizer. Here's how to load both:
from transformers import MarianMTModel, MarianTokenizer
# Specify the model name
model_name = "Helsinki-NLP/opus-mt-en-fr"
# Load the tokenizer and model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
4. Text translation
Now that you have your model and tokenizer ready, you can translate text in four simple steps. Here is a basic example. First, store the text you want to translate in a variable.
# Define the source text
src_text = ("this is a sentence in English that we want to translate to French")
Since transformers (or any machine learning model) do not understand raw text, we need to convert the source text into a numeric format. To do this, we tokenize the text. For an in-depth look at how tokenization works, you can refer to my tokenization tutorial.
# Tokenize the source text
inputs = tokenizer(src_text, return_tensors="pt", padding=True)
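If you are curious about what the tokenizer produced, you can inspect the returned dictionary. It contains the token IDs and an attention mask (the exact IDs depend on the model's vocabulary):

# Inspect the tokenized inputs
print(inputs["input_ids"])      # tensor of token IDs, shape (batch_size, sequence_length)
print(inputs["attention_mask"]) # 1 for real tokens, 0 for padding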
We then pass the tokenized input to the model, which generates the translation as a sequence of token IDs.
# Generate the translation
translated = model.generate(**inputs)
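By default, generate uses the model's standard decoding settings. If you want more control, it also accepts common decoding options such as num_beams and max_length; the values below are only examples, not recommended settings:

# Optional: beam search and a length cap
translated = model.generate(**inputs, num_beams=4, max_length=512)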
Note that the model outputs token IDs, not text. We need to decode these tokens back into text to obtain the human-readable translation.
# Decode the translated text
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text)
In the above code, the output will be the text translated into French:
["c'est une phrase en anglais que nous voulons traduire en français"]
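As a shortcut, the tokenizer also offers batch_decode, which decodes all generated sequences in one call and is equivalent to the loop above:

# Equivalent: decode all outputs at once
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(tgt_text)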
5. Translation into multiple languages
If you want to translate English text into multiple languages, you can use a multilingual model. For example, the model Helsinki-NLP/opus-mt-en-ROMANCE can translate from English into several Romance languages (French, Portuguese, Spanish, etc.). Specify the target language by prepending its language code to the source text:
src_text = (
">>fr<< this is a sentence in English that we want to translate to French",
">>pt<< This should go to Portuguese",
">>es<< And this to Spanish",
)
# Specify the multilingual model
model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
# Tokenize the source text
inputs = tokenizer(src_text, return_tensors="pt", padding=True)
# Generate the translations
translated = model.generate(**inputs)
# Decode the translated text
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text)
The output would look like this:
("c'est une phrase en anglais que nous voulons traduire en français",
'Isto deve ir para o português.',
'Y esto al español')
With this setup, you can easily translate English text into French, Portuguese, and Spanish. Several other language groups are available besides ROMANCE. Here is the list of groups and their member language codes:
GROUP_MEMBERS = {
'ZH': ('cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'),
'ROMANCE': ('fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'),
'NORTH_EU': ('de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'),
'SCANDINAVIA': ('da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'),
'SAMI': ('se', 'sma', 'smj', 'smn', 'sms'),
'NORWAY': ('nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'),
'CELTIC': ('ga', 'cy', 'br', 'gd', 'kw', 'gv')
}
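To tie all the steps together, here is a minimal sketch of a reusable helper. The translate_text function and its default model are illustrative choices, not part of the Transformers library:

from transformers import MarianMTModel, MarianTokenizer

def translate_text(texts, model_name="Helsinki-NLP/opus-mt-en-fr"):
    """Translate a list of sentences with the given MarianMT model."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    translated = model.generate(**inputs)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

# Example usage
print(translate_text(["How are you today?"]))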
Wrapping Up
Using MarianMT models with the Hugging Face Transformers library offers a powerful and flexible way to perform language translations. Whether you are translating text for personal use, research, or integrating translation capabilities into your applications, MarianMT offers a reliable and easy-to-use solution. With the steps outlined in this guide, you can begin translating languages efficiently and effectively.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She is the co-author of the eBook "Maximizing Productivity with ChatGPT." As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change and founded FEMCodes to empower women in STEM fields.