Defaults are never the best
When developing applications using neural models, it is common to try different hyperparameters to train the models.
For example, the learning rate, the learning rate schedule, and the dropout rates are important hyperparameters that have a significant impact on the learning curve of your models.
What is much less common is a search for the best decoding hyperparameters. If you read a deep learning tutorial or a scientific paper on natural language processing applications, there is a high chance that the hyperparameters used for inference are not even mentioned.
Most authors, myself included, don’t bother looking for the best decoding hyperparameters and use the default ones.
However, these hyperparameters can have a significant impact on the results, and whatever decoding algorithm you are using, there are always some hyperparameters that should be adjusted for best results.
In this blog post, I show the impact of decoding hyperparameters with simple Python examples and a machine translation application. I focus on beam search, since it is by far the most popular decoding algorithm, and on two hyperparameters in particular.
To demonstrate the effect and importance of each hyperparameter, I'll show some examples produced using the Hugging Face Transformers package in Python.
To install this package, run the following command in your terminal (I recommend doing it in a separate conda environment):
pip install transformers
I will use GPT-2 (MIT license) to generate simple sentences.
I will also run other examples in machine translation using Marian (MIT license). I installed it on Ubuntu 20.04, following the official instructions.
Beam search is probably the most popular decoding algorithm for language generation tasks.
At each time step, that is, for each new generated token, it keeps the k most probable hypotheses according to the model used for inference, and discards the rest.
Finally, at the end of the decoding, the most likely hypothesis will be the output.
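To make these steps concrete, here is a minimal toy sketch of beam search in Python. The next-token log-probabilities are hard-coded and identical at every step, which is of course not what a real model does; the point is only to show how the k most probable hypotheses are kept at each step and the rest discarded.
import math

#Hypothetical next-token log-probabilities, used at every step regardless of context
VOCAB_LOGPROBS = {"the": math.log(0.5), "a": math.log(0.3), "cat": math.log(0.2)}

def beam_search(k, steps):
    #Each hypothesis is a (tokens, cumulative log-probability) pair
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, logp in VOCAB_LOGPROBS.items():
                candidates.append((tokens + [token], score + logp))
        #Keep only the k most probable hypotheses at this time step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    #The most probable remaining hypothesis is the output
    return beams

for tokens, score in beam_search(k=2, steps=3):
    print(" ".join(tokens), round(score, 3))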
k, usually called the “beam size”, is a very important hyperparameter.
With a higher k, you get a more probable hypothesis. Note that when k=1, we speak of “greedy search”, since we only keep the most probable hypothesis at each time step.
By default, in most applications, k is arbitrarily set between 1 and 10, values that may seem very low.
There are two main reasons for this:
- Increasing k increases decoding time and memory requirements. In other words, decoding becomes more expensive.
- A higher k may produce more probable but worse results. This is mainly, but not only, due to the length of the hypotheses: longer hypotheses tend to have a lower probability, so beam search will tend to promote shorter hypotheses that may be undesirable for some applications.
The first point can easily be addressed with more efficient batch decoding and by investing in better hardware.
Length bias can be controlled through another hyperparameter that normalizes the probability of a hypothesis by its length (number of tokens) at each time step. There are numerous ways to perform this normalization. One of the most used equations was proposed by Wu et al. (2016):
lp(Y) = (5 + |Y|)^α / (5 + 1)^α
where |Y| is the length of the hypothesis and α is a hyperparameter usually set between 0.5 and 1.0.
The lp(Y) score is then used to normalize the score of the hypothesis, biasing the decoding toward producing longer or shorter hypotheses depending on α.
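As a quick illustration, here is a minimal sketch of this normalization, assuming, as in Wu et al. (2016), that the score being normalized is the cumulative log-probability of the hypothesis divided by lp(Y); the two hypotheses and their log-probabilities below are made up.
#lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha (Wu et al., 2016)
def length_penalty(length, alpha):
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

#The cumulative log-probability of a hypothesis is divided by lp(Y)
def normalized_score(logprob, length, alpha):
    return logprob / length_penalty(length, alpha)

#Made-up example: a short and a long hypothesis with different log-probabilities
short_hyp = (-4.0, 5)    #(log-probability, length in tokens)
long_hyp = (-6.0, 15)
for alpha in (0.0, 0.5, 1.0):
    print(alpha,
          round(normalized_score(*short_hyp, alpha), 3),
          round(normalized_score(*long_hyp, alpha), 3))
With α=0.0 the shorter hypothesis gets the better score, while with α=1.0 the normalization is strong enough for the longer one to win, which is exactly the length bias trade-off described above.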
The implementation in Hugging Face Transformers may be slightly different, but there is such an α that you can pass as “length_penalty” to the generate function, as in the following example (adapted from the Transformers documentation):
from transformers import AutoTokenizer, AutoModelForCausalLM

#Download and load the tokenizer and model for gpt2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
#Prompt that will initiate the inference
prompt = "Today I believe we can finally"
#Encoding the prompt with tokenizer
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
#Generate up to 20 tokens
outputs = model.generate(input_ids, length_penalty=0.5, num_beams=4, max_length=20)
#Decode the output into something readable
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
“num_beams” in this code example is our other hyperparameter k.
With this code example, the prompt “Today I believe we can finally”, k=4, and α=0.5, we get:
outputs = model.generate(input_ids, length_penalty=0.5, num_beams=4, max_length=20)
Today I believe we can finally get to the point where we can make the world a better place.
With k=50 and α=1.0, we obtain:
outputs = model.generate(input_ids, length_penalty=1.0, num_beams=50, max_length=30)
Today I believe we can finally get to where we need to be," he said.\n\n"
You can see that the results are not exactly the same.
k and α should be tuned independently for your target task, using some development dataset.
Let’s take a concrete example in machine translation to see how to do a simple grid search to find the best hyperparameters and their impact on a real use case.
For these experiments, I use Marian with a machine translation model trained on the TILDE RAPID corpus (CC-BY 4.0) to translate from French to English.
I used only the first 100k lines of the dataset for training and the last 6k lines as devtest. I divided the devtest into two parts of 3k lines each: the first part is used for validation and the second part for evaluation. Note: the RAPID corpus has its sentences sorted alphabetically, so my train/devtest split is not ideal for a realistic use case. I recommend shuffling the lines of the corpus, preserving the sentence pairs, before splitting it. In this article, I kept the alphabetical order and did not shuffle, to make the following experiments more reproducible.
I evaluate the quality of the translation with the COMET metric (Apache License 2.0).
To find the best pair of values for k and α with grid search, we first need to define a set of values for each hyperparameter and then try all possible combinations.
Since we are looking for decoding hyperparameters here, this search is quite fast and easy in contrast to the search for training hyperparameters.
The value sets I chose for this task are as follows:
- k: {1, 2, 4, 10, 20, 50, 100}
- α: {0.5, 0.6, 0.7, 0.8, 1.0, 1.1, 1.2}
I put in bold the values most commonly used by default in machine translation. For most natural language generation tasks, these sets of values should be tested, except maybe k=100, which is often unlikely to give the best results while being expensive to decode.
We have 7 values for k and 7 values for α. We want to test all combinations, so we need to do 7*7=49 decodings of the test dataset.
We can do that with a simple bash script:
for k in 1 2 4 10 20 50 100 ; do
  for a in 0.5 0.6 0.7 0.8 1.0 1.1 1.2 ; do
    marian-decoder -m model.npz -n $a -b $k -c model.npz.decoder.yml < test.fr > test.$k.$a.en
  done;
done;
Then, for each decoded output, we run COMET to assess the quality of the translation.
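As a sketch of this step, assuming the decoded files follow the test.$k.$a.en naming used in the script above, that the source and reference are in test.fr and test.en, and that the unbabel-comet Python package is installed, the scoring and the selection of the best (k, α) pair could look like the following. The checkpoint name and the exact predict() interface depend on the COMET version installed, so treat them as assumptions rather than the exact setup used here.
from comet import download_model, load_from_checkpoint

#Source sentences and reference translations (assumed file names)
src = [line.strip() for line in open("test.fr", encoding="utf-8")]
ref = [line.strip() for line in open("test.en", encoding="utf-8")]

#Hypothetical COMET checkpoint; use whichever COMET model you evaluate with
model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-da"))

scores = {}
for k in (1, 2, 4, 10, 20, 50, 100):
    for a in (0.5, 0.6, 0.7, 0.8, 1.0, 1.1, 1.2):
        hyp = [line.strip() for line in open(f"test.{k}.{a}.en", encoding="utf-8")]
        data = [{"src": s, "mt": m, "ref": r} for s, m, r in zip(src, hyp, ref)]
        #predict() returns segment-level scores and a corpus-level system score
        scores[(k, a)] = model.predict(data, batch_size=32, gpus=1).system_score

best = max(scores, key=scores.get)
print("Best (k, alpha):", best, "COMET:", scores[best])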
With all the results we can draw the following table of COMET scores for each pair of values:
As you can see, the result obtained with the default hyperparameters (underlined) is lower than 26 of the results obtained with other hyperparameter values.
Actually, all results in bold are statistically significantly better than the default. Note: in these experiments, I am using the test set to compute the results shown in the table. In a realistic scenario, these results should be computed on another development/validation set to decide on the pair of values to use for the test set or for real-world applications.
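The exact significance test is not detailed in this article; a common choice for comparing MT systems with metrics like COMET is paired bootstrap resampling over segment-level scores. Here is a minimal, generic sketch of that idea with made-up segment scores; it is not necessarily the test behind the table above.
import random

def paired_bootstrap(baseline, candidate, n_samples=1000, seed=0):
    #Returns the fraction of resampled test sets on which the candidate
    #outscores the baseline (1 minus this fraction is a rough p-value)
    assert len(baseline) == len(candidate)
    rng = random.Random(seed)
    indices = range(len(baseline))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(indices) for _ in indices]
        if sum(candidate[i] for i in sample) > sum(baseline[i] for i in sample):
            wins += 1
    return wins / n_samples

#Made-up segment-level scores for illustration
baseline_scores = [0.30, 0.42, 0.55, 0.61, 0.38]
candidate_scores = [0.33, 0.45, 0.54, 0.66, 0.41]
print(paired_bootstrap(baseline_scores, candidate_scores))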
Therefore, for your applications, it is definitely worth tuning the decoding hyperparameters to get better results at the cost of very little engineering effort.
In this article, we only played with two beam search hyperparameters. Many more could be tuned.
Other decoding algorithms, such as nucleus sampling, have their own hyperparameters, such as the temperature, that you may want to examine instead of using the default values.
Obviously, as we increase the number of hyperparameters to tune, grid search becomes more expensive. Only your experience and experiments with your application will tell you whether a particular hyperparameter is worth tuning and which values are most likely to produce satisfactory results.