The appearance of the BERT model led to significant progress in NLP. Deriving its architecture from the Transformer, BERT achieves state-of-the-art results on various downstream tasks: language modeling, next sentence prediction, question answering, NER tagging, etc.
Despite BERT’s excellent performance, researchers continued to experiment with its configuration in the hope of achieving even better metrics. Fortunately, they succeeded and presented a new model called RoBERTa – Robustly optimized BERT approach.
Throughout this article we will refer to the official RoBERTa paper, which contains detailed information about the model. In simple terms, RoBERTa consists of several independent improvements over the original BERT model; all other principles, including the architecture, remain the same. All of these advances are covered and explained in this article.
From the BERT architecture we remember that during pre-training BERT performs language modeling, trying to predict a certain percentage of masked tokens. The problem with the original implementation is that the mask chosen for a given text sequence is sometimes the same across different batches.
More precisely, the training data set is duplicated 10 times, so each sequence is masked in only 10 different ways. Since BERT runs for 40 training epochs, each sequence with the same masking is passed to BERT four times. As the researchers discovered, it is slightly better to use dynamic masking, meaning that the masking is generated anew every time a sequence is passed to BERT. Overall, this results in less duplicated data during training, giving the model the opportunity to work with more diverse data and masking patterns.
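A minimal sketch of the idea in plain Python (ignoring BERT’s 80/10/10 replacement rule for brevity): instead of fixing the mask once during preprocessing, a fresh mask is sampled every time the sequence is drawn in the training loop.

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # same 15% masking rate as in BERT


def dynamic_mask(tokens, mask_prob=MASK_PROB):
    """Sample a fresh mask for the sequence.

    Static masking would call this once per (duplicated) copy of the dataset;
    dynamic masking calls it every time the sequence is fed to the model.
    """
    return [MASK_TOKEN if random.random() < mask_prob else tok for tok in tokens]


sequence = ["the", "cat", "sat", "on", "the", "mat"]
for epoch in range(3):
    # a different masking pattern is generated on every pass over the data
    print(epoch, dynamic_mask(sequence))
```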
The authors of the paper conducted research to find an optimal way to handle the next sentence prediction task. As a result, they found several valuable insights:
- Removing the next sentence prediction loss results in slightly better performance.
- Passing single natural sentences to the BERT input hurts performance, compared to passing sequences consisting of multiple sentences. One of the most likely hypotheses explaining this phenomenon is the difficulty a model has in learning long-range dependencies based solely on single sentences.
- It is more beneficial to construct input sequences by sampling contiguous sentences from a single document rather than from multiple documents. Typically, sequences are constructed from contiguous complete sentences of a single document, so that the total length is at most 512 tokens. The problem arises when we reach the end of a document. In this regard, the researchers compared whether it was worth stopping sentence sampling for such sequences or, additionally, sampling the first sentences of the next document (and adding a corresponding separator token between documents). The results showed that the first option is better.
Finally, for the final implementation of RoBERTa, the authors chose to keep the first two aspects and omit the third. Despite the improvement observed with the third idea, the researchers did not pursue it because it would have made comparisons with previous implementations more problematic. The reason is that reaching the document boundary and stopping there means that an input sequence will contain fewer than 512 tokens. To have a similar number of tokens across all batches, the batch size would need to be increased in such cases. This leads to variable batch sizes and more complex comparisons, which the researchers wanted to avoid.
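A rough sketch of the two packing strategies discussed above, assuming sentence splitting and tokenization have already been done; the `</s>` separator name is an assumption, not necessarily the exact token used in the paper.

```python
MAX_TOKENS = 512
DOC_SEP = "</s>"  # separator between documents (token name is an assumption)


def pack_sequences(documents, cross_documents=True, max_tokens=MAX_TOKENS):
    """Greedily pack contiguous sentences into training sequences.

    `documents` is a list of documents, each a list of tokenized sentences.
    With cross_documents=True, packing continues into the next document after
    inserting a separator token (the variant kept in RoBERTa); with
    cross_documents=False, the sequence is cut at the document boundary even
    if it is shorter than the token budget.
    """
    sequences, current = [], []
    for doc in documents:
        for sentence in doc:
            if current and len(current) + len(sentence) > max_tokens:
                sequences.append(current)
                current = []
            current.extend(sentence)
        if cross_documents:
            if current:
                current.append(DOC_SEP)  # keep filling from the next document
        elif current:
            sequences.append(current)    # stop at the document boundary
            current = []
    if current:
        sequences.append(current)
    return sequences


docs = [
    [["the", "cat", "sat"], ["on", "the", "mat"]],
    [["a", "second", "document"]],
]
print(pack_sequences(docs, cross_documents=True))   # one sequence spanning both docs
print(pack_sequences(docs, cross_documents=False))  # one sequence per document
```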
Recent advances in NLP have shown that increasing the batch size, together with an appropriate increase of the learning rate and a decrease in the number of training steps, generally tends to improve model performance.
As a reminder, the BERT base model was trained with a batch size of 256 sequences for one million steps. The authors experimented with training BERT on batch sizes of 2K and 8K, and the latter value was chosen for training RoBERTa. The corresponding number of training steps and the learning rate were adjusted to 31K and 1e-3, respectively.
It is also important to note that larger batches are easier to parallelize; in addition, a large effective batch can be obtained even on limited hardware through a special technique called “gradient accumulation”.
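A minimal gradient accumulation sketch in PyTorch, with a toy regression model standing in for the language model: the loss of each micro-batch is scaled and back-propagated, and the optimizer step is applied only once the gradients of the whole effective batch have accumulated. With micro-batches of 256 sequences, 32 accumulation steps would reproduce an 8K effective batch.

```python
import torch
from torch import nn

# Toy setup: a tiny regression model stands in for the language model.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic micro-batches of (inputs, targets).
micro_batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(64)]

accumulation_steps = 32  # e.g. 32 micro-batches of 256 sequences -> 8K effective batch
optimizer.zero_grad()

for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()  # gradients add up in the .grad buffers

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update for the whole effective batch
        optimizer.zero_grad()
```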
In NLP there are three main types of text tokenization:
- Character-level tokenization
- Subword-level tokenization
- Word-level tokenization
The original BERT uses subword-level tokenization with a vocabulary size of 30K, which is learned after input preprocessing and with the help of various heuristics. RoBERTa instead uses bytes rather than Unicode characters as the basis for subwords and extends the vocabulary size up to 50K without any preprocessing or input tokenization. This results in 15M and 20M additional parameters for the BERT base and BERT large models, respectively. The encoding version introduced in RoBERTa demonstrates slightly worse results than before.
However, the larger vocabulary in RoBERTa allows almost any word or subword to be encoded without resorting to the unknown token, in contrast to BERT. This is a considerable advantage for RoBERTa, as the model can now better understand complex texts containing rare words.
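The difference is easy to observe with the Hugging Face transformers library (an external tool, not part of the paper), comparing the public bert-base-uncased and roberta-base tokenizers:

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

print(bert_tok.vocab_size, roberta_tok.vocab_size)  # roughly 30K vs 50K

text = "Transformers métamorphose 😊"  # text with rare / non-ASCII symbols

# WordPiece: symbols missing from the vocabulary typically map to [UNK].
print(bert_tok.tokenize(text))

# Byte-level BPE: every byte is representable, so no unknown token is needed.
print(roberta_tok.tokenize(text))
```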
Apart from this, RoBERTa applies the four aspects described above with the same architectural parameters as BERT large. The total number of parameters in RoBERTa is 355M.
RoBERTa is pre-trained on a combination of five massive datasets, resulting in a total of 160 GB of text data. In comparison, BERT large is pre-trained on only 13 GB of data. Finally, the authors increase the number of training steps from 100K to 500K.
As a result, RoBERTa outperforms BERT large and XLNet large on the most popular benchmarks.
Analogously to BERT, the researchers developed two versions of RoBERTa. Most of the hyperparameters in the base and large versions are the same. The following figure demonstrates the main differences:
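For reference, the commonly reported configurations of the two sizes are roughly the following (approximate parameter counts):

```python
# Commonly reported configurations of the two RoBERTa sizes
# (approximate values, shown here for reference).
ROBERTA_CONFIGS = {
    "base":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "~125M"},
    "large": {"layers": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "~355M"},
}
```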
The fine-tuning process in RoBERTa is similar to that of BERT.
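As an illustration, here is a minimal fine-tuning sketch using the Hugging Face transformers library; the library choice and the two-class sentiment task are assumptions made purely for illustration.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

# Hypothetical two-class sentiment task used only as an example.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One training step: the classification head is trained on top of the encoder.
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```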
In this article, we have examined an improved version of BERT that modifies the original training procedure by introducing the following aspects:
- dynamic masking
- removing the next sentence prediction objective
- training on longer input sequences
- increasing the vocabulary size
- training longer with larger batches on more data
The resulting RoBERTa model appears to be superior to its predecessors on top benchmarks. Despite its more complex setup, RoBERTa adds only 15M additional parameters while maintaining an inference speed comparable to BERT’s.
All images, unless otherwise noted, are the author’s.