Natural language processing, or NLP, is a field within artificial intelligence that enables machines to understand textual data. NLP research has been around for a long time, but it has only recently become more prominent with the arrival of big data and greater computational processing power.
As the field has grown, many researchers have tried to improve machines' ability to understand textual data, and many techniques have been proposed and applied in NLP.
This article will compare various techniques for processing text data in NLP. It will focus on RNNs, Transformers, and BERT, because these are the ones most often used in research. Let's get into it.
RNN
Recurrent neural networks, or RNNs, were developed in the 1980s but have only recently gained traction in NLP. An RNN is a type of neural network designed for sequential data, that is, data whose elements cannot be treated independently of each other. Examples of sequential data are time series, audio, and text, basically any data with a meaningful order.
RNNs differ from normal feedforward neural networks in how they process information. In a feedforward network, information flows in one direction through the layers. An RNN instead uses a loop, feeding its previous output back in as part of the next input. To understand the difference, let's look at the image below.
Image by author
As you can see, the RNN model implements a loop cycle during information processing. When handling new input, the RNN considers both the current input and the previous one. That is why the model is suitable for any type of sequential data.
To take an example with text data, imagine we have the sentence "I wake up at 7 am" and feed it in word by word. In a feedforward neural network, by the time we reach the word "up", the model has already forgotten the words "I" and "wake". An RNN, however, feeds the output for each word back into the loop, so the model does not forget them.
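To make the loop concrete, here is a minimal sketch of a vanilla RNN cell in NumPy; the word vectors, sizes, and weights are made-up toy values, not the output of any trained model. The point is that the same weights are reused at every word, and the hidden state carries information from all the earlier words forward.

```python
# A minimal sketch of a vanilla RNN cell in NumPy (illustrative toy values only).
import numpy as np

rng = np.random.default_rng(0)

# Toy "embeddings" for the sentence "I wake up at 7 am" (6 words, 4 dims each).
sentence = rng.normal(size=(6, 4))

hidden_size = 3
W_xh = rng.normal(size=(4, hidden_size))            # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights (the "loop")
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)  # hidden state starts empty
for x_t in sentence:
    # The same weights are reused at every time step, and the previous
    # hidden state h carries information about all earlier words.
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)

print(h)  # a summary of the whole sequence, influenced by every word
```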
In the field of NLP, RNNs are used in many textual applications, such as text classification and text generation. They are often used in word-level tasks such as part-of-speech tagging and next-word generation.
If we look deeper into RNNs for textual data, we find several variants. For example, the image below shows the many-to-many type.
Image by author
Looking at the image above, we can see that the output at each step (time step in RNN terms) is produced one step at a time, and each iteration always takes the previous information into account.
Another type of RNN used in many NLP applications is the encoder-decoder (Sequence-to-Sequence) type. The structure is shown in the image below.
Image by author
This structure introduces two parts. The first part is the encoder, which receives the input sequence and creates a new representation of it. That representation is passed to the second part of the model, the decoder. With this structure, the input and output lengths do not have to be equal. A typical use case is language translation, where input and output often differ in length.
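As a rough illustration of that split, here is a minimal encoder-decoder RNN sketch in PyTorch; the vocabulary sizes, dimensions, and random token IDs are illustrative assumptions, not a real translation setup. Notice that the source and target lengths are independent.

```python
# A minimal encoder-decoder (sequence-to-sequence) RNN sketch in PyTorch.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 32, 64  # toy sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                 # src: (batch, src_len)
        _, h = self.rnn(self.emb(src))      # final hidden state = the "representation"
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt, h):              # tgt: (batch, tgt_len)
        o, h = self.rnn(self.emb(tgt), h)
        return self.out(o), h               # logits for every target position

# Input and output lengths are independent: 7 source tokens, 5 target tokens.
src = torch.randint(0, SRC_VOCAB, (1, 7))
tgt = torch.randint(0, TGT_VOCAB, (1, 5))
context = Encoder()(src)
logits, _ = Decoder()(tgt, context)
print(logits.shape)  # torch.Size([1, 5, 1200])
```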
There are several benefits to using RNN to process natural language data, including:
- RNN can be used to process text input without length limitations.
- The model shares the same weights across all time steps, so the network reuses the same parameters at every step.
- Having the memory of past inputs makes RNN suitable for any sequential data.
But there are also several disadvantages:
- RNNs are susceptible to vanishing and exploding gradients. A vanishing gradient is a value close to zero, so the network weights are updated only by tiny amounts; an exploding gradient is so large that it assigns an unrealistically enormous importance to the network weights.
- Long training time due to the sequential nature of the model.
- Short-term memory: the model starts to forget earlier information as the input sequence gets longer. There is an extension of RNN called LSTM that alleviates this problem (see the sketch after this list).
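As a quick illustration of that last point, here is a sketch, assuming PyTorch and toy dimensions, of how a vanilla RNN layer can be swapped for an LSTM, whose gating mechanism helps the model retain information over longer sequences.

```python
# Sketch: swapping a vanilla RNN layer for an LSTM in PyTorch (toy dimensions).
import torch
import torch.nn as nn

x = torch.randn(1, 100, 16)                  # batch of 1, a 100-step sequence, 16 features

vanilla = nn.RNN(16, 32, batch_first=True)   # prone to vanishing gradients on long inputs
lstm = nn.LSTM(16, 32, batch_first=True)     # gating helps retain information longer

out_rnn, _ = vanilla(x)
out_lstm, (h_n, c_n) = lstm(x)               # the LSTM also keeps a separate cell state c_n
print(out_rnn.shape, out_lstm.shape)         # both: torch.Size([1, 100, 32])
```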
Transformers
Transformers is an NLP model architecture that tries to solve the sequence-to-sequence problems encountered with RNNs. As mentioned above, RNNs struggle with short-term memory: the longer the input, the more the model forgets earlier information. This is where the attention mechanism helps.
The attention mechanism was introduced by Bahdanau et al. (2014) to solve the long-input problem, especially for encoder-decoder RNNs. This article will not explain the attention mechanism in detail; basically, it is a layer that allows the model to focus on the critical parts of the input while producing the output prediction. For example, in a translation task the input word "clock" would have a high attention weight toward "jam", the Indonesian word for clock.
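To give a feel for the idea, below is a small NumPy sketch of attention. It uses the scaled dot-product formulation rather than the additive score from Bahdanau et al. (2014), and all vectors are random toy values; the point is only that each output position computes a weighted focus over all input positions.

```python
# Sketch of the core attention idea (scaled dot-product variant, toy values).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores: how relevant each input position (K) is to each query (Q)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ V, weights             # weighted sum of the values

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))    # e.g. 6 source words, 8-dim states
decoder_state = rng.normal(size=(1, 8))     # current target position

context, weights = attention(decoder_state, encoder_states, encoder_states)
print(weights.round(2))                     # which source words the model "focuses" on
```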
The transformer model was introduced by Vaswani et al. (2017). Its architecture is inspired by the RNN encoder-decoder, is built around the attention mechanism, and does not process data in sequential order. The overall transformer model is structured as in the image below.
Transformer architecture (Vaswani et al., 2017)
In the structure above, the encoder turns the input sequence into word embeddings with positional encoding, while the decoder transforms the encoded representation into the output sequence. With the attention mechanism in place, each part of the encoded input can be weighted by its importance.
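A minimal sketch of that encoding path in PyTorch, assuming toy vocabulary and model sizes, might look like the following: token embeddings plus a sinusoidal positional encoding are fed to a transformer encoder, which processes all positions in parallel rather than step by step.

```python
# Sketch: word embeddings + positional encoding fed to a transformer encoder.
import math
import torch
import torch.nn as nn

VOCAB, D_MODEL, MAX_LEN = 1000, 64, 128  # toy sizes

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

emb = nn.Embedding(VOCAB, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randint(0, VOCAB, (1, 10))                      # 10 input tokens
x = emb(tokens) + positional_encoding(MAX_LEN, D_MODEL)[:10]   # inject word order
encoded = encoder(x)                                           # all positions processed in parallel
print(encoded.shape)                                           # torch.Size([1, 10, 64])
```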
Transformers offer a few advantages compared to the other models, including:
- The parallelization process increases the speed of training and inference.
- The ability to process longer inputs, offering a better understanding of context.
There are still some disadvantages of the transformer model:
- High processing and computational demand.
- The attention mechanism can only handle inputs up to a certain length, so longer text may need to be split.
- Context can be lost if the text is split in the wrong place (a simple overlapping-chunk workaround is sketched after this list).
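A common workaround for the length limit, sketched below under the assumption of a BERT-style 512-token budget (510 tokens plus two special tokens), is to split the token sequence into overlapping chunks so that some context is preserved across the boundaries.

```python
# Sketch: split a long token sequence into overlapping chunks that each fit
# within a transformer's length limit; the overlap reduces context lost at
# the chunk boundaries. The 510-token budget is an assumption.
def chunk_tokens(tokens, max_len=510, stride=128):
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride   # step forward, keeping `stride` tokens of overlap
    return chunks

tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])   # 3 [510, 510, 436]
```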
BERT
BERT, or Bidirectional Encoder Representations from Transformers, is a model developed by Devlin et al. (2019) that involves two steps (pre-training and fine-tuning) to create the model. Architecturally, BERT is a stack of transformer encoders (BERT Base has 12 layers, while BERT Large has 24).
The general development of the BERT model can be shown in the following image.
General BERT procedures (Devlin et al., 2019)
The pre-training tasks train the model simultaneously, and once pre-training is complete, the model can be fine-tuned for various downstream tasks (question answering, classification, etc.).
What makes BERT special is that it is the first unsupervised, bidirectional language model pre-trained on text data. BERT was pre-trained on the entire English Wikipedia and a book corpus, totaling more than 3 billion words.
BERT is considered bidirectional because it does not read the input sequentially (left to right or right to left); instead, its transformer encoder reads the entire sequence of words at once. This is what allows the model to understand the full context of the input data.
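To see this bidirectional context in practice, here is a sketch that uses the Hugging Face transformers library (a tooling choice for illustration, not something prescribed by the BERT paper) to get contextual embeddings from a pre-trained BERT model; the same word receives different vectors in different sentences.

```python
# Sketch: contextual, bidirectional embeddings from a pre-trained BERT model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word "bank" appears in two different contexts.
sentences = ["I sat by the river bank.", "I deposited cash at the bank."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has one vector per token, computed from the *whole* sentence,
# so the two "bank" vectors differ because their surrounding context differs.
print(outputs.last_hidden_state.shape)   # (2, seq_len, 768)
```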
To achieve bidirectionality, BERT uses two techniques:
- Masked Language Model (MLM): a word-masking technique. It masks 15% of the input words and tries to predict the masked words from the unmasked ones (a small fill-mask demo follows this list).
- Next Sentence Prediction (NSP): BERT learns the relationship between sentences. The model takes sentence pairs as input and tries to predict whether the second sentence follows the first in the original document.
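Here is that fill-mask demo, a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; BERT fills in the [MASK] token using the words on both sides of it.

```python
# Sketch: the masked-word idea via the Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the hidden word from the unmasked words on *both* sides.
for pred in fill_mask("I wake up at 7 [MASK] every morning."):
    print(f'{pred["token_str"]!r}  score={pred["score"]:.3f}')
```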
There are some advantages to using BERT in the field of NLP, including:
- The pre-trained BERT model is easy to fine-tune for various downstream NLP tasks (a minimal fine-tuning step is sketched after this list).
- Bidirectionality gives BERT a better understanding of the text's context.
- It is a popular model that has a lot of community support.
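As a sketch of that first point, assuming the Hugging Face transformers library and a made-up two-label sentiment task, fine-tuning amounts to loading the pre-trained encoder with a fresh classification head and training as usual.

```python
# Sketch: re-using pre-trained BERT for a downstream classification task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # e.g. positive / negative sentiment
)

batch = tokenizer(["What a great movie!"], return_tensors="pt")
labels = torch.tensor([1])

# One training step: only the small classification head is new; the rest of
# the weights start from the pre-trained BERT checkpoint.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
print(outputs.loss.item(), outputs.logits.shape)   # logits: (1, 2)
```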
There are, however, still some disadvantages, including:
- It requires high computational power and long training times to fine-tune for downstream tasks.
- The BERT model can be large and require considerably more storage.
- It is best reserved for complex tasks, since on simple tasks its performance is not much better than that of simpler models.
Conclusion
NLP has become more prominent recently, and much research has focused on improving its applications. In this article, we looked at three frequently used NLP techniques:
- RNN
- Transformers
- BERT
Each of these techniques has its advantages and disadvantages, but in general we can see how the models have evolved for the better, with each addressing the weaknesses of the previous one.
Cornellius Yudha Wijaya is an assistant data science manager and data writer. While working full-time at Allianz Indonesia, he loves sharing Python and data tips through social media and print media.