Large language models (LLMs) such as BERT are typically pre-trained on domain-general corpora like Wikipedia and BookCorpus. If we apply them to a more specialized domain such as medicine, there is usually a drop in performance compared to models adapted for that domain.
In this article, we will explore how to adapt a pre-trained LLM such as a base DeBERTa model to the medical domain using the HuggingFace Transformers library. Specifically, we will cover an effective technique called intermediate pre-training, in which we perform additional pre-training of the LLM on data from our target domain. This adapts the model to the new domain and improves its performance.
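To make the idea concrete before we dive into the details, here is a minimal sketch of what intermediate pre-training could look like with the Transformers Trainer and a masked language modeling objective. The corpus file name (medical_corpus.txt), the checkpoint (microsoft/deberta-base), and the hyperparameters below are placeholder assumptions for illustration, not the exact setup used later in this article.

```python
# Minimal sketch of intermediate (continued) pre-training with masked language modeling.
# File name, checkpoint, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load a plain-text corpus of in-domain (medical) documents, one per line.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks tokens so the model is trained on the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="deberta-medical",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("deberta-medical")
```

The resulting checkpoint can then be fine-tuned on the downstream task exactly like any other pre-trained model.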
This is a simple but effective technique for tailoring LLMs to your domain and gaining significant improvements in downstream task performance.
Let us begin.
The first step in any project is to prepare the data. Since our dataset belongs to the medical domain, it contains the following fields, among many others:
Listing the complete set of fields here is not practical, since there are so many of them. But even this glimpse of the available fields helps us decide how to form the input sequence for an LLM.
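As a quick, hypothetical sketch of this inspection step (the file name medical_records.csv is a placeholder, not the article's actual dataset), you could look at the available fields like this:

```python
# Hypothetical inspection of the raw records; "medical_records.csv" is a placeholder name.
import pandas as pd

records = pd.read_csv("medical_records.csv")
print(records.columns.tolist())  # list all available fields
print(records.head(3))           # preview a few rows
```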
The first point to note is that the structured fields must be flattened into a single sequence, because LLMs read their input as sequences of text.
To turn this into a sequence, we can inject special tags that tell the LLM what kind of information comes next. Consider the following example: <patient>name:John, surname:Doe, patientID:1234, age:34</patient>
The <patient> tag is a special label that tells the LLM that what follows is information about a patient.
Then we form the input sequence as follows:
As you can see, we have injected four tags:
<patient> </patient>: contain…