Large language models (LLMs) such as BERT are typically pre-trained on domain-general corpora like Wikipedia and BookCorpus. If we apply them to a more specialized domain such as medicine, there is usually a drop in performance compared to models adapted for that domain.
In this article, we will explore how to adapt a pre-trained LLM such as a base DeBERTa model to the medical domain using the HuggingFace Transformers library. Specifically, we will cover an effective technique called intermediate pre-training, in which we perform additional pre-training of the LLM on data from our target domain. This adapts the model to the new domain and improves its performance.
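To make the idea concrete before we dive into the details, here is a minimal sketch of what intermediate pre-training could look like with the Transformers Trainer and a masked language modeling objective. The corpus file name (medical_corpus.txt), the checkpoint (microsoft/deberta-base), and the hyperparameters below are placeholder assumptions for illustration, not the exact setup used later in this article.

```python
# Minimal sketch of intermediate (continued) pre-training with masked language modeling.
# File name, checkpoint, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load a plain-text corpus of in-domain (medical) documents, one per line.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks tokens so the model is trained on the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="deberta-medical",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("deberta-medical")
```

The resulting checkpoint can then be fine-tuned on the downstream task exactly like any other pre-trained model.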
This is a simple but effective technique for tailoring LLMs to your domain and gaining significant improvements in downstream task performance.
Let us begin.
The first step in any project is to prepare the data. Since our dataset belongs to the medical domain, it contains the following fields, among many others:
Listing the complete set of fields here is not practical, since there are so many of them. But even this glimpse of the available fields helps us decide how to form the input sequence for an LLM.
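As a quick, hypothetical sketch of this inspection step (the file name medical_records.csv is a placeholder, not the article's actual dataset), you could look at the available fields like this:

```python
# Hypothetical inspection of the raw records; "medical_records.csv" is a placeholder name.
import pandas as pd

records = pd.read_csv("medical_records.csv")
print(records.columns.tolist())  # list all available fields
print(records.head(3))           # preview a few rows
```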
The first point to note is that the structured fields must be flattened into a single sequence, because LLMs read their input as sequences of text.
To turn this into a sequence, we can inject special tags that tell the LLM what kind of information comes next. Consider the following example: <patient>name:John, surname:Doe, patientID:1234, age:34</patient>
The <patient> tag is a special label that tells the LLM that what follows is information about a patient.
Then we form the input sequence as follows:
As you can see, we have injected four tags:
<patient> </patient>: contain…