Stepping out of the “comfort zone” — part 3/3 of a deep-dive into domain adaptation approaches for LLMs
Exploring how to adapt large language models (LLMs) to your specific domain or use case? This 3-part blog post series explains the motivation for domain adaptation and dives deep into the available options. It also provides a detailed guide to the entire domain adaptation journey, covering the most common tradeoffs.
Part 1: Introduction into domain adaptation — motivation, options, tradeoffs
Part 2: A deep dive into in-context learning
Part 3: A deep dive into fine-tuning — You’re here!
Note: All images, unless otherwise noted, are by the author.
In the previous part of this blog post series, we explored the concept of in-context learning as a powerful approach to overcome the “comfort zone” limitations of large language models (LLMs). We discussed how these techniques can be used to transform tasks and move them back into the models’ areas of expertise, leading to improved performance and alignment with the key design principles of Helpfulness, Honesty, and Harmlessness. In this third part, we will shift our focus to the second domain adaptation approach: fine-tuning. We will dive into the details of fine-tuning, exploring how it can be leveraged to expand the models’ “comfort zones” and hence uplift performance by adapting them to specific domains and tasks. We will discuss the trade-offs between prompt engineering and fine-tuning, and provide guidance on when to choose one approach over the other based on factors such as data velocity, task ambiguity, and other considerations.
Most state-of-the-art LLMs are powered by the transformer architecture, a family of deep neural network architectures that has disrupted the field of NLP since it was proposed by Vaswani et al. in 2017, breaking all common benchmarks across the domain. The core differentiator of this architecture family is a concept called “attention”, which excels at capturing the semantic meaning of words or larger pieces of natural language based on the context in which they are used.
The transformer architecture consists of two fundamentally different building blocks. On the one hand, the “encoder” block focuses on translating the semantics of natural language into so-called contextualized embeddings, which are mathematical representations in vector space. This makes encoder models particularly useful in use cases that utilize these vector representations for downstream deterministic or probabilistic tasks like classification problems, NER, or semantic search. On the other hand, the decoder block is trained on next-token prediction and is hence capable of generating text when used in a recursive manner, which makes it suitable for all tasks relying on text generation. These building blocks can be used independently of each other, but also in combination. Most of the models referred to within the field of generative AI today are decoder-only models, which is why this blog post focuses on this type of model.
Fine-tuning leverages transfer learning to efficiently inject niche expertise into a foundation model like LLaMA2. The process involves updating the model’s weights through training on domain-specific data, while keeping the overall network architecture unchanged. Unlike full pre-training which requires massive datasets and compute, fine-tuning is highly sample and compute efficient. On a high level, the end-to-end process can be broken down into the following phases:
- Data collection and selection: The set of proprietary data to be ingested into the model needs to be carefully selected. On top of that, for specific fine-tuning purposes the data might not be available yet and has to be purposely collected. Depending on the data available and the task to be achieved through fine-tuning, data of different quantitative or qualitative characteristics might be selected (e.g. labeled, unlabeled, or preference data; see below). Besides the data quality aspect, dimensions like data source, confidentiality and IP, licensing, copyright, PII and more need to be considered.
While LLM pre-training usually leverages a mix of web scrapes and curated corpora, the nature of fine-tuning as a domain adaptation approach implies that the datasets used are mostly curated corpora of labeled or unlabeled data specific to an organizational, knowledge, or task-specific domain.
While this data can be sourced differently (document repositories, human-created content, etc.), this underlines that for fine-tuning, it is important to carefully select the data with respect to quality, but as mentioned above, also consider topics like confidentiality and IP, licensing, copyright, PII, and others.
In addition to this, an important dimension is the categorization of the training dataset into unlabeled and labeled (including preference) data. Domain adaptation fine-tuning requires unlabeled textual data (as opposed to other fine-tuning approaches; see figure 4). In other words, we can simply use any full-text documents in natural language that we consider to be of relevant content and sufficient quality. This could be user manuals, internal documentation, or even legal contracts, depending on the actual use case.
On the other hand, labeled datasets like an instruction-context-response dataset can be used for supervised fine-tuning approaches. Lately, reinforcement learning approaches for aligning models to actual user feedback have shown great results, leveraging human- or machine-created preference data, e.g., binary human feedback (thumbs up/down) or multi-response ranking.
As opposed to unlabeled data, labeled datasets are more difficult and expensive to collect, especially at scale and with sufficient domain expertise. Open-source data hubs like HuggingFace Datasets can be good sources for labeled datasets, especially in areas where a broad share of the relevant human population agrees (e.g., a toxicity dataset for red-teaming) and an open-source dataset is a sufficient proxy for the preferences of the model’s real users.
Still, many use cases are more specific and open-source proxy datasets are not sufficient. This is when datasets labeled by real humans, potentially with significant domain expertise, are required. Tools like Amazon SageMaker Ground Truth can help with collecting the data, be it by providing fully managed user interfaces and workflows or the entire workforce.
Recently, synthetic data collection has become more and more a topic in the space of fine-tuning. This is the practice of using powerful LLMs to synthetically create labeled datasets, be it for SFT or preference alignment. Even though this approach has already shown promising results, it is currently still subject to further research and has to prove itself to be useful at scale in practice.
- Data pre-processing: The selected data needs to be pre-processed to make it “well digestible” for the downstream training algorithm. Popular pre-processing steps are the following:
- Quality-related pre-processing, e.g. formatting, deduplication, PII filtering
- Fine-tuning approach related pre-processing: e.g. rendering into prompt templates for supervised fine-tuning
- NLP-related pre-processing, e.g. tokenisation, embedding, chunking (according to context window)
- Model training: training of the deep neural network according to selected fine-tuning approach. Popular fine-tuning approaches we will discuss in detail further below are:
- Continued pre-training aka domain-adaptation fine-tuning: training on full-text data, alignment tied to a next-token-prediction task
- Supervised fine-tuning: fine-tuning approach leveraging labeled data, alignment tied towards the target label
- Preference-alignment approaches: fine-tuning approach leveraging preference data, aligning to a desired behaviour defined by the actual users of a model / system
Subsequently, we will dive deeper into the single phases, starting with an introduction to the training approach and different fine-tuning approaches before we move over to the dataset and data processing requirements.
In this section we will explore the approach for training decoder transformer models. This applies to pre-training as well as fine-tuning.
As opposed to traditional ML training approaches like unsupervised learning with unlabeled data or supervised learning with labeled data, the training of transformer models utilizes a hybrid approach referred to as self-supervised learning. Although the model is fed with unlabeled textual data, the algorithm intrinsically supervises itself by masking specific input tokens. Given the input sequence of tokens “Berlin is the capital of Germany.”, this naturally leads to a supervised sample with y being the masked token and x being the rest.
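To make this concrete, here is a tiny illustrative snippet (whitespace tokenization for readability only; real LLMs use subword tokenizers) showing how one unlabeled sentence is turned into supervised (x, y) samples:

```python
# Illustrative only: real LLMs use subword tokenizers (BPE, SentencePiece),
# not whitespace splitting, but the principle is identical.
sentence = "Berlin is the capital of Germany ."
tokens = sentence.split()

# Each prefix of the sequence becomes the input x,
# and the token that follows it becomes the label y.
samples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for x, y in samples:
    print(f"x = {x} -> y = '{y}'")
# e.g. x = ['Berlin', 'is', 'the', 'capital', 'of'] -> y = 'Germany'
```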
The above-mentioned self-supervised training approach optimizes the model weights towards a language modeling (LM) specific loss function. While encoder model training utilizes Masked Language Modeling (MLM), which leverages a bi-directional context by randomly masking tokens, decoder-only models are tied to a Causal Language Modeling (CLM) approach with a uni-directional context, always masking the rightmost token of a sequence. In simple words, this means they are trained to predict the subsequent token in an auto-regressive manner, using the previous tokens as semantic context. Beyond this, other LM approaches like Permutation Language Modeling (PLM) exist, where a model is conditioned to bring a sequence of randomly shuffled tokens back into sorted order.
By using the CLM task as a proxy, a prediction and a ground truth are created, which can be utilized to calculate the prediction loss. For this, the predicted probability distribution over all tokens of the model’s vocabulary is compared to the ground truth, a sparse vector with a probability of 1.0 for the token representing the ground truth. The actual loss function used depends on the specific model architecture, but loss functions like cross-entropy or perplexity, which perform well in categorical problem spaces like token prediction, are commonly used. With every training iteration, gradient descent via backpropagation through the deep neural network gradually minimizes this loss and thereby optimizes the model weights towards our training goal.
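As a minimal sketch, assuming the standard shift-by-one convention of causal language models, the loss computation for one batch could look as follows in PyTorch (random tensors stand in for actual model outputs):

```python
import torch
import torch.nn.functional as F

# Sketch of the causal LM loss for one batch.
# logits: model output of shape (batch, seq_len, vocab_size)
# input_ids: token ids of shape (batch, seq_len)
batch, seq_len, vocab_size = 2, 8, 32000
logits = torch.randn(batch, seq_len, vocab_size)   # stand-in for model(input_ids).logits
input_ids = torch.randint(0, vocab_size, (batch, seq_len))

# Shift so that position t predicts token t+1: the ground truth for each
# position is simply the next token in the sequence.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = input_ids[:, 1:].contiguous()

# Cross-entropy between the predicted distribution over the vocabulary
# and the one-hot ground-truth token; perplexity is just exp(loss).
loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
)
perplexity = torch.exp(loss)
print(loss.item(), perplexity.item())
```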
Enough of theory, let’s move into practice. Let’s assume you are an organization from the BioTech domain, aiming to leverage an LLM, let’s say LLaMA2, as a foundation model for various NLP use cases around COVID-19 vaccine research. Unfortunately, there are quite a few dimensions in which this domain is not part of the “comfort zone” of general-purpose off-the-shelf pre-trained LLMs, leading to performance being below your expected bar. In the next sections, we will discuss different fine-tuning approaches and how they can help elevate LLaMA2’s performance above the bar in various dimensions in our fictitious scenario.
As the headline indicates, while the field is starting to converge on the term “continued pre-training”, a definite term for the fine-tuning approach discussed in this section has yet to be agreed on by the community. But what is this fine-tuning approach really about?
Research papers in the BioTech domain are quite peculiar in writing style, full of domain-specific knowledge and industry- or even organisation-specific acronyms (e.g. Polack et al., 2020; see Figure 7).
On the other hand, a detailed look into the pre-training dataset mixtures of the Meta LLaMA models (Touvron et al., 2023; Figure 8) and the TII Falcon model family (Almazrouei et al., 2023; Figure 9) indicates that, at 2.5% and 2% respectively, general-purpose LLMs contain only a very small portion of data from the research or even BioTech domain (the pre-training data mixture of the LLaMA 3 family was not public at the time of blog publication).
Hence, we need to bridge this gap by utilizing fine-tuning to expand the model’s “comfort zone” for better performance on the specific tasks to carry out. Continued pre-training excels at exactly the above-mentioned dimensions. It involves the process of adjusting a pre-trained LLM on a specific dataset consisting of plain textual data. This technique is helpful for infusing domain-specific information like linguistic patterns (domain-specific language, acronyms, etc.) or information implicitly contained in raw full-text into the model’s parametric knowledge to align the model’s responses to fit this specific language or knowledge domain. For this approach, pre-trained decoder models are fine-tuned on next-token prediction using unlabeled textual data. This makes continued pre-training the most similar fine-tuning approach to pre-training.
In our example, we could use the content of the mentioned paper together with related literature from a similar field and convert it into a concatenated textual file. Depending on the tuning goal and other requirements, data curation steps like removal of unnecessary content (e.g., authors, tables of content, etc.), deduplication, or PII reduction can be applied. Finally, the dataset undergoes some NLP-specific preprocessing (e.g., tokenization, chunking according to the context window, etc. — see above), before it is used for training the model. The training itself is a classic CLM-based training as discussed in the previous section. After having adapted LLaMA2 with continued pre-training on a set of research publications from the BioTech domain, we can now utilize it in this specific domain as a text-completion model “BioLLaMA2.”
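For our BioLLaMA2 example, a minimal sketch of what this continued pre-training step could look like with the Hugging Face transformers and datasets libraries is shown below; the model id, the corpus file biotech_papers.txt, the block size, and the training hyperparameters are illustrative placeholders rather than a production recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"   # placeholder; gated model, requires access approval
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Plain-text corpus of concatenated research papers (hypothetical file name).
raw = load_dataset("text", data_files={"train": "biotech_papers.txt"})

def tokenize_and_chunk(batch, block_size=2048):
    # Tokenize, then concatenate everything and split into context-window-sized blocks.
    ids = tokenizer(batch["text"])["input_ids"]
    flat = [tok for seq in ids for tok in seq]
    blocks = [flat[i:i + block_size] for i in range(0, len(flat) - block_size + 1, block_size)]
    return {"input_ids": blocks}

train_ds = raw["train"].map(tokenize_and_chunk, batched=True, remove_columns=["text"])

# mlm=False selects the causal (next-token) objective used for continued pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biollama2", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1, bf16=True),
    train_dataset=train_ds,
    data_collator=collator,
)
trainer.train()
```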
Unfortunately, we humans don’t like to frame the problems we want to get solved in a pure text-completion/token-prediction form. Instead, we are a conversational species with a tendency towards chatty or instructive behavior, especially when we are aiming to get things done.
Hence, we require some sophistication beyond simple next-token prediction in the model’s behavior. This is where supervised fine-tuning approaches come into the game. Supervised fine-tuning (SFT) involves the process of aligning a pre-trained LLM on a specific dataset with labeled examples. This technique is essential for tailoring the model’s responses to fit particular domains or tasks, e.g., the above-mentioned conversational nature or instruction following. By training on a dataset that closely represents the target application, SFT allows the LLM to develop a deeper understanding and produce more accurate outputs in line with the specialized requirements and behaviour.
Beyond the above-mentioned ones, good examples of SFT can be the training of the model for Q&A, a data extraction task such as entity recognition, or red-teaming to prevent harmful responses.
As we understood above, SFT requires a labeled dataset. There are plenty of general-purpose labeled datasets available as open source; however, to tailor the model best to your specific use case, industry, or knowledge domain, it can make sense to manually craft a custom one. Recently, the approach of using powerful LLMs like Claude 3 or GPT-4 to craft such datasets has evolved as a resource- and time-effective alternative to human labelling.
The “dolly-15k” dataset is a popular general-purpose open-source instruct fine-tuning dataset manually crafted by Databricks’ employees. It consists of roughly 15k examples of an instruction and a context labeled with a desired response. This dataset could be used to align our BioLLaMA2 model towards following instructions, e.g. for a closed Q&A task. For SFT towards instruction following, we would proceed and convert every single item of the dataset into a full-text prompt, embedded into a prompt structure representing the task we want to align the model towards. This could look as follows:
### Instruction:
{item.instruction}
### Context:
{item.context}
### Response:
{item.response}
The prompt template can vary depending on the model family, as some models prefer HTML tags or other special characters over hashtags. This procedure is applied to every item of the dataset before all of them are concatenated into a large piece of text. Finally, after the above-explained NLP-specific preprocessing, this corpus can be used to train the model with next-token prediction and a CLM-based training objective. Since the model is consistently exposed to this specific prompt structure, it will learn to stick to it and act accordingly; in our case, this means following instructions. After aligning our BioLLaMA2 to the dolly-15k dataset, our BioLLaMA2-instruct model will thoroughly follow instructions submitted through the prompt.
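As an illustration, the following hedged sketch renders dolly-15k items into the prompt structure shown above using the datasets library; the dataset id and field names are those published on the Hugging Face Hub and may need adjusting.

```python
from datasets import load_dataset

# Dataset id and field names as published on the Hugging Face Hub.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

PROMPT_TEMPLATE = """### Instruction:
{instruction}

### Context:
{context}

### Response:
{response}"""

def render(item):
    # Render one instruction/context/response triple into the full-text prompt.
    return {"text": PROMPT_TEMPLATE.format(
        instruction=item["instruction"],
        context=item["context"],
        response=item["response"],
    )}

sft_corpus = dolly.map(render, remove_columns=dolly.column_names)
print(sft_corpus[0]["text"][:300])
# The rendered texts are then concatenated / packed and trained with the
# same CLM objective as in continued pre-training.
```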
With BioLLaMA2-instruct we have a model adapted to the BioTech research domain that follows our instructions in the way we expect. But wait: is the model really aligned with our actual users? This highlights a core problem with the fine-tuning approaches discussed so far. The datasets we have used are proxies for what we think our users like or need: the content, language, and acronyms from the selected research papers, as well as the desired instruct behavior of a handful of Databricks employees crafting dolly-15k. This contrasts with the concept of user-centric product development, one of the core and well-established principles of agile product development. Iteratively looping in feedback from actual target users has proven to be highly successful when developing great products. In fact, this is definitely something we want to do if we are aiming to build a great experience for our users!
With this in mind, researchers have put quite some effort into finding ways to incorporate human feedback into improving the performance of LLMs. On the path towards that, they realized a significant overlap with (deep) reinforcement learning (RL), which deals with autonomous agents performing actions in an action space within an environment, producing a next state, which is always coupled to a reward. The agents are acting based on a policy or a value-map, which has been gradually optimized towards maximizing the reward during the training phase.
This concept — projected into the world of LLMs — comes down to the LLM itself acting as the agent. During inference, with every step of its auto-regressive token-prediction nature, it performs an action, where the action space is the model’s vocabulary, and the environment is all possible token combinations. With every new inference cycle, a new state is established, which is honored with a reward that is ideally correlated with some human feedback.
Based on this idea, several human preference alignment approaches have been proposed and tested. In what follows, we will walk through some of the most important ones:
Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO)
Reinforcement learning from human feedback was one of the major hidden technical backbones of the early generative AI hype, giving the breakthroughs achieved with large decoder models like Anthropic Claude or GPT-3.5 an additional boost in the direction of user alignment.
RLHF works in a two-step process and is illustrated in Figures 13 and 14:
Step 1 (Figure 13): First, a reward model needs to be trained for later usage in the actual RL-powered training step. For this, a prompt dataset aligned with the optimization objective (in the case of our BioLLaMA2-instruct model, pairs of an instruction and a context) is fed to the model to be fine-tuned, requesting not just one but two or more inference results per prompt. These results are presented to human labelers for ranking (1st, 2nd, 3rd, …) based on the optimization objective. There are also a few open-source preference ranking datasets, among them “Anthropic/hh-rlhf”, which is tailored towards red-teaming and the objectives of honesty and harmlessness. After a normalization step and a translation into reward values, a reward model is trained on the single sample-reward pairs, where a sample is a single model response. The reward model architecture is usually similar to the model to be fine-tuned, adapted with a small head that projects the latent space into a reward value instead of a probability distribution over tokens. However, the ideal sizing of this model in parameters is still subject to research, and different approaches have been chosen by model providers in the past.
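To illustrate the mechanics, below is a minimal, hypothetical PyTorch sketch of such a reward model head together with the pairwise ranking loss commonly used for this step; the RewardHead class and the tensor shapes are illustrative stand-ins for the actual backbone LM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Small head projecting the LM's last hidden state to a scalar reward."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state):
        # Use the hidden state of the final token as the sequence summary.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)

# Hidden states for the preferred and rejected responses (stand-ins for the
# backbone LM's outputs on "prompt + response" sequences).
hidden = 4096
chosen_h = torch.randn(4, 128, hidden)
rejected_h = torch.randn(4, 128, hidden)

head = RewardHead(hidden)
r_chosen, r_rejected = head(chosen_h), head(rejected_h)

# Pairwise ranking loss: push the reward of the preferred answer above
# the reward of the rejected one (Bradley-Terry style objective).
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```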
Step 2 (Figure 14): Our new reward model is now used for training the actual model. For this, another set of prompts is fed through the model to be tuned (grey box in the illustration), resulting in one response each. Subsequently, these responses are fed into the reward model to retrieve the individual reward. Then, Proximal Policy Optimization (PPO), a policy-based RL algorithm, is used to gradually adjust the model’s weights in order to maximize the reward allocated to the model’s answers. As opposed to CLM, this approach leverages gradient ascent (or equivalently, gradient descent on the negated reward) since we are now trying to maximize an objective, the reward. For increased algorithmic stability, a prediction shift penalty is added to the reward term; it penalizes answers that diverge too much from the initial language model’s predicted probability distribution on the same input prompt and thereby prevents the heavy drifts in model behavior that RL-based approaches like PPO can cause during training.
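The following is a simplified, non-authoritative sketch of how this shaped reward (reward model score plus prediction shift penalty) could be computed for one generated answer; it is not a full PPO implementation, and the tensors and the beta coefficient are placeholders.

```python
import torch

# Sketch of the reward shaping used in RLHF with PPO (not a full PPO loop).
# logprobs_policy / logprobs_ref: per-token log-probs of the generated answer
# under the model being tuned and under the frozen initial model.
logprobs_policy = torch.randn(1, 32)
logprobs_ref = torch.randn(1, 32)
reward_model_score = torch.tensor([0.8])   # scalar reward for the full answer
beta = 0.1                                  # strength of the prediction shift penalty

# Per-token penalty for drifting away from the initial model's distribution.
kl_penalty = beta * (logprobs_policy - logprobs_ref)

# The shaped reward: penalty applied per token, reward model score added
# at the final token of the generated sequence.
shaped_rewards = -kl_penalty
shaped_rewards[:, -1] += reward_model_score

# PPO then maximizes the expected shaped reward (gradient ascent on its
# clipped surrogate objective) instead of minimizing a CLM loss.
```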
Beyond RLHF with PPO, which is currently the most widely adopted and proven approach to preference alignment, several other approaches have been developed. In the next couple of sections, we will dive deep into some of these approaches at an advanced level. This is for advanced readers only, so depending on your level of experience with deep learning and reinforcement learning, you might want to skip directly to the next section, “Decision flow chart — which model to choose, which fine-tuning path to pick”.
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a preference alignment approach derived from RLHF that tackles several of its major downsides:
- Training a reward model first is additional resource investment and can be significant depending on the reward model size
- The training phase of RLHF with PPO requires massive compute clusters since three replicas of the model (initial LM, tuned LM, reward model) need to be hosted and orchestrated simultaneously in a low latency setup
- RLHF can be an unstable procedure (→ prediction shift penalty tries to mitigate this)
DPO was proposed by Rafailov et al. in 2023 as an alternative preference alignment approach. The core idea of DPO is to skip the reward model training and tune the final preference-aligned LLM directly on the preference data. This is achieved by applying some mathematical tweaks that transform the reward-model parameterization (the reward term) into a loss function (figure 16), replacing the actual reward values with probabilities over the preference data.
This saves computational as well as algorithmic complexity on the way towards a preference-aligned model. While the paper also shows performance increases compared to RLHF, the approach is fairly recent and the results have yet to be proven at scale in practice.
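For the technically inclined, a minimal PyTorch sketch of the resulting DPO loss, following the formulation in Rafailov et al. (2023), could look as follows; the sequence log-probabilities are stand-in values.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from Rafailov et al. (2023): the implicit reward of a response
    is beta * log(pi_theta / pi_ref); the loss pushes the chosen response's
    implicit reward above the rejected one's, with no explicit reward model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Stand-in sequence log-probabilities (sum of token log-probs per response).
loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-55.0]),
                torch.tensor([-44.0]), torch.tensor([-54.0]))
print(loss.item())
```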
Kahneman-Tversky Optimization (KTO)
Existing methods for aligning language models with human feedback, such as RLHF and DPO, require preference data — pairs of outputs where one is preferred over the other for a given input. However, collecting high-quality preference data at scale is challenging and expensive in the real world. Preference data often suffers from noise, inconsistencies, and intransitivities, as different human raters may have conflicting views on which output is better. KTO was proposed by Ethayarajh et al. (2024) as an alternative approach that can work with a simpler, more abundant signal — just whether a given output is desirable or undesirable for an input, without needing to know the relative preference between outputs.
At a high level, KTO works by defining a reward function that captures the relative “goodness” of a generation, and then optimizing the model to maximize the expected value of this reward under a Kahneman-Tversky value function. Kahneman and Tversky’s prospect theory explains how humans make decisions about uncertain outcomes in a biased but well-defined manner. The theory posits that human utility depends on a value function that is concave in gains and convex in losses, with a reference point that separates gains from losses (see figure 17). KTO directly optimizes this notion of human utility, rather than just maximizing the likelihood of preferences.
The key innovation is that KTO only requires a binary signal of whether an output is desirable or undesirable, rather than full preference pairs. This allows KTO to be more data-efficient than preference-based methods, as the binary feedback signal is much more abundant and cheaper to collect. (see figure 18)
KTO is particularly useful in scenarios where preference data is scarce or expensive to collect, but you have access to a larger volume of binary feedback on the quality of model outputs. According to the paper, it can match or even exceed the performance of preference-based methods like DPO, especially at larger model scales. However, this needs to be validated at scale in practice. KTO may be preferable when the goal is to directly optimize for human utility rather than just preference likelihood. However, if the preference data is very high-quality with little noise or intransitivity, then preference-based methods could still be the better choice. KTO also has theoretical advantages in handling extreme data imbalances and avoiding the need for supervised fine-tuning in some cases.
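To give a rough feel for the structure of the objective, here is a heavily simplified, hypothetical sketch of the KTO value term; in the actual paper the reference point z_ref is estimated from a batch-level KL term, which is omitted here, and the function names are illustrative.

```python
import torch

def kto_value(policy_logps, ref_logps, desirable, z_ref, beta=0.1,
              lambda_d=1.0, lambda_u=1.0):
    """Simplified sketch of the KTO value for a single output.
    reward: implicit reward (log ratio of policy to reference model);
    z_ref:  reference point (estimated from a batch-level KL term in the paper).
    Desirable outputs are rewarded for exceeding the reference point,
    undesirable ones for falling below it (asymmetric, prospect-theory style)."""
    reward = policy_logps - ref_logps
    if desirable:
        return lambda_d * torch.sigmoid(beta * (reward - z_ref))
    return lambda_u * torch.sigmoid(beta * (z_ref - reward))

# Binary feedback only: each example is just (output, thumbs up / thumbs down).
v_good = kto_value(torch.tensor(-40.0), torch.tensor(-42.0), True,  z_ref=0.0)
v_bad  = kto_value(torch.tensor(-38.0), torch.tensor(-45.0), False, z_ref=0.0)

# The training loss maximizes the expected value, i.e. minimizes (lambda - v).
loss = (1.0 - v_good) + (1.0 - v_bad)
```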
Odds Ratio Preference Optimization (ORPO)
The key motivation behind ORPO is to address the limitations of existing preference alignment methods, such as RLHF and DPO, which often require a separate supervised fine-tuning (SFT) stage, a reference model, or a reward model. The paper by Hong et al. (2024) argues that SFT alone can inadvertently increase the likelihood of generating tokens in undesirable styles, as the cross-entropy loss does not provide a direct penalty for the disfavored responses. At the same time, they claim that SFT is vital for converging into powerful preference alignment models. This leads to a resource-intensive two-stage alignment process. By combining these stages into one, ORPO aims to preserve the domain adaptation benefits of SFT while concurrently discerning and mitigating unwanted generation styles, as targeted by preference-alignment approaches. (see figure 19)
ORPO introduces a novel preference alignment algorithm that adds an odds ratio-based penalty to the conventional causal language modeling loss (e.g., cross-entropy). The objective function of ORPO consists of two components: the SFT loss and the odds ratio loss (L_OR). The L_OR term maximizes the odds ratio between the likelihood of generating the favored response and the disfavored response, effectively penalizing the model for assigning high probabilities to the rejected responses.
ORPO is particularly useful when you want to fine-tune a pre-trained language model to adapt to a specific domain or task while ensuring that the model’s outputs align with human preferences. It can be applied in scenarios where you have access to a pairwise preference dataset (y_w = favored, y_l = disfavored), such as the UltraFeedback or HH-RLHF datasets. With this in mind, ORPO is designed to be a more efficient and effective alternative to RLHF and DPO, as it does not require a separate reference model, a reward model, or a two-step fine-tuning approach.
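A simplified sketch of the ORPO objective, under the assumption that logp_chosen and logp_rejected are length-normalized sequence log-likelihoods of the favored and disfavored responses, could look like this:

```python
import torch
import torch.nn.functional as F

def orpo_loss(sft_loss, logp_chosen, logp_rejected, lam=0.1):
    """Sketch of the ORPO objective (Hong et al., 2024): the usual SFT
    cross-entropy loss on the favored response plus an odds-ratio penalty.
    odds(y|x) = P(y|x) / (1 - P(y|x)); the penalty maximizes the odds
    ratio of the favored over the disfavored response."""
    def log_odds(logp):
        # log( p / (1 - p) ) computed from log p in a numerically stable way
        return logp - torch.log1p(-torch.exp(logp))

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -F.logsigmoid(ratio)
    return sft_loss + lam * l_or

loss = orpo_loss(sft_loss=torch.tensor(1.9),
                 logp_chosen=torch.tensor(-0.8),
                 logp_rejected=torch.tensor(-2.3))
```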
After diving deep into plenty of fine-tuning approaches, the obvious question arises as to which model to start with and which approach to pick best based on specific requirements. The approach for picking the right model for fine-tuning purposes is a two-step approach. The first step is very similar to picking a base model without any fine-tuning intentions, including considerations alongside the following dimensions (not exhaustive):
- Platform to be used: Every platform comes with a set of models accessible through it. This needs to be taken into consideration. Please note that region-specific differences in model availability can apply. Please check the respective platform’s documentation for more information on this.
- Performance: Organizations should aim to use the leanest model for a specific task. While no generic guidance on this can be given and fine-tuning can significantly uplift a model’s performance (smaller fine-tuned models can outperform larger general-purpose models), leveraging evaluation results of base models can be helpful as an indicator.
- Budget (TCO): In general, larger models require more compute and potentially multi-GPU instances for training and serving across multiple accelerators. This has a direct impact on factors like training and inference cost, complexity of training and inference, resources and skills required, etc., as part of TCO along a model’s entire lifecycle. This needs to be aligned with the short- and long-term budget allocated.
- Licensing model: Models, whether proprietary or open-source, come with licensing constraints depending on the domain of usage and the commercial model to be used. This needs to be taken into account.
- Governance, Ethics, Responsible AI: Every organisation has compliance guidelines alongside these dimensions. This needs to be considered in the model choice.
Example: An organisation might decide to consider LLaMA 2 models and rule out the usage of proprietary models like Anthropic Claude or AI21Labs Jurassic based on evaluation results of the base models. Further, they decide to only use the 7B-parameter version of this model to be able to train and serve them on single GPU instances.
The second step is concerned with narrowing down the initial selection to one or a few models to be taken into consideration for the experimentation phase. The final decision on which specific approach to choose depends on the desired entry point into the fine-tuning lifecycle of language models illustrated in the figure below.
Thereby, the following dimensions need to be taken into consideration:
- Task to be performed: Different use cases require specific model behaviour. While for some use cases a simple text-completion model (next-token prediction) might be sufficient, most use cases require task-specific behaviour like chattiness or instruction-following. To meet this requirement, we can take a working-backwards approach from the desired task to be performed. This means we need to define our specific fine-tuning journey so that it ends at a model aligned to this specific task. With regard to the illustration, this implies that the journey must end in the blue, orange, or green circle matching the desired model behaviour, while the fine-tuning journey itself is defined along the possible paths of the flow diagram.
- Choose the right starting point (as long as reasonable): While we should be very clear on where our fine-tuning journey should end, we can start anywhere in the flow diagram by picking a respective base model. This however needs to be reasonable — in times of model hubs with millions of published models, it can make sense to check if the fine-tuning step has not already been performed by someone else who shared the resulting model, especially when considering popular models in combination with open-source datasets.
- Fine-tuning is an iterative, potentially recursive process: It is possible to perform multiple subsequent fine-tuning jobs on the way to our desired model. However, please note that catastrophic forgetting is something we need to keep in mind as models can’t encode an infinite amount of information in their weights. To mitigate this, you can leverage parameter-efficient fine-tuning approaches like LoRA as shown in this paper and blog.
- Task-specific performance uplift targeted: Fine-tuning is performed to uplift a model’s performance on a specific task. If we are looking for a performance uplift in linguistic patterns (domain-specific language, acronyms, etc.) or information implicitly contained in our training data, continued pre-training is the right choice. If we want to uplift performance towards a specific task, supervised fine-tuning should be chosen. If we want to align our model’s behaviour with our actual users, human preference alignment is the right choice.
- Data availability: The available training data will also influence which path we choose. In general, organisations hold larger amounts of unlabelled textual data than labelled data, and acquiring labelled data can be an expensive task. This dimension needs to be taken into consideration when navigating through the flow chart.
With this working backwards approach alongside the above flow chart we can identify the model to start with and the path to take while traversing the fine-tuning flow diagram.
To make this a bit more tangible, we provide two examples:
Example 1: Following the example illustrated in the fine-tuning section above, suppose we want an instruct model for our specific use case, aligned to our actual users’ preferences, with uplifted performance in the BioTech domain. Unlabelled data in the form of research papers is available. We choose the LLaMA-2-7b model family as the desired starting point. Since Meta has not published a LLaMA-2-7b instruct model, we start from the text-completion model LLaMA-2-7b-base. Then we perform continued pre-training on the corpus of research papers, followed by supervised fine-tuning on an open-source instruct dataset like dolly-15k. This results in an instruct-fine-tuned BioTech version of LLaMA-2-7b-base, which we call BioLLaMA-2-7b-instruct. In the next step, we want to align the model to our actual users’ preferences. We collect a preference dataset, train a reward model, and use RLHF with PPO to preference-align our model.
Example 2: In this example we are aiming to use a chat model, again aligned to our actual users’ preferences. We choose the LLaMA-2-7b model family as the desired starting point. We find that Meta provides an off-the-shelf chat-fine-tuned model, LLaMA-2-7b-chat, which we can use as a starting point. In the next step, we want to align the model to our actual users’ preferences. We collect a preference dataset from our users, train a reward model, and use RLHF with PPO to preference-align our model.
Generative AI has many exciting use cases for businesses and organizations. However, these applications are usually much more complex than individual consumer uses like generating recipes or speeches. For companies, the AI needs to understand the organization’s specific domain knowledge, processes, and data. It must integrate with existing enterprise systems and applications. And it needs to provide a highly customized experience for different employees and roles while acting in a harmless way. To successfully implement generative AI in an enterprise setting, the technology must be carefully designed and tailored to the unique needs of the organization. Simply using a generic, publicly-trained model won’t be sufficient.
In this blog post we discussed how domain adaptation can help bridge this gap by overcoming situations where a model is confronted with tasks outside of its “comfort zone”. With in-context learning and fine-tuning, we dived deep into two powerful approaches for domain adaptation. Finally, we discussed the tradeoffs to consider when deciding between these approaches.
Successfully bridging this gap between powerful AI capabilities and real-world business requirements is key to unlocking the full potential of generative AI for companies.