“Small” large language models (LLMs) are quickly becoming a game-changer in the field of artificial intelligence.
Unlike traditional LLMs that require significant computational resources, these models are much smaller and more efficient. While their performance does not match that of the largest models, they can easily run on standard devices such as laptops, and even on edge hardware. This also means they can be easily customized and fine-tuned on your own data set.
In this article, I will first explain the basic concepts and inner workings of the model tuning and alignment processes. Then, I'll walk you through the preference tuning of Phi-2, a small LLM with 2.7 billion parameters, using a novel approach called Direct Preference Optimization (DPO).
Thanks to the small size of the model and optimization techniques such as quantization and QLoRA, we will be able to perform this process in Google Colab using the free T4 GPU. This requires some adaptation of the settings and hyperparameters Hugging Face used to train its Zephyr 7B model.
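To make the idea concrete, here is a minimal sketch of how a model like Phi-2 can be loaded in 4-bit precision with a QLoRA adapter so that it fits on a free Colab T4 GPU. This is not the exact configuration used later in the article: the checkpoint name, LoRA hyperparameters, and target module names are illustrative assumptions and may need adjusting to your library versions.

```python
# Illustrative sketch: load a small LLM in 4-bit and attach a QLoRA adapter.
# Checkpoint name and hyperparameter values are assumptions, not the article's
# final settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the frozen base weights small enough for a T4's 16 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# QLoRA: train only small low-rank adapter matrices on top of the quantized, frozen model.
# The target module names below match the current Transformers Phi implementation;
# they may differ in other versions of the model code.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```

With the base weights quantized and only the adapter weights trained, the memory footprint drops enough for preference tuning to run on the free Colab tier.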
Table of Contents:
- Why we need tuning and the mechanics of direct preference optimization (DPO)
1.1. Why do we need to fine-tune an LLM?
1.2. What is DPO, and how does it compare to RLHF?
1.3. Why use DPO?
- How to implement DPO? - An overview of the key components of the DPO process
2.1. Hugging Face's Transformer Reinforcement Learning library (TRL)
2.2. Preparing the data set
2.3. Microsoft's Phi-2 model
- Step-by-step guide to tune Phi-2 on a T4 GPU
- Final thoughts
Why do we need to fine-tune an LLM?
Although very capable, large language models (LLMs) have their limitations, especially in handling recent information or the domain-specific knowledge captured in enterprise repositories. To address this, we have two options: