The recent success of instruction fine-tuning pre-trained large language models (LLMs) for downstream tasks has attracted great interest in the artificial intelligence (AI) community, because it allows models to be aligned with human preferences. To ensure that these refined models adequately represent human preferences, methods such as Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) have been developed.
In supervised fine-tuning (SFT), instructions are provided to pre-trained LLMs so that they can be customized for specific tasks. This not only ensures that they generate coherent responses, but also demonstrates how supervised learning lets these models adapt efficiently to different tasks.
Because the most sophisticated LLMs are enormous, with more than 100 billion parameters, many companies and individuals cannot afford the computational expense of SFT. Studies have shown that models with fewer parameters can perform well in some cases, even outperforming larger models. Traditionally, datasets containing a large number of human-created examples are used for fine-tuning, increasing the adaptability of the final models. However, creating these datasets is expensive and time-consuming, and commercial use of models trained on them is often restricted by licensing requirements.
In a recent study, a team of Surge Global researchers overcame these limitations by creating instruction-response pairs using open-source instruction-tuned models licensed for commercial use. Three techniques were used to generate the data, producing instruction datasets that can be used commercially.
GPT-4 was used as a human proxy model to further improve these datasets in terms of quality and diversity. Using QLoRA, SFT was then applied to the selected base model, generating three adapter models. The OpenBezoar model family consists of these models plus an additional alignment model.
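For readers who want a concrete picture of the QLoRA step, here is a minimal sketch using the Hugging Face `transformers`, `peft`, and `trl` libraries. The hyperparameters, dataset path, and output name are illustrative assumptions, not the authors' exact configuration:

```python
# Minimal QLoRA-style SFT sketch. Hyperparameters, the dataset file
# "instruction_data.json", and the output path are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

base_model = "openlm-research/open_llama_3b_v2"

# 4-bit NF4 quantization is the core of QLoRA: the frozen base weights are
# quantized while small LoRA adapters are trained in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Stands in for one of the three generated instruction datasets.
dataset = load_dataset("json", data_files="instruction_data.json", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # each record holds a formatted prompt + response
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("openbezoar-sft-adapter")  # only the small LoRA adapter is saved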
The objective of this work is to develop the OpenBezoar family of models by fine-tuning the OpenLLaMA 3Bv2 base model. The process involves the following steps:
- Data Generation: An open, commercially licensed, instruction-tuned version of the Falcon-40B model is used to generate synthetic instruction-tuning data. Three different techniques are used: LaMini-LM, WizardLM/Evol-Instruct (using databricks-dolly-15k as the seed dataset), and Orca (using the Flan Collection as the seed dataset); a sketch of an Evol-Instruct-style generation loop appears after this list.
- Data Filtering: To ensure quality and relevance, the generated data is filtered using GPT-4 as a human proxy (see the filtering sketch after this list).
- Supervised fine-tuning: Each scheme then undergoes a cost-effective QLoRA-based supervised fine-tuning process, in which a small set of adapter parameters is updated to improve performance on particular tasks, as in the QLoRA sketch above.
- Distribution shift minimization: To ensure that the model performs well on a variety of datasets, the supervised fine-tuned checkpoint is further refined on a subset of the HH-RLHF dataset.
- Direct Preference Optimization (DPO): Applying the DPO loss function produces the final checkpoint, "OpenBezoar-HH-RLHF-DPO". In this step, the model is aligned directly with human preferences, eliminating the need for a separate reward model; a minimal DPO sketch follows this list.
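To make the data generation step concrete, below is a rough sketch of what an Evol-Instruct-style loop over Falcon-40B-Instruct might look like. The prompt templates, sampling settings, and helper function `evolve_and_answer` are assumptions for illustration; the paper's exact prompts may differ:

```python
# Hedged sketch of Evol-Instruct-style synthetic data generation with an
# instruction-tuned Falcon model. Prompts and sampling settings are assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation", model="tiiuae/falcon-40b-instruct", device_map="auto"
)

EVOLVE_TEMPLATE = (
    "Rewrite the following instruction to make it more complex, "
    "while keeping it answerable:\n\nInstruction: {seed}\n\nRewritten instruction:"
)

def evolve_and_answer(seed_instruction: str) -> dict:
    # Step 1: evolve the seed instruction into a harder variant.
    evolved = generator(
        EVOLVE_TEMPLATE.format(seed=seed_instruction),
        max_new_tokens=128, do_sample=True, temperature=0.7,
    )
    instruction = evolved[0]["generated_text"].split("Rewritten instruction:")[-1].strip()

    # Step 2: have the same model answer the evolved instruction,
    # yielding a complete instruction-response pair.
    answer = generator(f"{instruction}\n\nAnswer:", max_new_tokens=256)
    response = answer[0]["generated_text"].split("Answer:")[-1].strip()
    return {"instruction": instruction, "response": response}

# Seeds would come from databricks-dolly-15k, as in the article.
pair = evolve_and_answer("Explain what a neural network is.")
```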
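The GPT-4 filtering step could look roughly like the following sketch, where the scoring rubric, the threshold, and the helper name `keep_pair` are illustrative assumptions rather than the authors' exact criteria:

```python
# Minimal sketch of GPT-4-as-human-proxy filtering. Rubric and threshold
# are assumptions, not the paper's exact criteria.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def keep_pair(instruction: str, response: str, threshold: int = 7) -> bool:
    """Ask GPT-4 to score an instruction-response pair from 1-10 and
    keep it only if the score clears the threshold."""
    prompt = (
        "Rate the following instruction-response pair for quality and "
        "relevance on a scale of 1 to 10. Reply with only the number.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        score = int(reply.choices[0].message.content.strip())
    except ValueError:
        return False  # unparseable ratings are discarded conservatively
    return score >= threshold
```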
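Finally, the DPO stage can be sketched with `trl`'s `DPOTrainer`. The checkpoint paths, the beta value, and the preprocessed dataset file are assumptions; DPO expects the HH-RLHF data reformatted into prompt/chosen/rejected columns:

```python
# Minimal DPO sketch. Paths, beta, and the preprocessed data file
# "hh_rlhf_pairs.json" are assumptions for illustration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

# Hypothetical local path to the HH-RLHF SFT checkpoint from the previous step.
model = AutoModelForCausalLM.from_pretrained("openbezoar-hh-rlhf-sft")
tokenizer = AutoTokenizer.from_pretrained("openbezoar-hh-rlhf-sft")

# Preference pairs preprocessed into "prompt", "chosen", and "rejected" columns.
# DPO optimizes the policy directly on these pairs, so no reward model is trained.
dataset = load_dataset("json", data_files="hh_rlhf_pairs.json", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,      # a frozen copy of `model` is used as the reference
    beta=0.1,            # strength of the KL penalty toward the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("openbezoar-hh-rlhf-dpo")
```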
The team has shared that the final checkpoint was evaluated on MT-Bench using the 'LLM-as-a-judge' framework with Claude 2.1, as well as on LM Eval Harness tasks. The results show that the 'OpenBezoar-HH-RLHF-DPO' checkpoint performs better than many models at the 3B parameter scale, even beating the top model on the Hugging Face Open LLM Leaderboard in one of the categories.
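As a rough illustration of the LM Eval Harness side of this evaluation, the sketch below uses the harness's Python API (lm-evaluation-harness v0.4+); the model repo id and task list are assumptions based on the article, not the authors' exact evaluation setup:

```python
# Hedged sketch of running LM Eval Harness tasks. The repo id and the
# task list (common Open LLM Leaderboard tasks) are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=SurgeGlobal/OpenBezoar-HH-RLHF-DPO",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "winogrande"],
    num_fewshot=0,
)
print(results["results"])  # per-task accuracy and related metrics
```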
The OpenBezoar-SFT, OpenBezoar-HH-RLHF-SFT, and OpenBezoar-HH-RLHF-DPO checkpoints have been published and can be accessed on Hugging Face.
Check out the Paper, HF Datasets, and Codebase. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our Newsletter.
Don't forget to join our 40k+ ML SubReddit.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.