It is no exaggeration to say that ChatGPT-like models have had a revolutionary effect on the digital world. For this reason, the open-source AI community is working on projects (such as ChatLLaMa and Alpaca) that aim to make ChatGPT-style models more widely available. These models are extremely flexible and can perform tasks such as summarization, coding, and translation at or above human-expert levels.
Despite these impressive efforts, there is still no publicly available end-to-end RLHF pipeline capable of training a robust ChatGPT-like model. Even when access to such computing resources is available, training efficiency is often less than 5% of what these machines are capable of. Despite access to multi-GPU clusters, existing systems cannot support simple, fast, and inexpensive training of next-generation ChatGPT-like models with billions of parameters.
These restrictions stem from the fact that the sophisticated RLHF training flow used by InstructGPT is not well supported by existing DL systems, which are optimized for more conventional pretraining and fine-tuning flows. To make ChatGPT-like models more widely available and RLHF training more accessible, the Microsoft team is releasing DeepSpeed-Chat, which offers an end-to-end RLHF pipeline for training ChatGPT-like models. It has the following characteristics:
1. A convenient environment for training and inference of ChatGPT-like models: With the DeepSpeed-RLHF system, InstructGPT-style training can be run on a pretrained Hugging Face model with a single script, letting users produce their own ChatGPT-like model. Once the model is trained, an inference API can be used to test conversational interactions (an illustrative inference sketch follows this list).
2. The DeepSpeed-RLHF pipeline: The DeepSpeed-RLHF pipeline largely replicates the training pipeline from the InstructGPT paper. The team ensured a complete and exact correspondence between the three steps: a) supervised fine-tuning (SFT), b) reward model fine-tuning, and c) reinforcement learning with human feedback (RLHF). In addition, they provide tools for data abstraction and blending that allow training with data from multiple sources.
3. The DeepSpeed-RLHF system: The Hybrid Engine (DeepSpeed-HE) for RLHF is a powerful and sophisticated system that combines the training and inference capabilities of DeepSpeed. The hybrid engine can seamlessly switch between inference and training modes within RLHF, taking advantage of DeepSpeed-Inference optimizations such as tensor parallelism and high-performance transformer kernels for generation, as well as the memory-optimization strategies used during RLHF training, such as ZeRO and LoRA. DeepSpeed-HE is also aware of the entire RLHF process, which lets it further optimize memory management and data movement across the various stages of RLHF (a conceptual sketch of the alternation between generation and training follows this list). The DeepSpeed-RLHF system achieves unprecedented efficiency at scale, giving the AI community quick, cheap, and convenient access to training of complex RLHF models.
4. Efficiency and Affordability: Because DeepSpeed-HE is more than 15 times faster than conventional systems, RLHF training can be completed quickly and cheaply.
5. Excellent scalability: DeepSpeed-HE’s high scalability on multi-node, multi-GPU systems allows it to accommodate models with hundreds of billions of parameters.
6. Expanding access to RLHF training: DeepSpeed-HE allows data scientists without access to multi-GPU systems to build not only toy RLHF models but also large, powerful models that can be deployed in real-world environments, all with a single GPU for training.
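As a rough illustration of what testing a trained model conversationally can look like, a fine-tuned Hugging Face causal language model can be queried directly. This is not DeepSpeed-Chat's own inference API; the model path and prompt template below are assumptions for demonstration.

```python
# Illustrative sketch only: chatting with a fine-tuned ChatGPT-style model via
# Hugging Face Transformers. NOT DeepSpeed-Chat's inference API; the model path
# and prompt format below are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output/actor-model"  # hypothetical path to the trained actor model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "Human: What is DeepSpeed?\nAssistant:"  # assumed prompt template
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```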
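To see why a hybrid training/inference engine matters, note the shape of each RLHF iteration: it first generates responses (an inference-heavy workload) and then updates the actor on those responses (a training-heavy workload). The sketch below is a conceptual simplification, not DeepSpeed-HE's API: the reward is a dummy constant, and a REINFORCE-style update stands in for the full PPO objective.

```python
# Conceptual sketch of the two phases DeepSpeed-HE alternates between. NOT DeepSpeed's API:
# the reward is a dummy stand-in for a reward model's score, and a REINFORCE-style update
# replaces the full PPO objective.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
actor = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(actor.parameters(), lr=1e-5)

prompt = "Human: Explain RLHF in one sentence.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")

# --- Phase 1: experience generation (inference-heavy, benefits from inference kernels) ---
actor.eval()
with torch.no_grad():
    sequence = actor.generate(**inputs, max_new_tokens=32, do_sample=True,
                              pad_token_id=tokenizer.eos_token_id)

reward = torch.tensor(1.0)  # dummy stand-in for the reward model's score

# --- Phase 2: policy update (training-heavy, benefits from ZeRO/LoRA optimizations) ---
actor.train()
logits = actor(sequence).logits[:, :-1, :]
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, sequence[:, 1:].unsqueeze(-1)).squeeze(-1)
# For simplicity the prompt tokens are included in the update.
loss = -(reward * token_log_probs.mean())  # REINFORCE-style surrogate, not true PPO
loss.backward()
optimizer.step()
optimizer.zero_grad()
```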
The researchers embedded a complete end-to-end training pipeline in DeepSpeed-Chat and modeled it after InstructGPT to make the training process as efficient as possible.
The training process consists of three stages:
1. The pretrained language model is tuned using supervised fine-tuning (SFT), in which carefully selected human responses to various queries are used for fine-tuning.
2. Next, the team performs “reward model fine-tuning,” in which a separate model (RW, often smaller than the SFT model) is trained on a dataset containing human-provided rankings of multiple responses to the same query (a minimal sketch of the typical ranking loss follows this list).
3. Finally, in RLHF training, the Proximal Policy Optimization (PPO) algorithm is used to further fine-tune the SFT model with the reward feedback from the RW model (a PPO objective sketch also follows this list).
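The reward-model stage is typically trained with a pairwise ranking loss: for each query, the scalar score assigned to the human-preferred (“chosen”) response should exceed the score of the rejected one. Below is a minimal sketch of that loss; the example scores stand in for the outputs of the reward model's scalar head.

```python
# Minimal sketch of the pairwise ranking loss commonly used for reward-model fine-tuning.
# The scalar scores would come from the reward model's output head on the chosen and
# rejected responses; here they are illustrative tensors.
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

chosen = torch.tensor([1.7, 0.4])    # reward scores for human-preferred responses
rejected = torch.tensor([0.9, 0.6])  # reward scores for rejected responses
print(pairwise_ranking_loss(chosen, rejected))
```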
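In the final stage, the actor initialized from the SFT model is optimized against the reward model with PPO. The sketch below shows a clipped PPO surrogate with a KL penalty against the frozen SFT/reference policy; the coefficients and example tensors are illustrative assumptions, and pipelines differ in whether the KL term is added to the loss or folded into the reward.

```python
# Minimal sketch of a PPO-style objective for RLHF. The clipped surrogate and the KL
# penalty against a frozen reference (SFT) policy follow the usual recipe; coefficients
# and example tensors are illustrative, not DeepSpeed-Chat's defaults.
import torch

def ppo_loss(new_logprobs, old_logprobs, ref_logprobs, advantages,
             clip_eps: float = 0.2, kl_coef: float = 0.1) -> torch.Tensor:
    ratio = torch.exp(new_logprobs - old_logprobs)            # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()       # clipped PPO surrogate
    kl_penalty = (new_logprobs - ref_logprobs).mean()         # rough KL estimate vs. SFT policy
    return policy_loss + kl_coef * kl_penalty

# Illustrative per-token values for a single response of 4 tokens.
new_lp = torch.tensor([-1.0, -0.8, -1.2, -0.5], requires_grad=True)
old_lp = torch.tensor([-1.1, -0.9, -1.1, -0.6])
ref_lp = torch.tensor([-1.2, -1.0, -1.0, -0.7])
adv = torch.tensor([0.5, 0.2, -0.1, 0.4])
print(ppo_loss(new_lp, old_lp, ref_lp, adv))
```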
DeepSpeed-Chat is now accessible to the AI community thanks to its open-source release. On the DeepSpeed GitHub repository, the researchers invite users to report issues, submit PRs, and participate in discussions.
Check out the code. All credit for this research goes to the researchers of this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast and has a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.