Large language models (LLMs) have captured the imagination and attention of developers, scientists, technologists, entrepreneurs, and executives across various industries. These models can be used for question answering, summarization, translation, and more in applications such as conversational agents for customer service, content creation for marketing, and coding assistants.
Recently, Meta launched Llama 2 for both researchers and commercial entities, joining the list of other LLMs, including MosaicML MPT and Falcon. In this post, we walk through how to fine-tune Llama 2 on AWS Trainium, a purpose-built accelerator for LLM training, to reduce training times and costs. We review the fine-tuning scripts provided by the AWS Neuron SDK (using NeMo Megatron), the various configurations we used, and the throughput results we saw.
About the Llama 2 model
Similar to the previous Llama 1 model and other models like GPT, Llama 2 uses the decoder-only Transformer architecture. It comes in three sizes: 7 billion, 13 billion, and 70 billion parameters. Compared to Llama 1, Llama 2 doubles the context length from 2,000 to 4,000 and uses grouped query attention (only for the 70B variant). Llama 2's pre-trained models are trained on 2 trillion tokens, and its fine-tuned models have been trained on over 1 million human annotations.
Llama 2 distributed training
To accommodate Llama 2 at sequence lengths of 2,000 and 4,000, we implemented the training script using NeMo Megatron for Trainium, which supports data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). Specifically, with new implementations of features such as untied word embeddings, rotary positional embedding, RMSNorm, and SwiGLU activation, we extended the generic GPT script of Neuron NeMo Megatron to support the Llama 2 training script.
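To make the parallelism settings concrete, the sketch below shows how these degrees are typically expressed as Hydra-style overrides on a NeMo Megatron pretraining script. This is an illustration only, not the shipped example: the script name, the assumption of 32 workers per instance, and the specific values are ours, chosen to match the TP=8, PP=1 configuration used in the experiments later in this post.

```bash
# Illustrative sketch only: parallelism degrees as NeMo Megatron (Hydra) overrides.
# With 32 workers on one instance, TP=8 and PP=1 leave a data-parallel degree of
# 32 / (8 * 1) = 4, matching the configuration in the results table below.
python megatron_gpt_pretraining.py \
  trainer.num_nodes=1 \
  trainer.devices=32 \
  model.tensor_model_parallel_size=8 \
  model.pipeline_model_parallel_size=1 \
  model.global_batch_size=256 \
  model.encoder_seq_length=4096
```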
Our high-level training procedure is as follows. For the training environment, we use a multi-instance cluster managed by SLURM for distributed training, with the training jobs programmed under the NeMo framework.
First, download the Llama 2 model and training datasets, and preprocess them with the Llama 2 tokenizer. For example, to use the RedPajama dataset, use a command like the following:
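The exact command ships with the Neuron SDK example; the following is only a hedged sketch based on NeMo's standard preprocess_data_for_megatron.py utility, with placeholder paths, and the flag set may differ slightly between SDK versions.

```bash
# Hedged sketch: tokenize a RedPajama JSONL shard with the Llama 2 tokenizer.
# Paths are placeholders; see the Neuron tutorial for the exact arguments.
python nemo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
  --input=/path/to/redpajama/book.jsonl \
  --json-keys=text \
  --tokenizer-library=huggingface \
  --tokenizer-type=/path/to/llama2-tokenizer \
  --dataset-impl=mmap \
  --output-prefix=redpajama_llama2_tokenized \
  --append-eod \
  --workers=32
```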
For detailed guidance on downloading models and the preprocessing script arguments, see Download the LlamaV2 dataset and tokenizer.
Next, compile the model:
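The compile step is an ahead-of-time compilation pass that traces the training graphs and populates the Neuron compiler cache before any real training starts. As a hedged sketch, assuming the SLURM wrapper scripts from the example repository (the names compile.slurm and llama_7b.sh are illustrative):

```bash
# Hedged sketch: pre-compile the Llama 2 7B training graphs on a 4-node cluster.
# The wrapper scripts internally invoke the Neuron SDK's neuron_parallel_compile
# utility around the training command; script names may differ in your setup.
sbatch --nodes 4 compile.slurm ./llama_7b.sh
```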
After compiling the model, start the training job with the following script, which is already optimized with the best settings and hyperparameters for Llama 2 (included in the example code):
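As a hedged sketch, assuming the same wrapper-script convention as the compile step (run.slurm is an illustrative name), the training job is submitted to the cluster like this:

```bash
# Hedged sketch: launch the fine-tuning job with the pre-tuned Llama 2 7B settings.
# Once submitted, the SLURM output logs can be used to follow the job.
sbatch --nodes 4 run.slurm ./llama_7b.sh
```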
Lastly, we monitor training progress with TensorBoard:
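For example, TensorBoard can be pointed at the NeMo experiment directory on the head node (the log path below is a placeholder):

```bash
# Serve the training metrics written by NeMo/PyTorch Lightning (placeholder path).
tensorboard --logdir ./nemo_experiments --port 6006
# Then open http://localhost:6006, tunneling the port over SSH if needed.
```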
For the full example code and scripts mentioned here, see the Llama 7B tutorial and the NeMo code in the Neuron SDK for more detailed steps.
Fine-tuning experiments
We fine-tuned the 7B model on the OSCAR (Open Super-large Crawled ALMAnaCH coRpus) and QNLI (Question-answering NLI) datasets in a Neuron 2.12 environment (PyTorch). For each sequence length of 2,000 and 4,000, we optimized some configurations, such as batchsize and gradient_accumulation, for training efficiency. As a fine-tuning strategy, we adopted full fine-tuning of all parameters (about 500 steps), which can be extended to pre-training with longer steps and larger datasets (for example, the 1T-token RedPajama). Sequence parallelism can also be enabled to allow NeMo Megatron to successfully fine-tune models with a sequence length greater than 4,000. The following table shows the configuration and performance results of the Llama 7B fine-tuning experiments. Throughput scales almost linearly as the number of instances increases up to 4.
| Distributed library | Dataset | Sequence length | Number of instances | Tensor parallel | Data parallel | Pipeline parallel | Global batch size | Throughput (seq/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Neuron NeMo Megatron | OSCAR | 4096 | 1 | 8 | 4 | 1 | 256 | 3.7 |
| . | . | 4096 | 2 | 8 | 4 | 1 | 256 | 7.4 |
| . | . | 4096 | 4 | 8 | 4 | 1 | 256 | 14.6 |
| . | QNLI | 4096 | 4 | 8 | 4 | 1 | 256 | 14.1 |
The last step is to verify accuracy against the base model. We implemented a reference script for GPU experiments and confirmed that the training curves for GPU and Trainium match, as shown in the following figure. The figure shows the loss curves over the number of training steps on the QNLI dataset, using mixed precision on GPU (blue) and bf16 with default stochastic rounding on Trainium (orange).
Conclusion
In this post, we showed that Trainium delivers high-performance, cost-effective fine-tuning of Llama 2. For more resources on using Trainium for distributed pre-training and fine-tuning your generative AI models with NeMo Megatron, see AWS Neuron Reference for NeMo Megatron.
About the authors
Hao Zhou is a Research Scientist at Amazon SageMaker. Before that, he worked on developing machine learning methods for fraud detection for Amazon Fraud Detector. He is passionate about applying machine learning, optimization and generative artificial intelligence techniques to various real-world problems. He has a PhD in Electrical Engineering from Northwestern University.
Karthick Gopalswamy is an applied scientist at AWS. Prior to AWS, he worked as a scientist at Uber and Walmart Labs, primarily focusing on mixed integer optimization. At Uber, he focused on optimizing the public transportation network with ride-sharing and on-demand SaaS products. At Walmart Labs, he worked on pricing and packaging optimizations. Karthick holds a PhD in Industrial and Systems Engineering with a concentration in Operations Research from North Carolina State University. His research focuses on models and methodologies that combine operations research and machine learning.
Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests lie in the areas of natural language processing, explainable deep learning on tabular data, and robust nonparametric spatiotemporal clustering analysis. He has published numerous papers at ACL, ICDM, and KDD conferences, and in Royal Statistical Society: Series A.
Youngsuk Park is a Senior Applied Scientist at AWS Annapurna Labs, working on developing and training foundation models on AI accelerators. Prior to that, Dr. Park worked on R&D for Amazon Forecast at AWS AI Labs as a principal scientist. His research lies in the interplay between machine learning, foundation models, optimization, and reinforcement learning. He has published more than 20 peer-reviewed papers in leading venues, including ICLR, ICML, AISTATS, and KDD, and has served by organizing workshops and presenting tutorials in the area of time series and LLM training. Before joining AWS, he earned a PhD in Electrical Engineering from Stanford University.
Yida Wang is a principal scientist on the AWS AI team at Amazon. His research interests are in systems, high-performance computing, and big data analytics. He currently works on deep learning systems, with a focus on compiling and optimizing deep learning models for efficient training and inference, especially large-scale foundation models. His mission is to bridge high-level models from various frameworks and low-level hardware platforms, including CPUs, GPUs, and AI accelerators, so that different models can run with high performance on different devices.
Jun (Lucas) Huan is a Principal Scientist at AWS AI Labs. Dr. Huan works on AI and data science. He has published more than 160 peer-reviewed papers in leading conferences and journals and has graduated 11 PhD students. He received the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu Research as a distinguished scientist and director of the Baidu Big Data Laboratory. He founded StylingAI Inc., an artificial intelligence startup, and served as CEO and Chief Scientist from 2019 to 2021. Before joining industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas. From 2015 to 2018, he worked as a program director at the US NSF, in charge of its big data program.
Sruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.