Today, we are excited to announce AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 8B, 70B, and 405B sizes. In a previous post, we covered how to deploy Llama 3 models on AWS Trainium and Inferentia based instances in Amazon SageMaker JumpStart. In this post, we outline how to get started with fine-tuning and deploying the Llama 3.1 family of models on AWS AI chips, to realize their price-performance benefits.
Overview of the Llama 3.1 models
The Llama 3.1 family of multilingual generative models is a collection of pre-trained and instruction-tuned models in 8B, 70B, and 405B sizes (text in/text and code out). All models support a long context length (128K) and are optimized for inference with support for grouped query attention (GQA).
The Llama 3.1 instruction-tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the publicly available chat models on common industry benchmarks. They have been trained to generate tool calls for a few specific tools for capabilities such as search, image generation, code execution, and mathematical reasoning. In addition, they support zero-shot tool use.
Llama 3.1 405B is the world's largest publicly available large language model (LLM), according to Meta. The model sets a new standard for artificial intelligence (AI) and is ideal for enterprise-grade applications and research and development. It is well suited for tasks such as synthetic data generation, where the outputs of the model can be used to improve smaller Llama models after fine-tuning, and for model distillation to transfer knowledge from the 405B model to smaller models. The model excels at general knowledge, long-form text generation, multilingual translation, machine translation, coding, math, tool use, enhanced contextual understanding, and advanced reasoning and decision-making.
Architecturally, the core LLMs for Llama 3 and Llama 3.1 share the same dense architecture. They are auto-regressive language models that use an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety.
The Meta Responsible Use Guide can help you implement the additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.
Trainium powers Llama 3.1 on Amazon Bedrock and Amazon SageMaker
The fastest way to get started with Llama 3.1 on AWS is through Amazon Bedrock, which is powered by our purpose-built AI infrastructure, including AWS Trainium. Through its fully managed API, Amazon Bedrock delivers the benefits of our purpose-built AI infrastructure and simplifies access to these powerful models so you can focus on building differentiated AI applications.
If you need greater control over the underlying resources, you can fine-tune and deploy Llama 3.1 models with SageMaker. Trainium support for Llama 3.1 in SageMaker JumpStart is coming soon.
AWS Trainium and AWS Inferentia2 enable high performance and low cost for Llama 3.1 models
If you want to build your own ML pipelines for training and inference for greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let's look at how you can get started with the new Llama 3.1 8B and 70B models on Trainium using the AWS Neuron SDK.
Fine-tune Llama 3.1 on Trainium
To get started with fine-tuning Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library. NeuronX Distributed provides implementations of some of the more popular distributed training and inference techniques. To start fine-tuning, you can use the following samples:
Both samples are built on top of AWS ParallelCluster to manage the Trainium cluster infrastructure and Slurm for workload management. The following is the example Slurm command to initiate training for Llama 3.1 70B:
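A minimal sketch of such an sbatch invocation, assuming a 32-node Trn1 cluster and a launcher script named run_llama3.1_70B.sh (both values are illustrative):

```bash
# Submit the distributed training job to the Trainium cluster.
# The node count and launcher script name are illustrative assumptions.
sbatch --exclusive \
    --nodes 32 \
    --cpus-per-task 128 \
    --wrap="srun bash $(pwd)/run_llama3.1_70B.sh"
```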
Inside the Slurm script, we launch a distributed training process on our cluster. In the runner scripts, we load the pre-trained weights and configuration provided by Meta, and launch the training process:
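A rough sketch of what the runner script launches, modeled on the NeuronX Distributed training examples; the script name (run_llama_nxd.py), its flags, and the parallelism degrees below are assumptions, not the exact sample code:

```bash
# torchrun starts one training worker per NeuronCore on each node.
# Script name, flags, and values are illustrative assumptions: the runner
# points at Meta's pre-trained weights and config, sets the tensor/pipeline
# parallel degrees, and kicks off fine-tuning.
torchrun $DISTRIBUTED_ARGS run_llama_nxd.py \
    --training_dir "$DATA_PATH" \
    --pretrained_weight_dir "$LLAMA_WEIGHTS_PATH" \
    --tensor_parallel_size 8 \
    --pipeline_parallel_size 4 \
    --train_batch_size 1 \
    --max_steps 1000
```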
Deploy Llama 3.1 on Trainium or Inferentia
When your model is ready to deploy, you can do so by updating the model ID in the Llama 3 8B Neuron sample code above. For example, the code below deploys the model on an inf2.48xlarge instance.
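A minimal sketch of such a deployment using the SageMaker Python SDK with the Hugging Face Text Generation Inference (TGI) Neuronx container; the environment values (core count, sequence lengths, batch size) are illustrative assumptions:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face TGI container built for Inferentia/Trainium (Neuronx).
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")

model = HuggingFaceModel(
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B",  # updated model ID
        "HF_NUM_CORES": "24",            # NeuronCores available on inf2.48xlarge
        "HF_AUTO_CAST_TYPE": "bf16",
        "MAX_BATCH_SIZE": "4",           # illustrative serving limits
        "MAX_INPUT_LENGTH": "4000",
        "MAX_TOTAL_TOKENS": "4096",
    },
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    volume_size=512,  # room for compiled model artifacts
)
```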
You can use the same sample inference code:
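A sketch of invoking the endpoint; the prompt and generation parameters are illustrative, and the payload schema follows the TGI convention used by the container above:

```python
# Invoke the deployed endpoint with a prompt.
response = predictor.predict({
    "inputs": "What is AWS Trainium?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
})
print(response)
```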
For step-by-step details, see the new Llama 3.1 examples:
You can also use the Hugging Face Optimum Neuron library to quickly deploy models directly from SageMaker through the Hugging Face Model Hub. From the Llama 3.1 model card on the Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the example code into a SageMaker notebook, then choose Run.
Additionally, if you want to use vLLM to deploy the models, you can refer to the continuous batching guide to create the environment. After you create the environment, you can use vLLM to deploy Llama 3.1 8B and 70B models on AWS Trainium or Inferentia. Here is an example for deploying Llama 3.1 8B:
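A minimal sketch modeled on vLLM's offline inference example for Neuron devices; the serving limits and tensor parallel degree below are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Load Llama 3.1 8B on Neuron devices; limits and parallelism are illustrative.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=8,
)

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```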
Conclusion
AWS Trainium and Inferentia deliver high performance and low cost for fine-tuning and deploying Llama 3.1 models. We can't wait to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the model samples and tutorials in the AWS Neuron documentation.
About the authors
John Gray is a Senior Solutions Architect at Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.
Pinak Panigrahi works with customers to build machine learning-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.
Kamran Khan is the Director of Business Development for AWS Inferentia/Trainium at AWS. He has over a decade of experience helping customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium.
Sruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.