Today, we are excited to announce AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 8B, 70B, and 405B sizes. In a previous post, we covered how to deploy Llama 3 models on AWS Trainium and Inferentia based instances in Amazon SageMaker JumpStart. In this post, we outline how to get started with fine-tuning and deploying the Llama 3.1 family of models on AWS AI chips, to realize their price-performance benefits.
Overview of the Llama 3.1 models
The Llama 3.1 family of multilingual generative models is a collection of pre-trained and instruction-tuned models in 8B, 70B, and 405B sizes (text in/text and code out). All models support a long context length (128K) and are optimized for inference with support for grouped query attention (GQA).
The Llama 3.1 instruction-tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the publicly available chat models on common industry benchmarks. They have been trained to generate tool calls for a few specific tools for capabilities such as search, image generation, code execution, and mathematical reasoning. In addition, they support zero-shot tool use.
Llama 3.1 405B is the world's largest publicly available large language model (LLM), according to Meta. The model sets a new standard for artificial intelligence (AI) and is ideal for enterprise-grade applications and research and development. It is well suited for tasks such as synthetic data generation, where the outputs of the model can be used to improve smaller Llama models after fine-tuning, and for model distillation to transfer knowledge from the 405B model to smaller models. The model excels at general knowledge, long-form text generation, multilingual translation, machine translation, coding, math, tool use, enhanced contextual understanding, and advanced reasoning and decision-making.
Architecturally, the core LLMs for Llama 3 and Llama 3.1 share the same dense architecture. They are auto-regressive language models that use an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety.
The Meta Responsible Use Guide can help you implement the additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.
Trainium powers Llama 3.1 on Amazon Bedrock and Amazon SageMaker
The fastest way to get started with Llama 3.1 on AWS is through Amazon Bedrock, which is powered by our purpose-built AI infrastructure, including AWS Trainium. Through its fully managed API, Amazon Bedrock delivers the benefits of our purpose-built AI infrastructure and simplifies access to these powerful models so you can focus on building differentiated AI applications.
If you need greater control over the underlying resources, you can fine-tune and deploy Llama 3.1 models with SageMaker. Trainium support for Llama 3.1 in SageMaker JumpStart is coming soon.
AWS Trainium and AWS Inferentia2 enable high performance and low cost for Llama 3.1 models
If you want to build your own ML pipelines for training and inference for greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let's look at how you can get started with the new Llama 3.1 8B and 70B models on Trainium using the AWS Neuron SDK.
Fine-tune Llama 3.1 on Trainium
To get started with fine-tuning Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library. NeuronX Distributed provides implementations of some of the more popular distributed training and inference techniques. To start fine-tuning, you can use the following samples:
Both samples are built on top of AWS ParallelCluster to manage the Trainium cluster infrastructure and Slurm for workload management. The following is the example Slurm command to initiate training for Llama 3.1 70B:
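A minimal sketch of such an sbatch invocation, assuming a 32-node Trn1 cluster and a launcher script named run_llama3.1_70B.sh (both values are illustrative):

```bash
# Submit the distributed training job to the Trainium cluster.
# The node count and launcher script name are illustrative assumptions.
sbatch --exclusive \
    --nodes 32 \
    --cpus-per-task 128 \
    --wrap="srun bash $(pwd)/run_llama3.1_70B.sh"
```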
Inside the Slurm script, we launch a distributed training process on our cluster. In the runner scripts, we load the pre-trained weights and configuration provided by Meta, and launch the training process:
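A rough sketch of what the runner script launches, modeled on the NeuronX Distributed training examples; the script name (run_llama_nxd.py), its flags, and the parallelism degrees below are assumptions, not the exact sample code:

```bash
# torchrun starts one training worker per NeuronCore on each node.
# Script name, flags, and values are illustrative assumptions: the runner
# points at Meta's pre-trained weights and config, sets the tensor/pipeline
# parallel degrees, and kicks off fine-tuning.
torchrun $DISTRIBUTED_ARGS run_llama_nxd.py \
    --training_dir "$DATA_PATH" \
    --pretrained_weight_dir "$LLAMA_WEIGHTS_PATH" \
    --tensor_parallel_size 8 \
    --pipeline_parallel_size 4 \
    --train_batch_size 1 \
    --max_steps 1000
```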
Deploy Llama 3.1 on Trainium or Inferentia
When your model is ready to deploy, you can do so by updating the model ID in the Llama 3 8B Neuron sample code above. For example, the code below deploys the model on an inf2.48xlarge instance.
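A minimal sketch of such a deployment using the SageMaker Python SDK with the Hugging Face Text Generation Inference (TGI) Neuronx container; the environment values (core count, sequence lengths, batch size) are illustrative assumptions:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face TGI container built for Inferentia/Trainium (Neuronx).
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")

model = HuggingFaceModel(
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B",  # updated model ID
        "HF_NUM_CORES": "24",            # NeuronCores available on inf2.48xlarge
        "HF_AUTO_CAST_TYPE": "bf16",
        "MAX_BATCH_SIZE": "4",           # illustrative serving limits
        "MAX_INPUT_LENGTH": "4000",
        "MAX_TOTAL_TOKENS": "4096",
    },
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    volume_size=512,  # room for compiled model artifacts
)
```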
You can use the same sample inference code:
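A sketch of invoking the endpoint; the prompt and generation parameters are illustrative, and the payload schema follows the TGI convention used by the container above:

```python
# Invoke the deployed endpoint with a prompt.
response = predictor.predict({
    "inputs": "What is AWS Trainium?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
})
print(response)
```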
For step-by-step details, see the new Llama 3.1 examples:
You can also use the Hugging Face Optimum Neuron library to quickly deploy models directly from SageMaker through the Hugging Face Model Hub. From the Llama 3.1 model card on the Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the example code into a SageMaker notebook, then choose Run.
Additionally, if you want to use vLLM to deploy the models, you can refer to the continuous batching guide to create the environment. After you create the environment, you can use vLLM to deploy Llama 3.1 8B and 70B models on AWS Trainium or Inferentia. Here is an example for deploying Llama 3.1 8B:
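A minimal sketch modeled on vLLM's offline inference example for Neuron devices; the serving limits and tensor parallel degree below are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Load Llama 3.1 8B on Neuron devices; limits and parallelism are illustrative.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=8,
)

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```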
Conclusion
AWS Trainium and Inferentia deliver high performance and low cost for fine-tuning and deploying Llama 3.1 models. We can't wait to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the model samples and tutorials in the AWS Neuron documentation.
About the authors
John Gray is a Senior Solutions Architect at Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.
Pinak Panigrahi works with customers to build machine learning-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.
Kamran Khan is the Director of Business Development for AWS Inferentia/Trainium at AWS. He has over a decade of experience helping customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium.
Sruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.