The use of large language models (LLMs) and generative AI has exploded over the past year. With the release of powerful, publicly available foundation models, the tools to train, fine-tune, and host your own LLM have also become democratized. Using <a href="https://docs.vllm.ai/en/stable/index.html" target="_blank" rel="noopener">vLLM</a> on AWS Trainium and Inferentia makes it possible to host LLMs for high-performance inference and scalability.
In this post, we will show you how to quickly deploy Meta's latest Llama models using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we will use the 1B version, but other sizes can be deployed by following these steps, along with other popular LLMs.
Deploy vLLM on AWS Trainium and Inferentia EC2 instances
In the following sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta's latest Llama 3.2 model. You will learn how to request access to the model, create a Docker container to deploy the model with vLLM, and how to run online and offline inference on the model. We will also discuss tuning the performance of the inference graph.
Prerequisite: Hugging Face account and model access
To use the meta-llama/Llama-3.2-1B model, you will need a Hugging Face account and access to the model. Go to the model card, sign up, and accept the model license. You will then need a Hugging Face token, which you can obtain by following these steps. When you get to the Save your access token screen, as shown in the following figure, be sure to copy the token because it will not be shown again.
Create an EC2 instance
You can create an EC2 instance by following the guide. Some things to keep in mind:
- If this is your first time using Inf/Trn instances, you will need to request a quota increase.
- You will use inf2.xlarge as your instance type. inf2.xlarge instances are only available in these AWS Regions.
- Increase the gp3 volume to 100 GB.
- You will use Deep Learning AMI Neuron (Ubuntu 22.04) as your AMI, as shown in the following figure.
Once the instance is launched, you can connect to it to access the command line. In the next step, you will use Docker (preinstalled on this AMI) to run a vLLM container image for Neuron.
Start the vLLM server
You will use Docker to create a container with all the tools necessary to run vLLM. Create a Dockerfile using the following command:
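A minimal sketch of such a Dockerfile, assuming a Neuron PyTorch Deep Learning Container as the base image and a from-source vLLM build as described in the vLLM Neuron installation guide (the base image tag and build steps shown here are assumptions; check the current Neuron container list and vLLM docs), might look like this:

```bash
# Sketch only: base image tag and build steps are assumptions based on the
# vLLM Neuron installation guide; verify against current documentation.
cat > Dockerfile <<'EOF'
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04

# Build vLLM from source with the Neuron backend
RUN git clone https://github.com/vllm-project/vllm.git /opt/vllm
WORKDIR /opt/vllm
RUN pip install -U -r requirements-neuron.txt && \
    VLLM_TARGET_DEVICE=neuron pip install -e .

# Port used later by the OpenAI-compatible vLLM server
EXPOSE 8000
CMD ["/bin/bash"]
EOF
```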
Then run:
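For example, assuming you tag the image vllm-neuron (the tag is an arbitrary choice used in the sketches in this post):

```bash
# Build the container image from the Dockerfile above
docker build -t vllm-neuron .
```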
Building the image will take about 10 minutes. When it's done, use the new Docker image (replace YOUR_TOKEN_HERE with your Hugging Face token):
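A sketch of that run command, assuming the vllm-neuron image tag from the build step, a container named vllm, and the HF_TOKEN environment variable (recognized by huggingface_hub) to pass the token:

```bash
# Sketch: expose the Inferentia device, forward port 8000 (used later for inference),
# and pass the Hugging Face token; image and container names are assumptions.
docker run -it --name vllm \
  --device /dev/neuron0 \
  -p 8000:8000 \
  -e HF_TOKEN="YOUR_TOKEN_HERE" \
  vllm-neuron bash
```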
You can now start the vLLM server with the following command:
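This is the same invocation that appears again in the performance tuning section later in this post:

```bash
vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32
```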
This command runs vLLM with the following parameters:
- serve meta-llama/Llama-3.2-1B: The Hugging Face model ID of the model being deployed for inference.
- --device neuron: Configures vLLM to run on the Neuron device.
- --tensor-parallel-size 2: Sets the number of partitions for tensor parallelism. inf2.xlarge has 1 Neuron device, and each Neuron device has 2 NeuronCores.
- --max-model-len 4096: Set to the maximum sequence length (input tokens plus output tokens) that the model is compiled for.
- --block-size 8: For Neuron devices, this is internally set to max-model-len.
- --max-num-seqs 32: Set to the hardware batch size or a desired level of concurrency that the model server needs to handle.
The first time you load a model, if there is no previously compiled model, it will need to be compiled. The compiled model can optionally be saved, so the compilation step is not necessary if the container is recreated. Once everything is done and the model server is running, you should see the following logs:
This means that the model server is running, but it is not yet processing requests because none have been received. You can now detach from the container by pressing ctrl + p and ctrl + q.
Inference
When you started the Docker container, you ran it with the -p 8000:8000 option. This told Docker to forward port 8000 from the container to port 8000 on your local machine. When you run the following command, you should see that the model server with meta-llama/Llama-3.2-1B is running.
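For example, you can query the server's OpenAI-compatible model list endpoint (piping through jq is optional and only pretty-prints the JSON):

```bash
# List the models served by the vLLM OpenAI-compatible API
curl localhost:8000/v1/models | jq
```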
This should return something like:
Now, send it a prompt:
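This is the same completion request that appears again in the performance tuning section; jq extracts just the generated text:

```bash
curl localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature": 0, "max_tokens": 128}' \
  | jq '.choices[0].text'
```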
You should receive a response similar to the following from vLLM:
Offline inference with vLLM
Another way to use vLLM on Inferentia is to send several requests at the same time from a script. This is useful for automation or when you have a batch of prompts that you want to send all at once.
You can reattach to your Docker container and stop the online inference server with the following:
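For example, assuming the container name used in the earlier run sketch:

```bash
# Reattach to the running container (the container name is an assumption)
docker attach vllm
```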
At this point, you should see a blank cursor. Press ctrl + c to stop the server, and you should be returned to a bash prompt inside the container. Create a file to use the offline inference engine:
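A minimal sketch of such a script using vLLM's offline LLM API; the prompts and sampling settings here are illustrative, while the engine arguments mirror the flags used for the online server above:

```python
# offline_inference.py -- minimal sketch of vLLM's offline API on Neuron.
# Prompts and sampling settings are illustrative, not taken from the original post.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Engine arguments mirror the online server settings used earlier in this post
llm = LLM(
    model="meta-llama/Llama-3.2-1B",
    device="neuron",
    tensor_parallel_size=2,
    max_model_len=4096,
    block_size=8,
    max_num_seqs=32,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}\n")
```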
Now run the script with python offline_inference.py, and you should receive responses to all four prompts. This may take a minute, because the model needs to be loaded again.
Now you can type exit and press Return, and then press ctrl + c to shut down the Docker container and return to your Inf2 instance.
Clean up
Now that you have finished testing the Llama 3.2 1B LLM, you should terminate your EC2 instance to avoid additional charges.
Performance tuning for variable sequence lengths
You will probably need to process variable-length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of the buckets. To tune performance based on the length of input and output tokens in your inference requests, you can set two kinds of buckets, corresponding to the two phases of LLM inference, through the following environment variables as a list of integers:
- NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated lengths of prompts during inference.
- NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of two within your generation length.
You can use the docker run command to set these environment variables when starting the vLLM server (remember to replace YOUR_TOKEN_HERE with your Hugging Face token):
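A sketch of that command, reusing the image and container name assumptions from the earlier run command; the bucket values shown are illustrative only:

```bash
# Sketch: same image/container assumptions as before; remove the previous container
# first (docker rm -f vllm) if it still exists. Bucket values are examples only.
docker run -it --name vllm \
  --device /dev/neuron0 \
  -p 8000:8000 \
  -e HF_TOKEN="YOUR_TOKEN_HERE" \
  -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,2048,4096" \
  -e NEURON_TOKEN_GEN_BUCKETS="128,256,512,1024" \
  vllm-neuron bash
```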
You can then start the server using the same command:
vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32
Because the model graph has changed, the model will need to be compiled again. If the container was terminated, the model will be downloaded again. You can then detach from the container by pressing ctrl + p and ctrl + q and send a request using the same command:
curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen ai?", "temperature":0, "max_tokens": 128}' | jq '.choices(0).text'
For more information on how to configure buckets, see the Neuron SDK's bucketing developer guide. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate in the documentation, and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.
Conclusion
You have just seen how to deploy meta-llama/Llama-3.2-1B using vLLM on an Amazon EC2 Inf2 instance. If you are interested in deploying other popular Hugging Face LLMs, you can replace the model ID in the vllm serve command. More details about the integration between the Neuron SDK and vLLM can be found in the Neuron user guide for continuous batching and the <a href="https://docs.vllm.ai/en/latest/getting_started/neuron-installation.html" target="_blank" rel="noopener">vLLM Guide for Neuron</a>.
Once you've identified a model you want to use in production, you will want to deploy it with auto scaling, observability, and fault tolerance. You can also check this <a href="http://amazon.com/blogs/machine-learning/deploy-meta-llama-3-1-8b-on-aws-inferentia-using-amazon-eks-and-vllm/" target="_blank" rel="noopener">blog post</a> to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post in this series, we will discuss using Amazon EKS with Ray Serve to deploy vLLM in production with auto scaling and observability.
About the authors
Omri Shiva is an open source machine learning engineer focused on helping customers on their AI/ML journey. In his free time, he enjoys cooking, tinkering with open source software and hardware, and listening to and playing music.
Pink Panigrahi works with customers to build ML-based solutions that solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.