The Qwen 2.5 multilingual large language models (LLMs) are a collection of pretrained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform previous generations of Qwen models, as well as many publicly available chat models, on common industry benchmarks.
At its core, Qwen 2.5 is an autoregressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports more than 29 languages and has improved role-playing and condition-setting capabilities for chatbots.
In this post, we describe how to get started deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, with the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.
Preparation
Hugging Face provides two tools that are frequently used when running models on AWS Inferentia and AWS Trainium: the Text Generation Inference (TGI) container, which provides support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.
The first time a model is run on Inferentia or Trainium, it is compiled to ensure that a version exists that will perform optimally on the Inferentia and Trainium chips. The Hugging Face Optimum Neuron library, together with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you are using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying it. For more information, see Compiling a model for Inferentia or Trainium.
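If you do need to compile ahead of time, the Optimum Neuron CLI can export a model for Neuron. The following is a minimal sketch, assuming optimum-neuron is installed on the instance; the batch size, sequence length, core count, cast type, and output directory are illustrative values to adapt to your instance:

```bash
# Export (compile) the model for Neuron ahead of deployment; values are illustrative
optimum-cli export neuron \
  --model Qwen/Qwen2.5-7B-Instruct \
  --batch_size 4 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  ./qwen2.5-7b-neuron/
```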
You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.
Option 1: Deploy TGI on Amazon EC2 Inf2
In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this post for detailed instructions on how to launch an instance using the Hugging Face DLAMI.)
For this option, SSH into the instance and create a .env file (where you will define your constants and specify where your model is cached) and a file called docker-compose.yaml (where you will define all of the environment parameters you will need to deploy your model for inference). You can copy the following files for this use case.
- Create a .env file with the following content:
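The exact values depend on the model size and instance type. A minimal sketch for Qwen2.5-7B-Instruct is shown below; the variable names match the ones referenced in the docker-compose.yaml sketch that follows, and the values are assumptions to adjust for your deployment:

```bash
# .env – constants consumed by docker-compose.yaml (values are illustrative)
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096
```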
- Create a file called docker-compose.yaml with the following content:
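The following is an illustrative sketch, assuming the ghcr.io/huggingface/neuronx-tgi image, port 8080, and a single Neuron device (as on inf2.xlarge); verify the image tag, environment variables, and device mapping against the current Hugging Face Neuron documentation before deploying:

```yaml
version: '3.7'
services:
  tgi:
    image: ghcr.io/huggingface/neuronx-tgi:latest   # TGI image built for Inferentia/Trainium
    ports:
      - "8080:8080"
    environment:
      - PORT=8080
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2                              # inf2.xlarge exposes two Neuron cores
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
    devices:
      - "/dev/neuron0"                              # expose the Neuron device to the container
```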
- Use Docker Compose to deploy the model:
```bash
docker compose -f docker-compose.yaml --env-file .env up
```
- To confirm that the model was deployed correctly, send a test prompt to the model:
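Assuming the container is listening on port 8080, as in the docker-compose.yaml sketch above, you can call TGI's generate endpoint with curl; the prompt and generation parameters are only examples:

```bash
# send a short English prompt to the TGI /generate endpoint
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}}'
```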
- To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
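The same endpoint can be used with a prompt written in Chinese; the prompt below, which asks for a short introduction to large language models, is just an example:

```bash
# send a prompt in Chinese to confirm multilingual output
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"请用中文简单介绍一下大语言模型。","parameters":{"max_new_tokens":128}}'
```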
Option 2: Deploy TGI on Amazon SageMaker
You can also use the Hugging Face Optimum Neuron library to quickly deploy models directly from SageMaker, using the instructions on the Hugging Face Hub model card.
- From the Qwen 2.5 model card on the Hugging Face Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.
- Copy the example code into a SageMaker notebook, then choose Run.
- The notebook you copied will look like the following:
Clean up
Be sure to terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.
Terminate EC2 instances through the AWS Management Console.
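If you prefer the command line, the AWS CLI can terminate the instance as well; the instance ID below is a placeholder:

```bash
# terminate the Inf2 instance (replace the placeholder with your instance ID)
aws ec2 terminate-instances --instance-ids <your-instance-id>
```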
Delete a SageMaker endpoint through the console or with the following commands:
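One way to do this outside of the notebook is with the AWS CLI; the endpoint, endpoint configuration, and model names below are placeholders for the resources SageMaker created for your deployment:

```bash
# delete the endpoint, its configuration, and the model (names are placeholders)
aws sagemaker delete-endpoint --endpoint-name <your-endpoint-name>
aws sagemaker delete-endpoint-config --endpoint-config-name <your-endpoint-config-name>
aws sagemaker delete-model --model-name <your-model-name>
```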
Conclusion
AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We are excited to see how you will use these powerful models and our purpose-built AI infrastructure to create differentiated applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.
About the authors
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as with the Hugging Face team. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business goals, setting them up for scalable growth and innovation in the competitive startup world.
Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is an expert in generative BI. Rhia holds a bachelor's degree in information science from the University of Maryland.
Paul Aiuto is a Senior Solutions Architect Manager focused on startups at AWS. Paul built a team of AWS Startup Solutions Architects who focus on the adoption of Inferentia and Trainium. Paul holds a degree in computer science from Siena College and has multiple cybersecurity certifications.