The Qwen 2.5 multilingual large language models (LLMs) are a collection of pretrained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform previous generations of Qwen models, as well as many publicly available chat models, on common industry benchmarks.
At its core, Qwen 2.5 is an autoregressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports more than 29 languages and has improved role-playing and condition-setting capabilities for chatbots.
In this post, we describe how to get started deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, with the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.
Preparation
Hugging Face provides two tools that are frequently used when running models on AWS Inferentia and AWS Trainium: the Text Generation Inference (TGI) container, which provides support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.
The first time a model is run on Inferentia or Trainium, it is compiled to ensure that a version exists that will perform optimally on the Inferentia and Trainium chips. The Hugging Face Optimum Neuron library, together with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you are using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying it. For more information, see Compiling a model for Inferentia or Trainium.
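If you do need to compile ahead of time, the Optimum Neuron CLI can export a model for Neuron. The following is a minimal sketch, assuming optimum-neuron is installed on the instance; the batch size, sequence length, core count, cast type, and output directory are illustrative values to adapt to your instance:

```bash
# Export (compile) the model for Neuron ahead of deployment; values are illustrative
optimum-cli export neuron \
  --model Qwen/Qwen2.5-7B-Instruct \
  --batch_size 4 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  ./qwen2.5-7b-neuron/
```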
You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.
Option 1: Deploy TGI on Amazon EC2 Inf2
In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this post for detailed instructions on how to launch an instance using the Hugging Face DLAMI.)
For this option, SSH into the instance and create a .env file (where you will define your constants and specify where your model is cached) and a file called docker-compose.yaml (where you will define all of the environment parameters you will need to deploy your model for inference). You can copy the following files for this use case.
- Create a .env file with the following content:
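The exact values depend on the model size and instance type. A minimal sketch for Qwen2.5-7B-Instruct is shown below; the variable names match the ones referenced in the docker-compose.yaml sketch that follows, and the values are assumptions to adjust for your deployment:

```bash
# .env – constants consumed by docker-compose.yaml (values are illustrative)
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096
```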
- Create a file called docker-compose.yaml with the following content:
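The following is an illustrative sketch, assuming the ghcr.io/huggingface/neuronx-tgi image, port 8080, and a single Neuron device (as on inf2.xlarge); verify the image tag, environment variables, and device mapping against the current Hugging Face Neuron documentation before deploying:

```yaml
version: '3.7'
services:
  tgi:
    image: ghcr.io/huggingface/neuronx-tgi:latest   # TGI image built for Inferentia/Trainium
    ports:
      - "8080:8080"
    environment:
      - PORT=8080
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2                              # inf2.xlarge exposes two Neuron cores
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
    devices:
      - "/dev/neuron0"                              # expose the Neuron device to the container
```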
- Use Docker Compose to deploy the model:
```bash
docker compose -f docker-compose.yaml --env-file .env up
```
- To confirm that the model was deployed correctly, send a test prompt to the model:
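Assuming the container is listening on port 8080, as in the docker-compose.yaml sketch above, you can call TGI's generate endpoint with curl; the prompt and generation parameters are only examples:

```bash
# send a short English prompt to the TGI /generate endpoint
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}}'
```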
- To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
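The same endpoint can be used with a prompt written in Chinese; the prompt below, which asks for a short introduction to large language models, is just an example:

```bash
# send a prompt in Chinese to confirm multilingual output
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"请用中文简单介绍一下大语言模型。","parameters":{"max_new_tokens":128}}'
```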
Option 2: Deploy TGI on Amazon SageMaker
You can also use the Hugging Face Optimum Neuron library to quickly deploy models directly from SageMaker, using the instructions on the Hugging Face Hub model card.
- From the Qwen 2.5 model card on the Hugging Face Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.
- Copy the example code into a SageMaker notebook, then choose Run.
- The notebook you copied will look like the following:
Clean up
Be sure to terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.
Terminate EC2 instances through the AWS Management Console.
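If you prefer the command line, the AWS CLI can terminate the instance as well; the instance ID below is a placeholder:

```bash
# terminate the Inf2 instance (replace the placeholder with your instance ID)
aws ec2 terminate-instances --instance-ids <your-instance-id>
```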
Delete a SageMaker endpoint through the console or with the following commands:
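One way to do this outside of the notebook is with the AWS CLI; the endpoint, endpoint configuration, and model names below are placeholders for the resources SageMaker created for your deployment:

```bash
# delete the endpoint, its configuration, and the model (names are placeholders)
aws sagemaker delete-endpoint --endpoint-name <your-endpoint-name>
aws sagemaker delete-endpoint-config --endpoint-config-name <your-endpoint-config-name>
aws sagemaker delete-model --model-name <your-model-name>
```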
Conclusion
AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We are excited to see how you will use these powerful models and our purpose-built AI infrastructure to create differentiated applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.
About the authors
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as with the Hugging Face team. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business goals, setting them up for scalable growth and innovation in the competitive startup world.
Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is an expert in generative BI. Rhia holds a bachelor's degree in information science from the University of Maryland.
Paul Aiuto is a Senior Solutions Architect Manager focused on startups at AWS. Paul built a team of AWS Startup Solutions Architects who focus on the adoption of Inferentia and Trainium. Paul holds a degree in computer science from Siena College and has multiple cybersecurity certifications.