This is a guest post by AK Roy from Qualcomm AI.
Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-effectively deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.
With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can also use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used in smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.
Highlights of the new DL2q instance
Each DL2q instance incorporates eight Qualcomm Cloud AI 100 accelerators, with an aggregate of more than 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has a total of 112 AI cores, an accelerator memory capacity of 128 GB, and an accelerator memory bandwidth of 1.1 TB per second.
Each DL2q instance has 96 vCPUs, a system memory capacity of 768 GB, and supports 100 Gbps of network bandwidth as well as 19 Gbps of Amazon Elastic Block Store (Amazon EBS) bandwidth.
| Instance name | vCPUs | Cloud AI 100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage bandwidth (Amazon EBS) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps |
Qualcomm Cloud AI 100 Accelerator Innovation
The Cloud AI 100 accelerator system-on-chip (SoC) is a purpose-built, scalable multicore architecture that supports a wide range of deep learning use cases spanning from the data center to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-chip SRAM capacity of 126 MB. The cores are interconnected with a low-latency, high-bandwidth network-on-chip (NoC) mesh.
The AI 100 accelerator supports a broad and comprehensive range of models and use cases. The following table highlights the range of model support.
| Model category | Number of models | Examples |
| --- | --- | --- |
| NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE |
| Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen |
| Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP |
| CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet |
| CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet |
| CV – Other | 15 | LPRNet, Super Resolution/SRGAN, ByteTrack |
| Automotive networks* | 53 | LIDAR, pedestrian, lane, and traffic light detection |
| Total | >300 | |
* Most automotive networks are composite networks consisting of an amalgamation of individual networks.
The large SRAM built into the DL2q accelerator enables efficient implementation of advanced performance techniques, such as MX6 micro-exponent precision for storing weights and MX9 micro-exponent precision for inter-accelerator communication. Micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.
The instance user can use the following strategy to maximize performance per cost:
- Store weights using MX6 micro-exponent precision in the accelerator’s DDR memory. Using MX6 precision maximizes the utilization of the available memory capacity and memory bandwidth to deliver best-in-class performance and latency.
- Compute in FP16 to deliver the accuracy the use case requires, while using the large on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
- Use an optimized batching strategy and a larger batch size by using the large on-chip SRAM to maximize the reuse of weights, while keeping activations on-chip as much as possible.
DL2q AI Stack and Toolchain
The DL2q instance is accompanied by the Qualcomm AI Stack, which offers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI Stack and base AI technology run on DL2q instances and Qualcomm edge devices, giving customers a consistent developer experience with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.
The toolchain allows the instance user to quickly ingest a pre-trained model, build and optimize the model for the instance’s capabilities, and then deploy the built models to production inference use cases in three steps shown in the following figure.
For more information about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters documentation.
Get started with DL2q instances
In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using an available pre-built DL2q AMI, in four steps.
You can use a pre-built Qualcomm DLAMI on your instance, or start with an Amazon Linux 2 AMI and build your own DL2q AMI using the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.
Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI and follow steps 1 through 4.
Step 1. Configure the environment and install the necessary packages
- Install Python 3.8.
- Set up the Python 3.8 virtual environment.
- Activate the Python 3.8 virtual environment.
- Install the necessary packages, listed in the requirements.txt file of the Model-Onboarding-Beginner tutorial on Qualcomm’s public GitHub site.
- Import the necessary libraries. (A minimal sketch of this setup follows this list.)
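The exact package list comes from the requirements.txt file linked above. As a point of reference, here is a minimal sketch of what the setup and imports can look like; the virtual environment name and the specific packages shown are illustrative assumptions.

```python
# Minimal sketch of Step 1. Assumes the packages from the tutorial's
# requirements.txt (for example, torch, transformers, and numpy) are installed
# into an activated Python 3.8 virtual environment, for example:
#   python3.8 -m venv qaic_env
#   source qaic_env/bin/activate
#   pip install -r requirements.txt
import os

import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hugging Face model card used throughout this walkthrough.
model_card = "bert-base-cased"
```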
Step 2. Import the model
- Import and tokenize the model.
- Define a sample input and extract the `inputIds` and `attentionMask`.
- Convert the model to ONNX, which can then be passed to the compiler.
- The model will run at FP16 precision, so check whether the model contains any constants beyond the FP16 range. Pass the model to the `fix_onnx_fp16` function to generate a new ONNX file with the necessary corrections. (A sketch covering these steps follows this list.)
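For concreteness, here is a sketch of what these sub-steps can look like with the Hugging Face transformers API and torch.onnx.export, building on the imports from Step 1. The export settings (input names, dynamic axes, opset) and the commented-out fix_onnx_fp16 call are assumptions; the actual helper ships with the tutorial utilities on Qualcomm’s GitHub site.

```python
# Import and tokenize the model (model_card is defined in Step 1).
model = AutoModelForMaskedLM.from_pretrained(model_card)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_card)

# Define a sample input and extract the inputIds and attentionMask.
sentence = f"The dog {tokenizer.mask_token} on the mat."
encoding = tokenizer(sentence, max_length=128, padding="max_length",
                     truncation=True, return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]

# Convert the model to ONNX so it can be passed to the compiler.
os.makedirs(f"{model_card}/generatedModels", exist_ok=True)
onnx_path = f"{model_card}/generatedModels/{model_card}.onnx"
torch.onnx.export(
    model,
    (input_ids, attention_mask),
    onnx_path,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "seq_len"},
        "attention_mask": {0: "batch_size", 1: "seq_len"},
    },
    opset_version=13,
)

# The model runs at FP16 precision on the accelerator, so constants outside the
# FP16 range must be fixed before compiling. fix_onnx_fp16 is provided by the
# tutorial utilities; its import path and signature here are assumptions.
# from onnx_helpers import fix_onnx_fp16
# fixed_onnx_path = fix_onnx_fp16(onnx_path)
```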
Step 3. Compile the model
The `qaic-exec` command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in Step 2. The compiler produces a binary file (called a QPC, for Qualcomm program container) in the path defined by the `-aic-binary-dir` argument.
In the following compile command, you use four AI compute cores and a batch size of one to compile the model.
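The following is a sketch of that compile invocation, wrapped in subprocess so the walkthrough stays in Python. Only the -aic-binary-dir argument is named in this post; the qaic-exec path, the remaining flags, and the name of the FP16-corrected ONNX file are recalled from the Cloud AI 100 SDK tutorials and should be verified against the SDK installed on your instance.

```python
import subprocess

# FP16-corrected ONNX file from Step 2 and the target QPC output directory.
# The ONNX file name is inferred from the QPC directory name; adjust as needed.
fixed_onnx = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx"
qpc_dir = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"

compile_cmd = [
    "/opt/qti-aic/exec/qaic-exec",       # qaic-exec CLI installed with the SDK
    f"-m={fixed_onnx}",                  # input ONNX model
    "-aic-hw",                           # target the Cloud AI 100 hardware
    "-convert-to-fp16",                  # run the model at FP16 precision
    "-onnx-define-symbol=batch_size,1",  # batch size of one
    "-onnx-define-symbol=seq_len,128",   # sequence length used during tokenization
    "-aic-num-cores=4",                  # four AI compute cores
    "-compile-only",
    f"-aic-binary-dir={qpc_dir}",        # where the QPC binary is written
]
subprocess.run(compile_cmd, check=True)
```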
The QPC is generated in the `bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc` folder.
Step 4. Run the model
Configure a session to run inference on a Qualcomm Cloud AI 100 accelerator in the DL2q instance.
The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI 100 accelerator.
- Use the Session API call to create a session instance. The Session API call is the entry point to using the qaic Python library.
- Restructure the output buffer data with `output_shape` and `output_type`.
- Decode the produced output. (A sketch of these steps follows this list.)
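Putting those calls together, here is a sketch that uses the qaic Python package. The Session keyword arguments, the shape/type lookup attributes, and the programqpc.bin file name follow the Cloud AI 100 SDK tutorial examples and may differ between SDK versions, so treat the exact names as assumptions.

```python
import numpy as np
import qaic  # Qualcomm Cloud AI 100 Python inference library

# Path to the QPC produced in Step 3 (binary file name assumed from the SDK tutorials).
qpc_path = ("bert-base-cased/generatedModels/"
            "bert-base-cased_fix_outofrange_fp16_qpc/programqpc.bin")

# Create a session instance; the Session API call is the entry point to the library.
bert_sess = qaic.Session(model_path=qpc_path, num_activations=1)
bert_sess.setup()  # load the network onto the accelerator

# Query the buffer shapes and dtypes the compiled program expects.
input_shape, input_type = bert_sess.model_input_shape_dict["input_ids"]
attn_shape, attn_type = bert_sess.model_input_shape_dict["attention_mask"]
output_shape, output_type = bert_sess.model_output_shape_dict["logits"]

# Reuse the tokenized sample from Step 2, cast to the dtypes the QPC expects.
input_dict = {
    "input_ids": input_ids.numpy().astype(input_type),
    "attention_mask": attention_mask.numpy().astype(attn_type),
}

# Run inference on the Cloud AI 100 accelerator, then restructure the raw output
# buffer with output_shape and output_type.
raw_output = bert_sess.run(input_dict)
logits = np.frombuffer(raw_output["logits"], dtype=output_type).reshape(output_shape)

# Decode the produced output: take the highest-scoring token at the masked position.
mask_index = int((input_ids[0] == tokenizer.mask_token_id).nonzero()[0])
predicted_id = int(logits[0, mask_index].argmax())
print(tokenizer.decode([predicted_id]))
```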
Here are the results for the input sentence “The dog [MASK] on the mat.”
That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. For more information about onboarding and compiling models on your DL2q instance, see the Cloud AI 100 Tutorial documentation.
For more information about which DL model architectures are a good fit for AWS DL2q instances and the current model compatibility matrix, see the Qualcomm Cloud AI 100 documentation.
Available now
You can launch DL2q instances today in the AWS US West (Oregon) and Europe (Frankfurt) Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 Pricing.
DL2q instances can be deployed using AWS Deep Learning AMI (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.
For more information, visit the Amazon EC2 DL2q instance page and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.
About the authors
AK Roy is a Director of Product Management at Qualcomm for cloud and data center AI products and solutions. He has over 20 years of product strategy and development experience, with a current focus on best-in-class performance and price-performance end-to-end solutions for AI inference in the cloud, for a wide range of use cases, including GenAI, LLMs, automotive, and hybrid AI.
Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). He has over 15 years of working experience in the HPC and AI fields. At AWS, he focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. He is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.