This is a guest post by AK Roy from Qualcomm AI.
Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-effectively deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.
With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can also use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used in smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.
Highlights of the new DL2q instance
Each DL2q instance incorporates eight Qualcomm Cloud AI 100 accelerators, with an aggregate of more than 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has a total of 112 AI cores, an accelerator memory capacity of 128 GB, and an accelerator memory bandwidth of 1.1 TB per second.
Each DL2q instance has 96 vCPUs, a system memory capacity of 768 GB, and supports 100 Gbps of network bandwidth as well as 19 Gbps of Amazon Elastic Block Store (Amazon EBS) bandwidth.
| Instance name | vCPUs | Cloud AI 100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage bandwidth (Amazon EBS) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps |
Qualcomm Cloud AI 100 Accelerator Innovation
The Cloud AI 100 accelerator system-on-chip (SoC) is a purpose-built, scalable multicore architecture that supports a wide range of deep learning use cases spanning from the data center to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-chip SRAM capacity of 126 MB. The cores are interconnected with a low-latency, high-bandwidth network-on-chip (NoC) mesh.
The AI 100 accelerator supports a broad and comprehensive range of models and use cases. The following table highlights the range of model support.
| Model category | Number of models | Examples |
| --- | --- | --- |
| NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE |
| Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen |
| Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP |
| CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet |
| CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet |
| CV – Other | 15 | LPRNet, Super Resolution/SRGAN, ByteTrack |
| Automotive networks* | 53 | LIDAR, pedestrian, lane, and traffic light detection |
| Total | >300 | |
* Most automotive networks are composite networks consisting of an amalgamation of individual networks.
The large SRAM built into the DL2q accelerator enables efficient implementation of advanced performance techniques, such as MX6 micro-exponent precision for storing weights and MX9 micro-exponent precision for inter-accelerator communication. Micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.
The instance user can use the following strategy to maximize performance per cost:
- Store weights using MX6 micro-exponent precision in the accelerator’s DDR memory. Using MX6 precision maximizes the utilization of the available memory capacity and memory bandwidth to deliver best-in-class performance and latency.
- Compute in FP16 to deliver the accuracy the use case requires, while using the large on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
- Use an optimized batching strategy and a larger batch size by using the large on-chip SRAM to maximize the reuse of weights, while keeping activations on-chip as much as possible.
DL2q AI Stack and Toolchain
The DL2q instance is accompanied by the Qualcomm AI Stack, which offers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI Stack and base AI technology run on DL2q instances and Qualcomm edge devices, giving customers a consistent developer experience with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.
The toolchain allows the instance user to quickly ingest a pre-trained model, build and optimize the model for the instance’s capabilities, and then deploy the built models to production inference use cases in three steps shown in the following figure.
For more information about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters documentation.
Get started with DL2q instances
In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using an available pre-built DL2q AMI, in four steps.
You can use a pre-built Qualcomm DLAMI on your instance, or start with an Amazon Linux 2 AMI and build your own DL2q AMI using the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.
Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI and follow steps 1 through 4.
Step 1. Configure the environment and install the necessary packages
- Install Python 3.8.
- Set up the Python 3.8 virtual environment.
- Activate the Python 3.8 virtual environment.
- Install the necessary packages, listed in the requirements.txt file of the Model-Onboarding-Beginner tutorial on Qualcomm’s public GitHub site.
- Import the necessary libraries. (A minimal sketch of this setup follows this list.)
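The exact package list comes from the requirements.txt file linked above. As a point of reference, here is a minimal sketch of what the setup and imports can look like; the virtual environment name and the specific packages shown are illustrative assumptions.

```python
# Minimal sketch of Step 1. Assumes the packages from the tutorial's
# requirements.txt (for example, torch, transformers, and numpy) are installed
# into an activated Python 3.8 virtual environment, for example:
#   python3.8 -m venv qaic_env
#   source qaic_env/bin/activate
#   pip install -r requirements.txt
import os

import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hugging Face model card used throughout this walkthrough.
model_card = "bert-base-cased"
```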
Step 2. Import the model
- Import and tokenize the model.
- Define a sample input and extract the `inputIds` and `attentionMask`.
- Convert the model to ONNX, which can then be passed to the compiler.
- The model will run at FP16 precision, so check whether the model contains any constants beyond the FP16 range. Pass the model to the `fix_onnx_fp16` function to generate a new ONNX file with the necessary corrections. (A sketch covering these steps follows this list.)
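For concreteness, here is a sketch of what these sub-steps can look like with the Hugging Face transformers API and torch.onnx.export, building on the imports from Step 1. The export settings (input names, dynamic axes, opset) and the commented-out fix_onnx_fp16 call are assumptions; the actual helper ships with the tutorial utilities on Qualcomm’s GitHub site.

```python
# Import and tokenize the model (model_card is defined in Step 1).
model = AutoModelForMaskedLM.from_pretrained(model_card)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_card)

# Define a sample input and extract the inputIds and attentionMask.
sentence = f"The dog {tokenizer.mask_token} on the mat."
encoding = tokenizer(sentence, max_length=128, padding="max_length",
                     truncation=True, return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]

# Convert the model to ONNX so it can be passed to the compiler.
os.makedirs(f"{model_card}/generatedModels", exist_ok=True)
onnx_path = f"{model_card}/generatedModels/{model_card}.onnx"
torch.onnx.export(
    model,
    (input_ids, attention_mask),
    onnx_path,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "seq_len"},
        "attention_mask": {0: "batch_size", 1: "seq_len"},
    },
    opset_version=13,
)

# The model runs at FP16 precision on the accelerator, so constants outside the
# FP16 range must be fixed before compiling. fix_onnx_fp16 is provided by the
# tutorial utilities; its import path and signature here are assumptions.
# from onnx_helpers import fix_onnx_fp16
# fixed_onnx_path = fix_onnx_fp16(onnx_path)
```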
Step 3. Compile the model
The `qaic-exec` command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in Step 2. The compiler produces a binary file (called a QPC, for Qualcomm program container) in the path defined by the `-aic-binary-dir` argument.
In the following compile command, you use four AI compute cores and a batch size of one to compile the model.
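The following is a sketch of that compile invocation, wrapped in subprocess so the walkthrough stays in Python. Only the -aic-binary-dir argument is named in this post; the qaic-exec path, the remaining flags, and the name of the FP16-corrected ONNX file are recalled from the Cloud AI 100 SDK tutorials and should be verified against the SDK installed on your instance.

```python
import subprocess

# FP16-corrected ONNX file from Step 2 and the target QPC output directory.
# The ONNX file name is inferred from the QPC directory name; adjust as needed.
fixed_onnx = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx"
qpc_dir = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"

compile_cmd = [
    "/opt/qti-aic/exec/qaic-exec",       # qaic-exec CLI installed with the SDK
    f"-m={fixed_onnx}",                  # input ONNX model
    "-aic-hw",                           # target the Cloud AI 100 hardware
    "-convert-to-fp16",                  # run the model at FP16 precision
    "-onnx-define-symbol=batch_size,1",  # batch size of one
    "-onnx-define-symbol=seq_len,128",   # sequence length used during tokenization
    "-aic-num-cores=4",                  # four AI compute cores
    "-compile-only",
    f"-aic-binary-dir={qpc_dir}",        # where the QPC binary is written
]
subprocess.run(compile_cmd, check=True)
```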
The QPC is generated in the `bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc` folder.
Step 4. Run the model
Configure a session to run inference on a Qualcomm Cloud AI 100 accelerator in the DL2q instance.
The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI 100 accelerator.
- Use the Session API call to create a session instance. The Session API call is the entry point to using the qaic Python library.
- Restructure the output buffer data with `output_shape` and `output_type`.
- Decode the produced output. (A sketch of these steps follows this list.)
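Putting those calls together, here is a sketch that uses the qaic Python package. The Session keyword arguments, the shape/type lookup attributes, and the programqpc.bin file name follow the Cloud AI 100 SDK tutorial examples and may differ between SDK versions, so treat the exact names as assumptions.

```python
import numpy as np
import qaic  # Qualcomm Cloud AI 100 Python inference library

# Path to the QPC produced in Step 3 (binary file name assumed from the SDK tutorials).
qpc_path = ("bert-base-cased/generatedModels/"
            "bert-base-cased_fix_outofrange_fp16_qpc/programqpc.bin")

# Create a session instance; the Session API call is the entry point to the library.
bert_sess = qaic.Session(model_path=qpc_path, num_activations=1)
bert_sess.setup()  # load the network onto the accelerator

# Query the buffer shapes and dtypes the compiled program expects.
input_shape, input_type = bert_sess.model_input_shape_dict["input_ids"]
attn_shape, attn_type = bert_sess.model_input_shape_dict["attention_mask"]
output_shape, output_type = bert_sess.model_output_shape_dict["logits"]

# Reuse the tokenized sample from Step 2, cast to the dtypes the QPC expects.
input_dict = {
    "input_ids": input_ids.numpy().astype(input_type),
    "attention_mask": attention_mask.numpy().astype(attn_type),
}

# Run inference on the Cloud AI 100 accelerator, then restructure the raw output
# buffer with output_shape and output_type.
raw_output = bert_sess.run(input_dict)
logits = np.frombuffer(raw_output["logits"], dtype=output_type).reshape(output_shape)

# Decode the produced output: take the highest-scoring token at the masked position.
mask_index = int((input_ids[0] == tokenizer.mask_token_id).nonzero()[0])
predicted_id = int(logits[0, mask_index].argmax())
print(tokenizer.decode([predicted_id]))
```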
Here are the results for the input sentence “The dog [MASK] on the mat.”
That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. For more information about onboarding and compiling models on your DL2q instance, see the Cloud AI 100 Tutorial documentation.
For more information about which DL model architectures are a good fit for AWS DL2q instances and the current model compatibility matrix, see the Qualcomm Cloud AI 100 documentation.
Available now
You can launch DL2q instances today in the AWS US West (Oregon) and Europe (Frankfurt) Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 Pricing.
DL2q instances can be deployed using AWS Deep Learning AMI (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.
For more information, visit the Amazon EC2 DL2q instance page and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.
About the authors
AK Roy is a Director of Product Management at Qualcomm for cloud and data center AI products and solutions. He has over 20 years of product strategy and development experience, with a current focus on best-in-class performance and price-performance end-to-end solutions for AI inference in the cloud, for a wide range of use cases, including GenAI, LLMs, automotive, and hybrid AI.
Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). He has over 15 years of working experience in the HPC and AI fields. At AWS, he focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. He is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.