Originally, PyTorch used an eager mode where each PyTorch operation that forms the model is run independently as soon as it is reached. PyTorch 2.0 introduced torch.compile to speed up PyTorch code over the default eager mode. Unlike eager mode, torch.compile pre-compiles the entire model into a single graph in a manner that is optimal for running on a given hardware platform. AWS optimized the PyTorch torch.compile feature for AWS Graviton3 processors. This optimization results in up to 2x better performance for Hugging Face model inference (based on the geometric mean of performance improvements for 33 models) and up to 1.35x better performance for TorchBench model inference (geometric mean of performance improvements for 45 models) compared to the default eager mode inference across various natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Starting with PyTorch 2.3.1, the optimizations are available in the Torch Python wheels and in the AWS Graviton PyTorch Deep Learning Container (DLC).
In this blog post, we show how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve inference performance, and the resulting speedups.
Why torch.compile and what is its goal?
In eager mode, operators in a model are run immediately as they are encountered. It is easier to use and more suitable for machine learning (ML) researchers, and is therefore the default mode. However, eager mode incurs runtime overhead because of redundant kernel launches and memory reads. In torch.compile mode, by contrast, operators are first synthesized into a graph, wherein one operator is merged with another to reduce and localize memory reads and total kernel launch overhead.
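To make the distinction concrete, here is a minimal sketch (using a toy model rather than any of the benchmarked models) that runs the same inference once in eager mode and once through torch.compile:

```python
import torch
import torch.nn as nn

# Toy model for illustration; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
x = torch.randn(32, 512)

with torch.no_grad():
    eager_out = model(x)                    # eager mode: each operator runs as it is reached

    compiled_model = torch.compile(model)   # compile mode: operators are captured into a graph
    compiled_out = compiled_model(x)        # first call triggers compilation; later calls reuse the graph

print(torch.allclose(eager_out, compiled_out, atol=1e-5))
```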
The goal for the AWS Graviton team was to optimize the torch.compile backend for Graviton3 processors. PyTorch eager mode was already optimized for Graviton3 processors with Arm Compute Library (ACL) kernels using oneDNN (also known as MKLDNN). So, the question was how to reuse those kernels in torch.compile mode to get the best of graph compilation and the optimized kernel performance together.
Results
The AWS Graviton team extended the torch inductor and oneDNN primitives to reuse the ACL kernels and optimize torch.compile mode performance on Graviton3 processors. Starting with PyTorch 2.3.1, the optimizations are available in the Torch Python wheels and in the AWS Graviton PyTorch DLC. See the Run an inference section that follows for instructions on installation, runtime configuration, and how to run the tests.
To demonstrate the performance improvements, we used the TorchBench NLP, CV, and recommendation models, as well as the most downloaded Hugging Face NLP models across question answering, text classification, token classification, translation, zero-shot classification, summarization, feature extraction, text generation, text-to-text generation, fill mask, and sentence similarity tasks, to cover a wide variety of customer use cases.
We start by measuring the TorchBench model inference latency, in milliseconds (ms), for eager mode, which is marked as 1.0 with a red dotted line in the following graph. We then compare the torch.compile improvements for the same model inference; the normalized results are plotted in the graph. You can see that, for the 45 models we benchmarked, there is a 1.35x latency improvement (geometric mean for the 45 models).
Image 1: PyTorch model inference performance improvement with torch.compile on an AWS Graviton3-based c7g instance using the TorchBench framework. The baseline eager mode performance is marked as 1.0 (higher is better).
Similar to the preceding TorchBench inference performance graph, we start by measuring the Hugging Face NLP model inference latency, in ms, for eager mode, which is marked as 1.0 with a red dotted line in the following graph. We then compare the torch.compile improvements for the same model inference; the normalized results are plotted in the graph. You can see that, for the 33 models we benchmarked, there is around a 2x performance improvement (geometric mean for the 33 models).
Image 2: Hugging Face NLP model inference performance improvement with torch.compile on an AWS Graviton3-based c7g instance using Hugging Face example scripts. The baseline eager mode performance is marked as 1.0 (higher is better).
Run an inference
Starting with PyTorch 2.3.1, the optimizations are available in the Torch Python wheel and in the AWS Graviton PyTorch DLC. This section shows how to run inference in eager and torch.compile modes using Torch Python wheels and benchmarking scripts from the Hugging Face and TorchBench repositories.
To successfully run the scripts and reproduce the speedup numbers mentioned in this post, you need an instance from the Graviton3 family (c7g/r7g/m7g/hpc7g) of hardware. For this post, we used the c7g.4xl (16 vCPU) instance. The instance, AMI details, and required Torch library versions are mentioned in the following snippet.
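The exact instance, AMI, and wheel-version snippet is not reproduced here; as a hedged stand-in, the following sketch checks that the installed build matches the assumptions of this post (an arm64 Graviton host and PyTorch 2.3.1 or later with oneDNN/ACL support) before running the benchmarks:

```python
import platform
import torch

# Assumed prerequisites for this post: an arm64 (Graviton3) host and PyTorch >= 2.3.1.
print("Machine:", platform.machine())            # expect 'aarch64' on Graviton instances
print("PyTorch:", torch.__version__)             # expect 2.3.1 or later

# The Graviton optimizations come through oneDNN (MKLDNN) backed by the Arm Compute Library.
print("oneDNN available:", torch.backends.mkldnn.is_available())
print(torch.__config__.show())                   # build details, including the oneDNN configuration
```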
The generic runtime tunings implemented for eager mode inference are equally applicable to torch.compile mode, so we set the following environment variables to further improve the torch.compile performance on AWS Graviton3 processors.
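The exact settings from the original snippet are not shown above; as an illustrative sketch, the following sets runtime knobs commonly recommended for PyTorch CPU inference on Graviton3 (treat the specific values as assumptions and tune them for your workload). Set them before torch is imported, either in the shell or at the top of your script as below:

```python
import os

# These must be set before torch (and oneDNN) are loaded, so place them at the very top of the
# script or export the same variables in the shell. The values here are illustrative defaults.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"   # allow oneDNN/ACL bfloat16 fastmath kernels on Graviton3
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"          # transparent huge pages for PyTorch tensor allocations
os.environ["LRU_CACHE_CAPACITY"] = "1024"         # cache freed CPU tensor allocations for reuse

import torch
torch.set_num_threads(16)                         # one thread per vCPU on a c7g.4xl (adjust to your instance)
```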
TorchBench Benchmark Scripts
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. We benchmarked 45 models using the scripts from the TorchBench repository. The following code shows how to run the scripts in eager mode and in compile mode with the inductor backend.
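The actual TorchBench commands are not reproduced here; the following is a simplified, hedged stand-in that mirrors what those scripts measure, timing the same inference in eager mode and with the inductor backend (the model and iteration counts are placeholders, not the benchmark configuration):

```python
import json
import time

import torch
import torch.nn as nn

def benchmark(fn, x, warmup=10, iters=50):
    """Return the average latency of fn(x) in milliseconds."""
    with torch.no_grad():
        for _ in range(warmup):
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0

# Placeholder model standing in for a TorchBench workload.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()
x = torch.randn(64, 1024)

eager_ms = benchmark(model, x)
compiled_model = torch.compile(model, backend="inductor")  # inductor is the default torch.compile backend
inductor_ms = benchmark(compiled_model, x)

print(json.dumps({
    "eager_latency_ms": round(eager_ms, 3),
    "inductor_latency_ms": round(inductor_ms, 3),
    "speedup": round(eager_ms / inductor_ms, 2),
}, indent=2))
```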
After the inference runs are complete, the script stores the results in JSON format. The sample output is shown below:
Hugging Face Benchmarking Scripts
Google's T5 Small Text Translation model is one of the Hugging Face NLP models we benchmarked. We are using it as a sample model to demonstrate how to run inference in both eager and torch.compile modes. The additional configurations and APIs required to run it in torch.compile mode are highlighted in bold. Save the following script as google_t5_small_text_translation.py.
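The full script from the post is not reproduced here; the following is a condensed sketch of such a script (it assumes the transformers and sentencepiece packages are installed, and the torch.compile-specific lines are called out with comments rather than bold; the exact lines in the original script may differ):

```python
# google_t5_small_text_translation.py (condensed sketch; assumes transformers and sentencepiece are installed)
import torch
from torch.profiler import ProfilerActivity, profile
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

# torch.compile-specific line: wrap the forward pass so generate() runs through the compiled graph.
model.forward = torch.compile(model.forward)

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")

with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        outputs = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```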
Run the script with the following steps.
After the inference runs are complete, the script prints the output of the Torch Profiler with the latency breakdown of the Torch operators. The following is the sample output of the Torch Profiler:
What's Next
Next, we will extend the torch inductor CPU backend support to compile the Llama model, and add support for fused GEMM kernels to enable torch inductor operator fusion optimization on AWS Graviton3 processors.
Conclusion
In this post, we covered how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve PyTorch model inference performance, and demonstrated the resulting speedups. We hope you will give it a try! If you need any support with ML software on Graviton, please open an issue on the AWS Graviton Technical Guide GitHub.
About the Author
Sunita Nadampalli is a Software Development Manager and AI/ML expert at AWS. She leads AWS Graviton software performance optimizations for AI/ML and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions for Arm ISA-based SoCs.