Achieve high performance at scale for model serving using Amazon SageMaker multi-model endpoints with GPU

Amazon SageMaker multi-model endpoints (MMEs) provide a scalable and cost-effective way to deploy a large number of machine learning (ML) models. They give you the ability to deploy multiple ML models in a single serving container behind a single endpoint. From there, SageMaker manages loading and unloading the models and scaling resources on your behalf based on your traffic patterns. You benefit from sharing and reusing hosting resources and from a reduced operational burden of managing a large quantity of models.

In November 2022, MMEs added support for GPUs, which allows you to run multiple models on a single GPU device and scale GPU instances behind a single endpoint. This satisfies the strong MME demand for deep neural network (DNN) models that benefit from accelerated compute with GPUs. These include computer vision (CV), natural language processing (NLP), and generative AI models. The reasons for the demand include the following:

DNN models are typically large in size and complexity and continue growing at a rapid pace. Taking NLP models as an example, many of them exceed billions of parameters, which requires GPUs to satisfy low latency and high throughput requirements.
We have observed an increased need for customizing these models to deliver hyper-personalized experiences to individual users. As the quantity of these models increases, there is a need for an easier solution to deploy and operationalize many models at scale.
GPU instances are expensive and you want to reuse these instances as much as possible to maximize the GPU utilization and reduce operating cost.

Although all these reasons point to MMEs with GPU as an ideal option for DNN models, it’s advised to perform load testing to find the right endpoint configuration that satisfies your use case requirements. Many factors can influence the load testing results, such as instance type, number of instances, model size, and model architecture. In addition, load testing can help guide the auto scaling strategies using the right metrics rather than iterative trial and error methods.

For those reasons, we put together this post to help you perform proper load testing on MMEs with GPU and find the best configuration for your ML use case. We share our load testing results for some of the most popular DNN models in NLP and CV hosted using MMEs on different instance types. We summarize the insights and conclusions from our test results to help you make an informed decision on configuring your own deployments. Along the way, we also share our recommended approach to performing load testing for MMEs on GPU. The recommended tools and techniques help you determine the optimum number of models that can be loaded per instance type and achieve the best price-performance.

Solution overview

For an introduction to MMEs and MMEs with GPU, refer to Create a Multi-Model Endpoint and Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints. For the context of load testing in this post, you can download our sample code from the GitHub repo to reproduce the results or use it as a template to benchmark your own models. There are two notebooks provided in the repo: one for load testing CV models and another for NLP. Several models of varying sizes and architectures were benchmarked on different types of GPU instances: ml.g4dn.2xlarge, ml.g5.2xlarge, and ml.p3.2xlarge. This should provide a reasonable cross section of performance across the following metrics for each instance and model type:

Max number of models that can be loaded into GPU memory
End-to-end response latency observed on the client side for each inference query
Max throughput of queries per second that the endpoint can process without error
Max concurrent users per instance before a failed request is observed

The following table lists the models tested.

Use Case	Model Name	Size On Disk	Number of Parameters
CV	`resnet50`	100 MB	25M
CV	`convnext_base`	352 MB	88M
CV	`vit_large_patch16_224`	1.2 GB	304M
NLP	`bert-base-uncased`	436 MB	109M
NLP	`roberta-large`	1.3 GB	335M

The following table lists the GPU instances tested.

Instance Type	GPU Type	Num of GPUs	GPU Memory (GiB)
ml.g4dn.2xlarge	NVIDIA T4 Tensor Core GPU	1	16
ml.g5.2xlarge	NVIDIA A10G Tensor Core GPU	1	24
ml.p3.2xlarge	NVIDIA® V100 Tensor Core GPU	1	16

As previously mentioned, the code example can be adapted to other models and instance types.

Note that MMEs currently only support single GPU instances. For the list of supported instance types, refer to Supported algorithms, frameworks, and instances.

The benchmarking procedure is comprised of the following steps:

Retrieve a pre-trained model from a model hub.
Prepare the model artifact for serving on SageMaker MMEs (see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints for more details).
Deploy a SageMaker MME on a GPU instance.
Determine the maximum number of models that can be loaded into the GPU memory within a specified threshold.
Use the Locust Load Testing Framework to simulate traffic that randomly invokes models loaded on the instance.
Collect data and analyze the results.
Optionally, repeat Steps 2–6 after compiling the model to TensorRT.

Steps 4 and 5 warrant a deeper look. Models within a SageMaker GPU MME are loaded into memory in a dynamic fashion. Therefore, in Step 4, we upload an initial model artifact to Amazon Simple Storage Service (Amazon S3) and invoke the model to load it into memory. After the initial invocation, we measure the amount of GPU memory consumed, make a copy of the initial model, invoke the copy of the model to load it into memory, and again measure the total amount of GPU memory consumed. This process is repeated until a specified percent threshold of GPU memory utilization is reached. For the benchmark, we set the threshold to 90% to provide a reasonable memory buffer for inferencing on larger batches or leaving some space to load other less-frequently used models.
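The loop in Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's code: `load_copy` and `gpu_memory_percent` are hypothetical callables standing in for "copy the artifact in Amazon S3 and invoke it" and "read the current GPU memory utilization", and the simulated GPU at the bottom uses made-up numbers so the sketch runs without AWS access.

```python
from typing import Callable


def find_max_models(
    load_copy: Callable[[int], None],
    gpu_memory_percent: Callable[[], float],
    threshold: float = 90.0,
    hard_limit: int = 1000,
) -> int:
    """Load model copies one at a time until GPU memory reaches the threshold.

    Returns the number of copies loaded when the loop stops.
    """
    loaded = 0
    while loaded < hard_limit and gpu_memory_percent() < threshold:
        load_copy(loaded)  # hypothetical: create copy number `loaded` and invoke it
        loaded += 1
    return loaded


class FakeGpu:
    """Simulated GPU where every model copy uses a fixed share of memory (made-up)."""

    def __init__(self, percent_per_model: float) -> None:
        self.percent_per_model = percent_per_model
        self.models = 0

    def load(self, index: int) -> None:
        self.models += 1

    def used_percent(self) -> float:
        return self.models * self.percent_per_model


gpu = FakeGpu(percent_per_model=4.0)
max_models = find_max_models(gpu.load, gpu.used_percent)
print(f"Loaded {max_models} models at {gpu.used_percent():.0f}% GPU memory")
```

The `hard_limit` argument is only a safety stop for the sketch, so that a measurement that never rises cannot cause an endless loop.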

Simulate user traffic

After we have determined the number of models, we can run a load test using the Locust Load Testing Framework. The load test simulates user requests to random models and automatically measures metrics such as response latency and throughput.
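A single simulated request of this kind might look like the following sketch. It assumes a client that exposes `invoke_endpoint` with the `EndpointName`, `ContentType`, `TargetModel`, and `Body` arguments, as the boto3 `sagemaker-runtime` client does. The endpoint name, artifact names, content type, and payload are placeholders, and a recording stand-in replaces the real client so the example runs without AWS access.

```python
import random
import time
from typing import Any, Sequence, Tuple


def invoke_random_model(
    client: Any,
    endpoint_name: str,
    model_names: Sequence[str],
    payload: bytes,
    rng: random.Random,
) -> Tuple[str, float]:
    """Invoke one randomly chosen model and return (model name, latency in ms)."""
    target = rng.choice(model_names)
    start = time.perf_counter()
    client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",  # placeholder content type
        TargetModel=target,
        Body=payload,
    )
    latency_ms = (time.perf_counter() - start) * 1000.0
    return target, latency_ms


class RecordingClient:
    """Stand-in for the real runtime client; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls = []

    def invoke_endpoint(self, **kwargs: Any) -> dict:
        self.calls.append(kwargs)
        return {}


client = RecordingClient()
models = [f"model-{i}.tar.gz" for i in range(3)]  # hypothetical artifact names
target, latency_ms = invoke_random_model(
    client, "my-mme-endpoint", models, b"{}", random.Random(0)
)
print(f"Invoked {target} in {latency_ms:.3f} ms")
```

Measuring latency around the call on the client side matches the end-to-end latency metric described earlier, which includes network and queuing time rather than model execution time alone.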

Locust supports custom load test shapes that allow you to define custom traffic patterns. The shape that was used in this benchmark is shown in the following chart. In the first 30 seconds, the endpoint is warmed up with 10 concurrent users. After 30 seconds, new users are spawned at a rate of one per second, reaching 20 concurrent users at the 40-second mark. The endpoint is then benchmarked steadily with 20 concurrent users until the 60-second mark, at which point Locust again begins to ramp up users, this time at two per second, until it reaches 40 concurrent users. This pattern of ramping up and steady testing is repeated until the endpoint is ramped up to 200 concurrent users. Depending on your use case, you may want to adjust the load test shape in locust_benchmark_sm.py to more accurately reflect your expected traffic patterns. For example, if you intend to host larger language models, a load test with 200 concurrent users may not be feasible for a model hosted on a single instance, and you may therefore want to reduce the user count or increase the number of instances. You may also want to extend the duration of the load test to more accurately gauge the endpoint’s stability over a longer period of time.

stages = [
    {"duration": 30, "users": 10, "spawn_rate": 5},
    {"duration": 60, "users": 20, "spawn_rate": 1},
    {"duration": 90, "users": 40, "spawn_rate": 2},
    …
]
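How a stages list like this is turned into a traffic pattern can be sketched as follows. The sketch assumes that each "duration" is the cumulative time at which a stage ends, which is the convention used in Locust's staged load shape examples. It includes only the three stages shown above, and `tick_for` is a hypothetical helper name rather than a function from the repo.

```python
from typing import Optional, Sequence, Tuple

# Only the three stages shown in the post; the remaining stages are not listed there.
STAGES = [
    {"duration": 30, "users": 10, "spawn_rate": 5},
    {"duration": 60, "users": 20, "spawn_rate": 1},
    {"duration": 90, "users": 40, "spawn_rate": 2},
]


def tick_for(run_time: float, stages: Sequence[dict] = STAGES) -> Optional[Tuple[int, int]]:
    """Return (target users, spawn rate) for the stage that covers run_time.

    Returns None once every stage has finished, which is the value a Locust
    load shape returns to stop the test.
    """
    for stage in stages:
        if run_time < stage["duration"]:
            return stage["users"], stage["spawn_rate"]
    return None


# In a locustfile, this logic would typically live in a LoadTestShape subclass:
#
#   from locust import LoadTestShape
#
#   class StagesShape(LoadTestShape):
#       def tick(self):
#           return tick_for(self.get_run_time())

for t in (0, 35, 75, 120):
    print(t, tick_for(t))
```

Keeping the stage lookup in a plain function makes it easy to check a new traffic shape before running a full load test against an endpoint.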

Note that we have only benchmarked the endpoint with homogeneous models all running on a consistent serving base using either PyTorch or TensorRT. This is because MMEs are best suited for hosting many models with similar characteristics, such as memory consumption and response time. The benchmarking templates provided in the GitHub repo can still be used to determine whether serving heterogeneous models on MMEs would yield the desired performance and stability.

Benchmark results for CV models

Use the cv-benchmark.ipynb notebook to run load testing for computer vision models. You can adjust the pre-trained model name and instance type parameters to perform load testing on different model and instance type combinations. We purposely tested three CV models in different size ranges from smallest to largest: resnet50 (25M), convnext_base (88M), and vit_large_patch16_224 (304M). You may need to adjust the code if you pick a model outside of this list. Additionally, the notebook defaults the input image shape to a 224x224x3 image tensor. Remember to adjust the input shape accordingly if you need to benchmark models that take a different-sized image.

After running through the entire notebook, you will get several performance analysis visualizations. The first two detail the model performance with respect to increasing concurrent users. The following figures are the example visualizations generated for the ResNet50 model running on ml.g4dn.2xlarge, comparing PyTorch (left) vs. TensorRT (right). The top line graphs show the model latency and throughput on the y-axis with increasing numbers of concurrent client workers reflected on the x-axis. The bottom bar charts show the count of successful and failed requests.

Looking across all the computer vision models we tested, we observed the following:

Latency (in milliseconds) is generally higher, and throughput (requests per second) is lower, for bigger models (in order of increasing size: resnet50, convnext_base, vit_large_patch16_224).
Latency increases with the number of users as more requests are queued up on the inference server.
Large models consume more compute resources and can reach their maximum throughput limits with fewer users than a smaller model. This is observed with the vit_large_patch16_224 model, which recorded the first failed request at 140 concurrent users. Being significantly larger than the other two models tested, it had the most overall failed requests at higher concurrency as well. This is a clear signal that the endpoint would need to scale beyond a single instance if the intent is to support more than 140 concurrent users.
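Finding the concurrency level at which the first failed request appears can be sketched as follows. The per-stage results here are hypothetical, not measurements from the benchmark, and `first_failure_level` is an illustrative helper name rather than a function from the notebooks.

```python
from typing import Iterable, Optional, Tuple


def first_failure_level(stage_results: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Return the concurrent-user level at which the first failed request appears.

    stage_results holds (concurrent_users, failed_requests) pairs in the order
    the load test ran them. Returns None if no request failed.
    """
    for users, failures in stage_results:
        if failures > 0:
            return users
    return None


# Hypothetical per-stage results, not measurements from the benchmark.
results = [(100, 0), (120, 0), (140, 3), (160, 11)]
print(first_failure_level(results))
```

A level found this way is a natural starting point for deciding when the endpoint needs to scale beyond a single instance.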

At the end of the notebook run, you also get a summary comparison of PyTorch vs. TensorRT models for each of the four key metrics. From our benchmark testing, the CV models all saw a boost in model performance after TensorRT compilation. Taking our ResNet50 model as the example again, latency decreased by 32% while throughput increased by 18%. Although the maximum number of concurrent users stayed the same for ResNet50, the other two models both saw a 14% improvement in the number of concurrent users that they can support. The TensorRT performance improvement, however, came at the expense of higher memory utilization, resulting in fewer models loaded by MMEs. The impact is greater for models using a convolutional neural network (CNN). In fact, our ResNet50 model consumed approximately twice the GPU memory going from PyTorch to TensorRT, resulting in 50% fewer models loaded (46 vs. 23). We diagnose this behavior further in the section on memory utilization.

Benchmark results for NLP models

For the NLP models, use the nlp-benchmark.ipynb notebook to run the load test. The setup of the notebook should look very similar. We tested two NLP models: bert-base-uncased (109M) and roberta-large (335M). The pre-trained model and the tokenizer are both downloaded from the Hugging Face hub, and the test payload is generated from the tokenizer using a sample string. The maximum sequence length defaults to 128. If you need to test longer strings, remember to adjust that parameter. Running through the NLP notebook generates the same set of visualizations: PyTorch (left) vs. TensorRT (right).

From these, we observed an even greater performance benefit from TensorRT for NLP models. Taking the roberta-large model on an ml.g4dn.2xlarge instance as an example, inference latency decreased dramatically from 187 milliseconds to 56 milliseconds (a 70% improvement), while throughput improved by 406% from 33 requests per second to 167. Additionally, the maximum number of concurrent users increased by 50%; failed requests were not observed until we reached 180 concurrent users, compared to 120 for the original PyTorch model. In terms of memory utilization, we saw one fewer model loaded for TensorRT (from nine models to eight). However, the negative impact is much smaller compared to what we observed with the CNN-based models.

Analysis on memory utilization

The following table shows the full analysis of the memory utilization impact of going from PyTorch to TensorRT. We mentioned earlier that CNN-based models are impacted more negatively. The resnet50 model had a reduction of approximately 50% in the number of models loaded across all three GPU instance types. convnext_base had an even larger reduction at approximately 70% across the board. On the other hand, the impact on the transformer models is small or mixed. vit_large_patch16_224 and roberta-large had an average reduction of approximately 20% and 3%, respectively, while bert-base-uncased had an approximately 40% improvement.

Looking at all the data points as a whole, and weighing the superior latency, throughput, and reliability against the minor impact on the maximum number of models loaded, we recommend the TensorRT model for transformer-based model architectures. For CNNs, we believe further cost-performance analysis is needed to make sure the performance benefit outweighs the cost of additional hosting infrastructure.

ML Use Case	Architecture	Model Name	Instance Type	Framework	Max Models Loaded	Diff (%)	Avg. Diff (%)
CV	CNN	`resnet50`	ml.g4dn.2xlarge	PyTorch	46	-50%	-50%
				TensorRT	23
			ml.g5.2xlarge	PyTorch	70	-51%
				TensorRT	34
			ml.p3.2xlarge	PyTorch	49	-51%
				TensorRT	24
		`convnext_base`	ml.g4dn.2xlarge	PyTorch	33	-70%	-70%
				TensorRT	10
			ml.g5.2xlarge	PyTorch	50	-68%
				TensorRT	16
			ml.p3.2xlarge	PyTorch	35	-69%
				TensorRT	11
	Transformer	`vit_large_patch16_224`	ml.g4dn.2xlarge	PyTorch	10	-30%	-20%
				TensorRT	7
			ml.g5.2xlarge	PyTorch	15	-13%
				TensorRT	13
			ml.p3.2xlarge	PyTorch	11	-18%
				TensorRT	9
NLP		`roberta-large`	ml.g4dn.2xlarge	PyTorch	9	-11%	-3%
				TensorRT	8
			ml.g5.2xlarge	PyTorch	13	0%
				TensorRT	13
			ml.p3.2xlarge	PyTorch	9	0%
				TensorRT	9
		`bert-base-uncased`	ml.g4dn.2xlarge	PyTorch	26	62%	40%
				TensorRT	42
			ml.g5.2xlarge	PyTorch	39	28%
				TensorRT	50
			ml.p3.2xlarge	PyTorch	28	29%
				TensorRT	36
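The Diff (%) values can be reproduced as the relative change from the PyTorch value to the TensorRT value. The following is a minimal sketch of that arithmetic using the max-models figures for ml.g4dn.2xlarge from the table above. Rounding to a whole percent is an assumption about how the published figures were derived.

```python
def percent_diff(pytorch_value: float, tensorrt_value: float) -> int:
    """Relative change from the PyTorch value to the TensorRT value, in whole percent."""
    return round((tensorrt_value - pytorch_value) / pytorch_value * 100)


# (PyTorch, TensorRT) max models loaded on ml.g4dn.2xlarge, from the table above.
max_models = {
    "resnet50": (46, 23),
    "convnext_base": (33, 10),
    "vit_large_patch16_224": (10, 7),
    "roberta-large": (9, 8),
    "bert-base-uncased": (26, 42),
}

diffs = {name: percent_diff(pt, trt) for name, (pt, trt) in max_models.items()}
for name, diff in diffs.items():
    print(f"{name}: {diff:+d}%")
```

The same function applies to the latency, throughput, and concurrent-user columns in the tables that follow, where a negative latency difference is an improvement.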

The following tables list our complete benchmark results for all the metrics across all three GPU instance types.

ml.g4dn.2xlarge
Use Case	Architecture	Model Name	Number of Parameters	Framework	Max Models Loaded	Diff (%)	Latency (ms)	Diff (%)	Throughput (qps)	Diff (%)	Max Concurrent Users	Diff (%)
CV	CNN	`resnet50`	25M	PyTorch	46	-50%	164	-32%	120	18%	180	0%
		`resnet50`	25M	TensorRT	23	.	111	.	142	.	180	.
		`convnext_base`	88M	PyTorch	33	-70%	154	-22%	64	102%	140	14%
		`convnext_base`	88M	TensorRT	10	.	120	.	129	.	160	.
	Transformer	`vit_large_patch16_224`	304M	PyTorch	10	-30%	425	-69%	26	304%	140	14%
		`vit_large_patch16_224`	304M	TensorRT	7	.	131	.	105	.	160	.
NLP		`bert-base-uncased`	109M	PyTorch	26	62%	70	-39%	105	142%	140	29%
		`bert-base-uncased`	109M	TensorRT	42	.	43	.	254	.	180	.
		`roberta-large`	335M	PyTorch	9	-11%	187	-70%	33	406%	120	50%
		`roberta-large`	335M	TensorRT	8	.	56	.	167	.	180	.

ml.g5.2xlarge
Use Case	Architecture	Model Name	Number of Parameters	Framework	Max Models Loaded	Diff (%)	Latency (ms)	Diff (%)	Throughput (qps)	Diff (%)	Max Concurrent Users	Diff (%)
CV	CNN	`resnet50`	25M	PyTorch	70	-51%	159	-31%	146	14%	180	11%
		`resnet50`	25M	TensorRT	34	.	110	.	166	.	200	.
		`convnext_base`	88M	PyTorch	50	-68%	149	-23%	134	13%	180	0%
		`convnext_base`	88M	TensorRT	16	.	115	.	152	.	180	.
	Transformer	`vit_large_patch16_224`	304M	PyTorch	15	-13%	149	-22%	105	35%	160	25%
		`vit_large_patch16_224`	304M	TensorRT	13	.	116	.	142	.	200	.
NLP		`bert-base-uncased`	109M	PyTorch	39	28%	65	-29%	183	38%	180	11%
		`bert-base-uncased`	109M	TensorRT	50	.	46	.	253	.	200	.
		`roberta-large`	335M	PyTorch	13	0%	97	-38%	121	46%	140	14%
		`roberta-large`	335M	TensorRT	13	.	60	.	177	.	160	.

ml.p3.2xlarge
Use Case	Architecture	Model Name	Number of Parameters	Framework	Max Models Loaded	Diff (%)	Latency (ms)	Diff (%)	Throughput (qps)	Diff (%)	Max Concurrent Users	Diff (%)
CV	CNN	`resnet50`	25M	PyTorch	49	-51%	197	-41%	94	18%	160	-12%
		`resnet50`	25M	TensorRT	24	.	117	.	111	.	140	.
		`convnext_base`	88M	PyTorch	35	-69%	178	-23%	89	11%	140	14%
		`convnext_base`	88M	TensorRT	11	.	137	.	99	.	160	.
	Transformer	`vit_large_patch16_224`	304M	PyTorch	11	-18%	186	-28%	83	23%	140	29%
		`vit_large_patch16_224`	304M	TensorRT	9	.	134	.	102	.	180	.
NLP		`bert-base-uncased`	109M	PyTorch	28	29%	77	-40%	133	59%	140	43%
		`bert-base-uncased`	109M	TensorRT	36	.	46	.	212	.	200	.
		`roberta-large`	335M	PyTorch	9	0%	108	-44%	88	60%	160	0%
		`roberta-large`	335M	TensorRT	9	.	61	.	141	.	160	.

The following table summarizes the results across all instance types. The ml.g5.2xlarge instance provides the best performance, whereas the ml.p3.2xlarge instance generally underperforms despite being the most expensive of the three. The g5 and g4dn instances demonstrate the best value for inference workloads.

Use Case	Architecture	Model Name	Number of Parameters	Framework	Instance Type	Max Models Loaded	Diff (%)	Latency (ms)	Diff (%)	Throughput (qps)	Diff (%)	Max Concurrent Users
CV	CNN	`resnet50`	25M	PyTorch	ml.g5.2xlarge	70	.	159	.	146	.	180
.	.	.	.	.	ml.p3.2xlarge	49	.	197	.	94	.	160
.	.	.	.	.	ml.g4dn.2xlarge	46	.	164	.	120	.	180
CV	CNN	`resnet50`	25M	TensorRT	ml.g5.2xlarge	34	-51%	110	-31%	166	14%	200
.	.	.	.	.	ml.p3.2xlarge	24	-51%	117	-41%	111	18%	200
.	.	.	.	.	ml.g4dn.2xlarge	23	-50%	111	-32%	142	18%	180
NLP	Transformer	`bert-base-uncased`	109M	PyTorch	ml.g5.2xlarge	39	.	65	.	183	.	180
.	.	.	.	.	ml.p3.2xlarge	28	.	77	.	133	.	140
.	.	.	.	.	ml.g4dn.2xlarge	26	.	70	.	105	.	140
NLP	Transformer	`bert-base-uncased`	109M	TensorRT	ml.g5.2xlarge	50	28%	46	-29%	253	38%	200
.	.	.	.	.	ml.p3.2xlarge	36	29%	46	-40%	212	59%	200
.	.	.	.	.	ml.g4dn.2xlarge	42	62%	43	-39%	254	142%	180

Clean up

After you complete your load test, clean up the generated resources to avoid incurring additional charges. The main resources are the SageMaker endpoints and model artifact files in Amazon S3. To make it easy for you, the notebook files have the following cleanup code to help you delete them:

delete_endpoint(sm_client, sm_model_name, endpoint_config_name, endpoint_name)

! aws s3 rm --recursive {trt_mme_path}
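The body of the notebook's `delete_endpoint` helper is not shown in this post. A helper of this kind might look like the following sketch, which assumes a client exposing the SageMaker `delete_endpoint`, `delete_endpoint_config`, and `delete_model` calls, as the boto3 `sagemaker` client does. The name `delete_endpoint_resources` and the resource names are hypothetical, and a recording stand-in replaces the real client so the example runs without AWS access.

```python
from typing import Any


def delete_endpoint_resources(
    sm_client: Any,
    model_name: str,
    endpoint_config_name: str,
    endpoint_name: str,
) -> None:
    """Delete the endpoint first, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for the real SageMaker client; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls = []

    def __getattr__(self, name: str):
        def record(**kwargs: Any) -> dict:
            self.calls.append((name, kwargs))
            return {}

        return record


client = RecordingClient()
delete_endpoint_resources(client, "my-model", "my-endpoint-config", "my-endpoint")
for name, kwargs in client.calls:
    print(name, kwargs)
```

Deleting the endpoint first stops the instance charges as early as possible. The model artifacts in Amazon S3 are removed separately by the `aws s3 rm` command.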

Conclusion

In this post, we shared our test results and analysis for various deep neural network models running on SageMaker multi-model endpoints with GPU. The results and insights we shared should provide a reasonable cross section of performance across different metrics and instance types. In the process, we also introduced our recommended approach to run benchmark testing for SageMaker MMEs with GPU. The tools and sample code we provided can help you quickstart your benchmark testing and make a more informed decision on how to cost-effectively host hundreds of DNN models on accelerated compute hardware. To get started with benchmarking your own models with MME support for GPU, refer to Supported algorithms, frameworks, and instances and the GitHub repo for additional examples and documentation.

About the authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, USA. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization, and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.