We are pleased to announce a new release of the Amazon SageMaker Operators for Kubernetes using AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources such as buckets, databases, or message queues simply by using the Kubernetes API.
Release v1.2.9 of the SageMaker ACK Operators adds support for inference components, which until now were only available through the SageMaker API and the AWS software development kits (SDKs). Inference components can help you optimize deployment costs and reduce latency. With the new inference component capabilities, you can deploy one or more foundation models (FMs) to the same Amazon SageMaker endpoint and control how many accelerators and how much memory are reserved for each FM. This helps improve resource utilization, reduces model deployment costs by 50% on average, and lets you scale endpoints together with your use cases. For more details, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.
The availability of inference components through the SageMaker controller allows customers using Kubernetes as a control plane to take advantage of inference components while deploying their models in SageMaker.
In this post, we show how to use SageMaker ACK Operators to deploy SageMaker inference components.
How does ACK work?
To demonstrate how ACK works, let's look at an example using Amazon Simple Storage Service (Amazon S3). In the following diagram, Alice is our Kubernetes user. Her application depends on the existence of an S3 bucket named my-bucket.
The workflow consists of the following steps:
- Alice issues a call to kubectl apply, passing a file (called a manifest) that describes a Kubernetes custom resource for her S3 bucket. kubectl apply passes the manifest to the Kubernetes API server running on the Kubernetes controller node.
- The Kubernetes API server receives the manifest describing the S3 bucket and determines whether Alice has permissions to create a custom resource of kind s3.services.k8s.aws/Bucket, and whether the custom resource is properly formatted.
- If Alice is authorized and the custom resource is valid, the Kubernetes API server writes the custom resource to its etcd data store. It then responds to Alice that the custom resource has been created.
- At this point, the ACK service controller for Amazon S3, which runs on a Kubernetes worker node within the context of a regular Kubernetes Pod, is notified that a new custom resource of kind s3.services.k8s.aws/Bucket has been created.
- The ACK service controller for Amazon S3 then communicates with the Amazon S3 API, calling the S3 CreateBucket API to create the bucket in AWS.
- After communicating with the Amazon S3 API, the ACK service controller calls the Kubernetes API server to update the custom resource's status with the information it received from Amazon S3.
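The manifest Alice applies in the first step might look like the following minimal sketch. The apiVersion and spec field are assumptions based on the ACK S3 controller's published custom resource definitions; check the controller's API reference for your installed version.

```yaml
# Sketch of an ACK Bucket custom resource; apiVersion is an assumption
# based on the ACK S3 controller's CRDs.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-bucket
spec:
  name: my-bucket   # the name of the S3 bucket to create in AWS
```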
Key components
The new inference capabilities are built on SageMaker real-time inference endpoints. As before, you create the SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. The model is configured in a new construct, the inference component. Here, you specify the number of accelerators and the amount of memory you want to allocate to each copy of a model, along with the model artifacts, container image, and number of model copies to deploy.
You can use the new inference capabilities from Amazon SageMaker Studio, the SageMaker Python SDK, the AWS SDKs, and the AWS Command Line Interface (AWS CLI). They are also supported by AWS CloudFormation. Now you can also use them with the SageMaker Operators for Kubernetes.
Solution overview
For this demo, we use the SageMaker ACK operator to deploy a copy of the Dolly v2 7B model and a copy of the FLAN-T5 XXL model from the Hugging Face Model Hub to a SageMaker real-time endpoint using the new inference capabilities.
Prerequisites
To follow along, you need a Kubernetes cluster with the SageMaker ACK operator v1.2.9 or above installed. For instructions on how to provision an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon Elastic Compute Cloud (Amazon EC2) managed Linux nodes using eksctl, see Getting started with Amazon EKS – eksctl. For instructions on installing the SageMaker operator, see Machine Learning with the ACK SageMaker Controller.
You need access to accelerated instances (GPUs) to host the LLMs. This solution uses one ml.g5.12xlarge instance; you can check the availability of these instances in your AWS account and request them as needed through a Service Quotas increase request, as shown in the following screenshot.
Create an inference component
To create your inference component, define the EndpointConfig, Endpoint, Model, and InferenceComponent YAML files, similar to the ones shown in this section. Use kubectl apply -f <yaml file> to create the Kubernetes resources.
You can check the status of a resource with kubectl describe <resource-type>; for example, kubectl describe inferencecomponent.
You can also create the inference component without a model resource. See the guide provided in the API documentation for more details.
EndpointConfig YAML
The following is the code for the EndpointConfig file:
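The original file is not reproduced here; the following is a minimal sketch, assuming the ACK SageMaker controller's v1alpha1 API, with placeholder resource names and execution role ARN. Note that the production variant reserves the instance capacity but references no model, because models are attached later through inference components.

```yaml
# Sketch of an EndpointConfig custom resource; field names follow the
# ACK SageMaker controller's CRDs, and names/ARN are placeholders.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: ic-demo-endpoint-config
spec:
  endpointConfigName: ic-demo-endpoint-config
  executionRoleARN: <SAGEMAKER_EXECUTION_ROLE_ARN>
  productionVariants:
  - variantName: AllTraffic
    instanceType: ml.g5.12xlarge        # 4 GPUs, shared by the inference components
    initialInstanceCount: 1
    modelDataDownloadTimeoutInSeconds: 3600
    containerStartupHealthCheckTimeoutInSeconds: 3600
```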
Endpoint YAML
The following is the code for the Endpoint file:
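A minimal sketch of the Endpoint resource, again assuming the ACK SageMaker controller's v1alpha1 API and placeholder names; it simply binds an endpoint to the endpoint configuration created previously:

```yaml
# Sketch of an Endpoint custom resource referencing the EndpointConfig above.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: ic-demo-endpoint
spec:
  endpointName: ic-demo-endpoint
  endpointConfigName: ic-demo-endpoint-config
```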
Model YAML
The following is the code for the model file:
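The following sketch shows a Model resource for Dolly v2 7B, assuming the ACK SageMaker controller's v1alpha1 API. The container image URI is a placeholder for a Hugging Face Text Generation Inference (TGI) Deep Learning Container; the HF_MODEL_ID environment variable tells the container which model to pull from the Hugging Face Hub. The FLAN-T5 XXL model would get an analogous resource.

```yaml
# Sketch of a Model custom resource; image URI and role ARN are placeholders.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: dolly-v2-7b
spec:
  modelName: dolly-v2-7b
  executionRoleARN: <SAGEMAKER_EXECUTION_ROLE_ARN>
  containers:
  - image: <HUGGING_FACE_TGI_DLC_IMAGE_URI>
    environment:
      HF_MODEL_ID: databricks/dolly-v2-7b
```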
InferenceComponent YAML
In the following YAML files, since the ml.g5.12xlarge instance comes with 4 GPUs, we assign 2 GPUs, 2 CPUs, and 1024 MB of memory to each model:
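The following sketch shows the inference component for the Dolly v2 7B model; the one for FLAN-T5 XXL is analogous, with its own model and component names. The apiVersion and field names are assumptions based on the ACK SageMaker controller's CRDs and the SageMaker CreateInferenceComponent API; verify them against your controller version.

```yaml
# Sketch of an InferenceComponent custom resource; it attaches a model copy
# to the shared endpoint with a reserved slice of the instance's resources.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: dolly-v2-7b-ic
spec:
  inferenceComponentName: dolly-v2-7b-ic
  endpointName: ic-demo-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2   # 2 of the 4 GPUs
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1   # number of copies of this model to deploy
```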
Invoke models
Now you can invoke the models using the following code:
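The invocation code is not reproduced here; the following Python sketch shows the key idea: the request goes to the shared endpoint, and the InferenceComponentName field of invoke_endpoint routes it to a specific model. The endpoint and component names are assumptions matching the example manifests in this post.

```python
import json

# Assumed names, consistent with the example manifests in this post.
ENDPOINT_NAME = "ic-demo-endpoint"

def build_invocation(component_name: str, prompt: str) -> dict:
    """Build keyword arguments for invoke_endpoint; InferenceComponentName
    routes the request to one inference component on the shared endpoint."""
    return {
        "EndpointName": ENDPOINT_NAME,
        "InferenceComponentName": component_name,
        "ContentType": "application/json",
        "Body": json.dumps({"inputs": prompt}),
    }

def invoke(component_name: str, prompt: str) -> str:
    """Send the request (requires AWS credentials and a deployed endpoint)."""
    import boto3  # deferred import so build_invocation can be used offline
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**build_invocation(component_name, prompt))
    return response["Body"].read().decode("utf-8")
```

Calling invoke("dolly-v2-7b-ic", ...) and invoke("flan-t5-xxl-ic", ...) then targets the two models on the same endpoint.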
Update an inference component
To update an existing inference component, update its YAML file and apply it again with kubectl apply -f <yaml file>. The following is an example of an updated file:
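As a sketch (using the assumed names and fields from the earlier InferenceComponent example), the update below scales the Dolly v2 7B component from one copy to two while leaving the rest of the spec unchanged:

```yaml
# Sketch of an updated InferenceComponent; only copyCount has changed.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: dolly-v2-7b-ic
spec:
  inferenceComponentName: dolly-v2-7b-ic
  endpointName: ic-demo-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 2   # scaled up from 1 copy to 2
```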
Delete an inference component
To delete an existing inference component, use the command kubectl delete -f <yaml file>.
Availability and pricing
The new SageMaker inference capabilities are available today in the AWS Regions US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo). For pricing details, visit Amazon SageMaker Pricing.
Conclusion
In this post, we showed how to use SageMaker ACK Operators to deploy SageMaker inference components. Fire up your Kubernetes cluster and deploy your FMs using the new SageMaker inference capabilities today!
About the authors
Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages of their AI/ML and generative AI journey, from those who are just getting started to those who are leading their business with an AI-first strategy.
Amit Arora is an AI and ML Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct professor in the MS Data Science and Analytics program at Georgetown University in Washington, D.C.
Suryansh Singh is a software development engineer at AWS SageMaker working on developing distributed machine learning infrastructure solutions for AWS customers at scale.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making the deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Juan Liu is a Software Development Engineer on the Amazon SageMaker team. His current work focuses on helping developers efficiently host machine learning models and improving inference performance. He is passionate about spatial data analysis and using AI to solve societal problems.