We are pleased to announce a new release of the Amazon SageMaker Operators for Kubernetes using AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources such as buckets, databases, or message queues simply by using the Kubernetes API.
Release v1.2.9 of the SageMaker ACK Operators adds support for inference components, which until now were only available through the SageMaker API and the AWS software development kits (SDKs). Inference components can help you optimize deployment costs and reduce latency. With the new inference component capabilities, you can deploy one or more foundation models (FMs) to the same Amazon SageMaker endpoint and control how many accelerators and how much memory are reserved for each FM. This helps improve resource utilization, reduces model deployment costs by 50% on average, and lets you scale endpoints together with your use cases. For more details, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.
The availability of inference components through the SageMaker controller allows customers using Kubernetes as a control plane to take advantage of inference components while deploying their models in SageMaker.
In this post, we show how to use SageMaker ACK Operators to deploy SageMaker inference components.
How does ACK work?
To demonstrate how ACK works, let's look at an example using Amazon Simple Storage Service (Amazon S3). In the following diagram, Alice is our Kubernetes user. Her application depends on the existence of an S3 bucket named my-bucket.
The workflow consists of the following steps:
- Alice issues a call to kubectl apply, passing a file (called a manifest) that describes a Kubernetes custom resource for her S3 bucket. kubectl apply passes the manifest to the Kubernetes API server running on the Kubernetes controller node.
- The Kubernetes API server receives the manifest describing the S3 bucket and determines whether Alice has permissions to create a custom resource of kind s3.services.k8s.aws/Bucket, and whether the custom resource is properly formatted.
- If Alice is authorized and the custom resource is valid, the Kubernetes API server writes the custom resource to its etcd data store. It then responds to Alice that the custom resource has been created.
- At this point, the ACK service controller for Amazon S3, which runs on a Kubernetes worker node within the context of a regular Kubernetes Pod, is notified that a new custom resource of kind s3.services.k8s.aws/Bucket has been created.
- The ACK service controller for Amazon S3 then communicates with the Amazon S3 API, calling the S3 CreateBucket API to create the bucket in AWS.
- After communicating with the Amazon S3 API, the ACK service controller calls the Kubernetes API server to update the custom resource's status with the information it received from Amazon S3.
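The manifest Alice applies in the first step might look like the following minimal sketch. The apiVersion and spec field are assumptions based on the ACK S3 controller's published custom resource definitions; check the controller's API reference for your installed version.

```yaml
# Sketch of an ACK Bucket custom resource; apiVersion is an assumption
# based on the ACK S3 controller's CRDs.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-bucket
spec:
  name: my-bucket   # the name of the S3 bucket to create in AWS
```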
Key components
The new inference capabilities are built on SageMaker real-time inference endpoints. As before, you create the SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. The model is configured in a new construct, the inference component. Here, you specify the number of accelerators and the amount of memory you want to allocate to each copy of a model, along with the model artifacts, container image, and number of model copies to deploy.
You can use the new inference capabilities from Amazon SageMaker Studio, the SageMaker Python SDK, the AWS SDKs, and the AWS Command Line Interface (AWS CLI). They are also supported by AWS CloudFormation. Now you can also use them with the SageMaker Operators for Kubernetes.
Solution overview
For this demo, we use the SageMaker ACK operator to deploy a copy of the Dolly v2 7B model and a copy of the FLAN-T5 XXL model from the Hugging Face Model Hub to a SageMaker real-time endpoint using the new inference capabilities.
Prerequisites
To follow along, you need a Kubernetes cluster with the SageMaker ACK operator v1.2.9 or above installed. For instructions on how to provision an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon Elastic Compute Cloud (Amazon EC2) managed Linux nodes using eksctl, see Getting started with Amazon EKS – eksctl. For instructions on installing the SageMaker operator, see Machine Learning with the ACK SageMaker Controller.
You need access to accelerated instances (GPUs) to host the LLMs. This solution uses one ml.g5.12xlarge instance; you can check the availability of these instances in your AWS account and request them as needed through a Service Quotas increase request, as shown in the following screenshot.
Create an inference component
To create your inference component, define the EndpointConfig, Endpoint, Model, and InferenceComponent YAML files, similar to the ones shown in this section. Use kubectl apply -f <yaml file> to create the Kubernetes resources.
You can check the status of a resource with kubectl describe <resource-type>; for example, kubectl describe inferencecomponent.
You can also create the inference component without a model resource. See the guide provided in the API documentation for more details.
EndpointConfig YAML
The following is the code for the EndpointConfig file:
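The original file is not reproduced here; the following is a minimal sketch, assuming the ACK SageMaker controller's v1alpha1 API, with placeholder resource names and execution role ARN. Note that the production variant reserves the instance capacity but references no model, because models are attached later through inference components.

```yaml
# Sketch of an EndpointConfig custom resource; field names follow the
# ACK SageMaker controller's CRDs, and names/ARN are placeholders.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: ic-demo-endpoint-config
spec:
  endpointConfigName: ic-demo-endpoint-config
  executionRoleARN: <SAGEMAKER_EXECUTION_ROLE_ARN>
  productionVariants:
  - variantName: AllTraffic
    instanceType: ml.g5.12xlarge        # 4 GPUs, shared by the inference components
    initialInstanceCount: 1
    modelDataDownloadTimeoutInSeconds: 3600
    containerStartupHealthCheckTimeoutInSeconds: 3600
```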
Endpoint YAML
The following is the code for the Endpoint file:
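A minimal sketch of the Endpoint resource, again assuming the ACK SageMaker controller's v1alpha1 API and placeholder names; it simply binds an endpoint to the endpoint configuration created previously:

```yaml
# Sketch of an Endpoint custom resource referencing the EndpointConfig above.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: ic-demo-endpoint
spec:
  endpointName: ic-demo-endpoint
  endpointConfigName: ic-demo-endpoint-config
```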
Model YAML
The following is the code for the model file:
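The following sketch shows a Model resource for Dolly v2 7B, assuming the ACK SageMaker controller's v1alpha1 API. The container image URI is a placeholder for a Hugging Face Text Generation Inference (TGI) Deep Learning Container; the HF_MODEL_ID environment variable tells the container which model to pull from the Hugging Face Hub. The FLAN-T5 XXL model would get an analogous resource.

```yaml
# Sketch of a Model custom resource; image URI and role ARN are placeholders.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: dolly-v2-7b
spec:
  modelName: dolly-v2-7b
  executionRoleARN: <SAGEMAKER_EXECUTION_ROLE_ARN>
  containers:
  - image: <HUGGING_FACE_TGI_DLC_IMAGE_URI>
    environment:
      HF_MODEL_ID: databricks/dolly-v2-7b
```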
InferenceComponent YAML
In the following YAML files, since the ml.g5.12xlarge instance comes with 4 GPUs, we assign 2 GPUs, 2 CPUs, and 1024 MB of memory to each model:
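The following sketch shows the inference component for the Dolly v2 7B model; the one for FLAN-T5 XXL is analogous, with its own model and component names. The apiVersion and field names are assumptions based on the ACK SageMaker controller's CRDs and the SageMaker CreateInferenceComponent API; verify them against your controller version.

```yaml
# Sketch of an InferenceComponent custom resource; it attaches a model copy
# to the shared endpoint with a reserved slice of the instance's resources.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: dolly-v2-7b-ic
spec:
  inferenceComponentName: dolly-v2-7b-ic
  endpointName: ic-demo-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2   # 2 of the 4 GPUs
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1   # number of copies of this model to deploy
```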
Invoke models
Now you can invoke the models using the following code:
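The invocation code is not reproduced here; the following Python sketch shows the key idea: the request goes to the shared endpoint, and the InferenceComponentName field of invoke_endpoint routes it to a specific model. The endpoint and component names are assumptions matching the example manifests in this post.

```python
import json

# Assumed names, consistent with the example manifests in this post.
ENDPOINT_NAME = "ic-demo-endpoint"

def build_invocation(component_name: str, prompt: str) -> dict:
    """Build keyword arguments for invoke_endpoint; InferenceComponentName
    routes the request to one inference component on the shared endpoint."""
    return {
        "EndpointName": ENDPOINT_NAME,
        "InferenceComponentName": component_name,
        "ContentType": "application/json",
        "Body": json.dumps({"inputs": prompt}),
    }

def invoke(component_name: str, prompt: str) -> str:
    """Send the request (requires AWS credentials and a deployed endpoint)."""
    import boto3  # deferred import so build_invocation can be used offline
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**build_invocation(component_name, prompt))
    return response["Body"].read().decode("utf-8")
```

Calling invoke("dolly-v2-7b-ic", ...) and invoke("flan-t5-xxl-ic", ...) then targets the two models on the same endpoint.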
Update an inference component
To update an existing inference component, update its YAML file and apply it again with kubectl apply -f <yaml file>. The following is an example of an updated file:
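As a sketch (using the assumed names and fields from the earlier InferenceComponent example), the update below scales the Dolly v2 7B component from one copy to two while leaving the rest of the spec unchanged:

```yaml
# Sketch of an updated InferenceComponent; only copyCount has changed.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: dolly-v2-7b-ic
spec:
  inferenceComponentName: dolly-v2-7b-ic
  endpointName: ic-demo-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 2   # scaled up from 1 copy to 2
```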
Delete an inference component
To delete an existing inference component, use the command kubectl delete -f <yaml file>.
Availability and pricing
The new SageMaker inference capabilities are available today in the AWS Regions US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo). For pricing details, visit Amazon SageMaker Pricing.
Conclusion
In this post, we showed how to use SageMaker ACK Operators to deploy SageMaker inference components. Fire up your Kubernetes cluster and deploy your FMs using the new SageMaker inference capabilities today!
About the authors
Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages of their AI/ML and generative AI journey, from those who are just getting started to those who are leading their business with an AI-first strategy.
Amit Arora is an AI and ML Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct professor in the MS Data Science and Analytics program at Georgetown University in Washington, D.C.
Suryansh Singh is a software development engineer at AWS SageMaker working on developing distributed machine learning infrastructure solutions for AWS customers at scale.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making the deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Juan Liu is a Software Development Engineer on the Amazon SageMaker team. His current work focuses on helping developers efficiently host machine learning models and improving inference performance. He is passionate about spatial data analysis and using AI to solve societal problems.