Kubernetes is a popular orchestration platform for managing containers. Its scalability and load balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. DevOps engineers often use Kubernetes to manage and scale ML applications, but before an ML model is available, it must be trained and evaluated, and if the quality of the resulting model is satisfactory, it must be uploaded to a model registry.
Amazon SageMaker offers capabilities that remove the undifferentiated heavy lifting of building and deploying ML models. SageMaker simplifies the process of managing dependencies, container images, auto scaling, and monitoring. Specifically for the model building stage, Amazon SageMaker Pipelines automates the process by managing the infrastructure and resources required to process data, train models, and run evaluation tests.
A challenge for DevOps engineers is the added complexity of using Kubernetes to manage the deployment stage while relying on other tools (such as the AWS SDK or AWS CloudFormation) to manage the model building workflow. An alternative that simplifies this process is to use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. ACK allows you to take advantage of managed model building pipelines without needing to define resources outside of your Kubernetes cluster.
In this post, we present an example to help DevOps engineers manage the entire ML lifecycle, including training and inference, using the same toolkit.
Solution overview
We consider a use case in which an ML engineer configures a SageMaker model building pipeline using a Jupyter notebook. This configuration takes the form of a directed acyclic graph (DAG) represented as a JSON pipeline definition. The JSON document can be stored and versioned in an Amazon Simple Storage Service (Amazon S3) bucket. If encryption is required, it can be implemented using an AWS Key Management Service (AWS KMS) managed key for Amazon S3. A DevOps engineer with access to fetch this definition file from Amazon S3 can load the pipeline definition into an ACK service controller for SageMaker, which runs as part of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The DevOps engineer can then use the Kubernetes APIs provided by ACK to submit the pipeline definition and initiate one or more pipeline executions in SageMaker. This entire workflow is shown in the following solution diagram.
Prerequisites
To follow along with this walkthrough, you need the following prerequisites:
- An EKS cluster where the ML pipeline will be built.
- A user with access to an AWS Identity and Access Management (IAM) role that has IAM permissions (iam:CreateRole, iam:AttachRolePolicy, and iam:PutRolePolicy) to allow creating roles and attaching policies to roles.
- The following command line tools, installed on the local machine or in a cloud-based development environment, are used to access the Kubernetes cluster:
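At a minimum, this walkthrough relies on kubectl (to apply the manifests in the following sections), Helm (to install the controller), and jq (for the optional definition-to-string conversion shown later). The following is a quick sanity check, as a sketch; add any other tools your environment requires:

```bash
# Verify the command line tools used in this post are available
kubectl version --client
helm version
jq --version
```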
Install the SageMaker ACK service controller
The SageMaker ACK Service Controller makes it easy for DevOps engineers to use Kubernetes as a control plane to create and manage ML pipelines. To install the controller on your EKS cluster, complete the following steps:
- Configure IAM permissions to ensure that the controller has access to the appropriate AWS resources.
- Install the controller using a SageMaker Helm chart to make it available on the client machine.
The following tutorial provides step-by-step instructions with the commands required to install the ACK service controller for SageMaker.
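As an illustration, a Helm-based install typically looks like the following sketch. The chart location follows the public ACK chart repository convention and the release version is a placeholder; follow the tutorial above for the exact, current commands, and make sure the IAM permissions from the previous step are in place first:

```bash
# Placeholders: pick the chart version and AWS Region that apply to your environment
export SERVICE=sagemaker
export AWS_REGION=us-east-1
export RELEASE_VERSION=<chart-version>

# Install the SageMaker ACK service controller into the ack-system namespace
helm install -n ack-system --create-namespace ack-$SERVICE-controller \
  oci://public.ecr.aws/aws-controllers-k8s/$SERVICE-chart \
  --version $RELEASE_VERSION \
  --set aws.region=$AWS_REGION
```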
Generate a pipeline JSON definition
In most companies, ML engineers are responsible for building the ML pipelines in their organization. They often work with DevOps engineers to operate those pipelines. In SageMaker, ML engineers can use the SageMaker Python SDK to generate a pipeline definition in JSON format. A SageMaker pipeline definition must follow the provided schema, which includes base images, dependencies, steps, and instance types and sizes needed to fully define the pipeline. The DevOps engineer then retrieves this definition to deploy and maintain the infrastructure needed for the pipeline.
Below is an example pipeline definition with a training step:
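The following is a minimal sketch of such a definition. The schema version shown is the one commonly used by the SageMaker Python SDK, and the training step's Arguments mirror the SageMaker CreateTrainingJob API; the container image URI, role ARN, and bucket names are placeholders to replace with your own values:

```json
{
  "Version": "2020-12-01",
  "Metadata": {},
  "Parameters": [],
  "Steps": [
    {
      "Name": "TrainModel",
      "Type": "Training",
      "Arguments": {
        "AlgorithmSpecification": {
          "TrainingImage": "<account-id>.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
          "TrainingInputMode": "File"
        },
        "RoleArn": "arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        "InputDataConfig": [
          {
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {
              "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-ml-artifacts-bucket/train/",
                "S3DataDistributionType": "FullyReplicated"
              }
            }
          }
        ],
        "OutputDataConfig": {
          "S3OutputPath": "s3://example-ml-artifacts-bucket/output/"
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m5.xlarge",
          "VolumeSizeInGB": 30
        },
        "StoppingCondition": {
          "MaxRuntimeInSeconds": 3600
        }
      }
    }
  ]
}
```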
With SageMaker, ML model artifacts and other system artifacts are encrypted in transit and at rest. SageMaker encrypts them by default using the AWS managed key for Amazon S3. Optionally, you can specify a custom key using the KmsKeyId property of the OutputDataConfig argument. For more information about how SageMaker protects data, see Data Protection in Amazon SageMaker.
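For example, within the training step's Arguments, the output configuration could look like the following sketch (the bucket name and KMS key ARN are placeholders):

```json
"OutputDataConfig": {
  "S3OutputPath": "s3://example-ml-artifacts-bucket/output/",
  "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
}
```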
Additionally, we recommend restricting access to pipeline artifacts, such as model output and training data, to a specific set of IAM roles created for data scientists and ML engineers. This can be achieved by attaching an appropriate bucket policy. For more information on best practices for securing data in Amazon S3, see Top 10 security best practices for securing data in Amazon S3.
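As a sketch of that idea, the following bucket policy denies access to the artifact bucket to every principal except a set of hypothetical roles (the bucket name and role ARNs are placeholders; be careful with broad Deny statements, because they also lock out administrators who are not on the allow list):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictPipelineArtifactAccess",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-ml-artifacts-bucket",
        "arn:aws:s3:::example-ml-artifacts-bucket/*"
      ],
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::111122223333:role/DataScientistRole",
            "arn:aws:iam::111122223333:role/MLEngineerRole",
            "arn:aws:iam::111122223333:role/SageMakerExecutionRole"
          ]
        }
      }
    }
  ]
}
```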
Create and submit a pipeline YAML specification
In the Kubernetes world, objects are the persistent entities in the Kubernetes cluster that are used to represent the state of your cluster. When you create an object in Kubernetes, you must provide the object specification that describes its desired state, as well as some basic information about the object (such as a name). Then, using tools like kubectl, you provide the information in a manifest file in YAML (or JSON) format to communicate with the Kubernetes API.
In the Kubernetes YAML specification for a SageMaker pipeline shown later in this section, DevOps engineers need to modify the .spec.pipelineDefinition key in the file and add the pipeline JSON definition provided by the ML engineer. They then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker. There are two ways to submit a pipeline YAML specification:
- Pass the inline pipeline definition as a JSON object to the pipeline YAML specification.
- Convert the JSON pipeline definition to string format using the jq command-line utility. For example, you can use the following command to convert the pipeline definition to a JSON-encoded string:
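A minimal sketch, assuming the ML engineer's definition was saved locally as pipeline.json (the file name is a placeholder):

```bash
# Emit the pipeline definition as a single JSON-encoded string
jq tojson < pipeline.json
```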
In this post, we use the first option and prepare the YAML specification (my-pipeline.yaml
) as follows:
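The following is a sketch of what my-pipeline.yaml could look like. The apiVersion, kind, and spec field names reflect the ACK SageMaker controller's Pipeline custom resource as we understand it; verify them against the CRDs installed in your cluster (for example, with kubectl explain pipeline.spec). The role ARN is a placeholder, and the full pipeline JSON definition from the previous section goes under pipelineDefinition:

```yaml
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Pipeline
metadata:
  name: my-kubernetes-pipeline
  namespace: default
spec:
  pipelineName: my-kubernetes-pipeline
  pipelineDisplayName: my-kubernetes-pipeline
  roleARN: arn:aws:iam::111122223333:role/SageMakerExecutionRole  # placeholder execution role
  # Paste the full pipeline JSON definition (including the training step) below
  pipelineDefinition: |
    {
      "Version": "2020-12-01",
      "Metadata": {},
      "Parameters": [],
      "Steps": []
    }
```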
Submit the pipeline to SageMaker
To submit your prepared pipeline specification, apply the specification to your Kubernetes cluster as follows:
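Assuming the file name from the previous section:

```bash
kubectl apply -f my-pipeline.yaml
```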
Create and submit a pipeline execution YAML specification
Refer to the following Kubernetes YAML specification for a SageMaker pipeline execution. Prepare the pipeline execution YAML specification (pipeline-execution.yaml
) as follows:
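A sketch of pipeline-execution.yaml, again using field names from the ACK SageMaker controller's PipelineExecution custom resource as we understand it (verify with kubectl explain pipelineexecution.spec); pipelineName must match the pipeline submitted earlier:

```yaml
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: PipelineExecution
metadata:
  name: my-kubernetes-pipeline-execution
  namespace: default
spec:
  # Name of the SageMaker pipeline created by the Pipeline resource above
  pipelineName: my-kubernetes-pipeline
  pipelineExecutionDescription: "Pipeline execution started from the ACK controller"
```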
To start a pipeline execution, use the following code:
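For example:

```bash
kubectl apply -f pipeline-execution.yaml
```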
Review and troubleshoot pipeline execution
To list all the pipelines created using the ACK controller, use the following command:
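Assuming the lowercase resource names exposed by the ACK SageMaker CRDs:

```bash
# If another CRD named "pipelines" exists in the cluster (for example, Tekton),
# use the fully qualified name: kubectl get pipelines.sagemaker.services.k8s.aws
kubectl get pipelines
```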
To list all pipeline executions, use the following command:
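Similarly:

```bash
kubectl get pipelineexecutions
```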
To get more details about the pipeline after submitting it, such as checking the status, errors, or parameters of the pipeline, use the following command:
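For example, using the pipeline name from the earlier sketch:

```bash
kubectl describe pipeline my-kubernetes-pipeline
```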
To troubleshoot a pipeline execution by reviewing more details about the execution, use the following command:
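For example:

```bash
kubectl describe pipelineexecution my-kubernetes-pipeline-execution
```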
Clean up
Use the following command to delete any pipelines you have created:
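For example, using the names from the earlier sketches:

```bash
kubectl delete pipeline my-kubernetes-pipeline
```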
Use the following command to cancel any pipeline execution you have started:
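For example:

```bash
kubectl delete pipelineexecution my-kubernetes-pipeline-execution
```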
Conclusion
In this post, we present an example of how ML engineers familiar with Jupyter notebooks and SageMaker environments can work efficiently with DevOps engineers familiar with Kubernetes and related tools to design and maintain an ML pipeline with the right infrastructure for their organization. This allows DevOps engineers to manage all steps of the ML lifecycle with the same toolset and environment they are accustomed to, enabling organizations to innovate faster and more efficiently.
Explore the GitHub repository for ACK and the SageMaker Controller to start managing your ML operations with Kubernetes.
About the authors
Pratik Yeole is a Senior Solutions Architect working with global customers and helping them build value-driven solutions on AWS. He has domain expertise in Containers and MLOps. Outside of work, he enjoys time with friends, family, music, and cricket.
Felipe Lopez is a Senior Solutions Architect specializing in AI/ML at AWS. Prior to joining AWS, Felipe worked at GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.