Amazon SageMaker has redesigned its Python SDK to provide a unified object-oriented interface that makes it straightforward to interact with SageMaker services. The new SDK is designed with a tiered user experience in mind: the new lower-level SDK (SageMaker Core) provides access to the full range of SageMaker features and settings, allowing greater flexibility and control for ML engineers, while the top-level abstraction layer is designed for data scientists with limited AWS experience and offers a simplified interface that hides complex infrastructure details.
In this two-part series, we introduce the SageMaker Python SDK abstract layer that allows you to train and deploy machine learning (ML) models using the new ModelTrainer and improved ModelBuilder classes.
In this post, we focus on the ModelTrainer class, which simplifies the training experience. The ModelTrainer class provides significant improvements over the current Estimator class, which are discussed in detail in this post. We show you how to use the ModelTrainer class to train your ML models, including running distributed training using a custom script or container. In Part 2, we show you how to build a model and deploy it to a SageMaker endpoint using the improved ModelBuilder class.
Benefits of the ModelTrainer class
The new ModelTrainer class has been designed to address the usability challenges associated with the Estimator class. Moving forward, ModelTrainer will be the preferred approach for model training, bringing significant improvements that greatly enhance the user experience. This evolution marks a step toward achieving a best-in-class developer experience for model training. The following are the key benefits:
- Improved intuitiveness – The ModelTrainer class reduces complexity by consolidating configurations into just a few core parameters. This streamlining minimizes cognitive overload, allowing users to focus on model training rather than configuration intricacies. Additionally, it employs intuitive config classes for straightforward interactions with the platform.
- Simplified script mode and BYOC – The transition from local development to cloud training is now seamless. ModelTrainer automatically maps source code, data paths, and parameter specifications to the remote execution environment, eliminating the need for special handshakes or complex configuration processes.
- Distributed training made easy – The ModelTrainer class provides improved flexibility for users to specify custom commands and distributed training strategies, allowing you to directly supply the exact command you want to run in your container through the command parameter in the SourceCode class. This approach decouples distributed training strategies from the training toolkit and framework-specific estimators.
- Improved hyperparameter contracts – The ModelTrainer class passes the training job's hyperparameters as a single environment variable, allowing you to load the hyperparameters using a single SM_HPS variable.
To explain each of these benefits in more detail, we demonstrate with examples in the following sections, and finally show you how to set up and run distributed training for the Meta Llama 3.1 8B model using the new ModelTrainer class.
Launch a training job using the ModelTrainer class
The ModelTrainer class simplifies the experience by letting you customize the training job, including providing a custom script, directly providing a command to run the training job, supporting local mode, and much more. At a minimum, you can start a SageMaker training job in script mode by providing just two parameters: SourceCode and the URI of the training image.
The following example illustrates how you can start a training job with your own custom script by providing just the script, the URI of the training image (in this case, PyTorch), and an optional requirements file. Additional parameters, such as the instance type and instance size, are automatically set by the SDK to preset defaults, and parameters such as the AWS Identity and Access Management (IAM) role and SageMaker session are automatically detected from the current session and user credentials. Administrators and users can also override the defaults using the SDK defaults configuration file. For the detailed list of preset values, see the SDK documentation.
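A minimal sketch of such a job (module paths follow the new SDK's ModelTrainer interface; the image URI, directory, and script names are placeholders you would replace with your own):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

# Placeholder training image URI (use a PyTorch training image valid for your Region)
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-cpu-py310"

# Point the job at your local script directory and entry point
source_code = SourceCode(
    source_dir="basic-script-mode",    # placeholder directory containing your code
    requirements="requirements.txt",   # optional requirements file
    entry_script="custom_script.py",   # placeholder training script
)

# Instance type/count fall back to SDK defaults; the IAM role and
# SageMaker session are detected from the current credentials
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
)

model_trainer.train(wait=False)
```

Because SourceCode is a standalone config object, the same object can be reused across several ModelTrainer instances.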
With specifically designed configurations, you can now reuse these objects to create multiple training jobs with different hyperparameters, for example, without having to redefine all the parameters.
Run the job locally for experimentation
To run the preceding training job locally, you can simply set the training_mode parameter as shown in the following code:
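A sketch of what that looks like (the Mode enum's import path is an assumption based on the SDK's module layout; image URI and script names are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode
from sagemaker.modules.train.model_trainer import Mode  # assumed import path

source_code = SourceCode(
    source_dir="basic-script-mode",   # placeholder directory
    entry_script="custom_script.py",  # placeholder training script
)

# LOCAL_CONTAINER runs the training image on your machine instead of on SageMaker
model_trainer = ModelTrainer(
    training_image="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-cpu-py310",
    source_code=source_code,
    training_mode=Mode.LOCAL_CONTAINER,
)

model_trainer.train()
```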
The training job runs locally because training_mode is set to Mode.LOCAL_CONTAINER. If not explicitly set, ModelTrainer runs a remote SageMaker training job by default; you can also request this behavior explicitly by setting the value to Mode.SAGEMAKER_TRAINING_JOB. For a complete list of available configurations, including compute and networking, see the SDK documentation.
Read the hyperparameters in your custom script
ModelTrainer supports several ways to read the hyperparameters passed to a training job. In addition to the existing support for reading hyperparameters as command line arguments in your custom script, ModelTrainer also supports reading hyperparameters as individual environment variables, prefixed with SM_HP_, or as a single environment variable dictionary, SM_HPS.
Suppose the following hyperparameters are passed to the training job:
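For example (illustrative names and values):

```python
# Hyperparameters passed when constructing the trainer, for example:
# ModelTrainer(..., hyperparameters=hyperparameters)
hyperparameters = {
    "learning_rate": 3e-5,
    "epochs": 10,
    "batch_size": 32,
}
```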
You have the following options:
- Option 1 – Load the hyperparameters into a single JSON dictionary using the SM_HPS environment variable in your custom script:
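A sketch of this option; the first lines only simulate the variable SageMaker would inject, so the snippet runs outside a training job:

```python
import json
import os

# Simulation only: inside a real SageMaker training job, SM_HPS is already set
os.environ.setdefault(
    "SM_HPS", '{"learning_rate": "3e-5", "epochs": "10", "batch_size": "32"}'
)

# Load every hyperparameter from the single SM_HPS JSON dictionary
hps = json.loads(os.environ["SM_HPS"])
learning_rate = float(hps["learning_rate"])
epochs = int(hps["epochs"])
batch_size = int(hps["batch_size"])
```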
- Option 2 – Read the hyperparameters as individual environment variables, prefixed with SM_HP_, as shown in the following code (you need to explicitly specify the correct input type for these variables):
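A sketch of this option; the setdefault calls only simulate the variables SageMaker would inject:

```python
import os

# Simulation only: inside a real training job these variables are already set
os.environ.setdefault("SM_HP_LEARNING_RATE", "3e-5")
os.environ.setdefault("SM_HP_EPOCHS", "10")
os.environ.setdefault("SM_HP_BATCH_SIZE", "32")

# Each value arrives as a string, so cast to the type you expect
learning_rate = float(os.environ["SM_HP_LEARNING_RATE"])
epochs = int(os.environ["SM_HP_EPOCHS"])
batch_size = int(os.environ["SM_HP_BATCH_SIZE"])
```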
- Option 3 – Read the hyperparameters as command line arguments using parse_args():
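A sketch using the standard library's argparse; in a real job SageMaker supplies the arguments, so here they are passed explicitly for illustration:

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed to the script as command line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=32)
    return parser.parse_args(argv)


# Simulated invocation; SageMaker would supply these arguments to your script
args = parse_args(["--learning_rate", "3e-5", "--epochs", "10", "--batch_size", "32"])
```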
Run distributed training jobs
SageMaker supports distributed training for deep learning tasks, such as natural language processing and computer vision, so you can run scalable data parallel and model parallel jobs. This is usually achieved by providing the correct set of parameters when using an Estimator. For example, to use torchrun, you would define the distribution parameter in the PyTorch Estimator and set it to "torch_distributed": {"enabled": True}.
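With the existing Estimator, that looks roughly like the following (the entry point, role ARN, versions, and instance settings are placeholders):

```python
from sagemaker.pytorch import PyTorch

# Classic Estimator approach: distribution is a framework-specific dictionary
estimator = PyTorch(
    entry_point="train.py",                                # placeholder script
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder role ARN
    framework_version="2.2.0",                             # placeholder version
    py_version="py310",
    instance_type="ml.p4d.24xlarge",                       # placeholder instance
    instance_count=2,
    distribution={"torch_distributed": {"enabled": True}},
)

estimator.fit(wait=False)
```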
The ModelTrainer class provides improved flexibility for users to specify custom commands directly through the command parameter in the SourceCode class, and supports torchrun, torchrun smp, and mpirun strategies. This capability is particularly useful when you need to launch a job with a custom launcher command that is not supported by the training toolkit.
In the following example, we show how to fine-tune the latest Meta Llama 3.1 8B model using the default launch script with torchrun on a custom dataset that has been preprocessed and saved to an Amazon Simple Storage Service (Amazon S3) location:
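A sketch of such a job (the Torchrun config class and the distributed and compute parameter names are assumptions based on the SDK's config-class pattern; directories, scripts, image URI, bucket, and instance settings are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, InputData, SourceCode
from sagemaker.modules.distributed import Torchrun  # assumed import path

source_code = SourceCode(
    source_dir="llama-finetune",       # placeholder directory with the launch script
    requirements="requirements.txt",
    entry_script="fine_tune.py",       # placeholder entry point
)

model_trainer = ModelTrainer(
    training_image="<your-pytorch-training-image-uri>",
    source_code=source_code,
    distributed=Torchrun(),            # let the SDK build the torchrun launch command
    compute=Compute(instance_type="ml.g5.48xlarge", instance_count=2),
)

# Preprocessed dataset already staged in Amazon S3
train_data = InputData(
    channel_name="train",
    data_source="s3://<your-bucket>/llama-3-1-8b/train/",
)
model_trainer.train(input_data_config=[train_data], wait=False)
```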
If you would like to customize your torchrun launch script, you can also directly provide the commands using the command parameter:
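For instance (a sketch; the command string, directory, script name, and image URI are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

# Supply the exact launch command to run inside the container
source_code = SourceCode(
    source_dir="llama-finetune",  # placeholder directory
    command="torchrun --nnodes 2 --nproc_per_node 8 fine_tune.py",  # placeholder
)

model_trainer = ModelTrainer(
    training_image="<your-pytorch-training-image-uri>",
    source_code=source_code,
)

model_trainer.train(wait=False)
```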
For more examples and end-to-end ML workflows using SageMaker ModelTrainer, see the <a target="_blank" href="https://github.com/aws/amazon-sagemaker-examples/tree/default/build_and_train_models/sm-model_trainer" rel="noopener">GitHub repository</a>.
Conclusion
The newly released SageMaker ModelTrainer class simplifies the user experience by reducing the number of parameters, introducing intuitive configurations, and supporting complex configurations such as bringing your own container and running distributed training. Data scientists can also seamlessly transition from local training to remote training to multi-node training using ModelTrainer.
We recommend that you try out the ModelTrainer class by referring to the SDK documentation and sample notebooks in the <a target="_blank" href="https://github.com/aws/amazon-sagemaker-examples/tree/default/build_and_train_models/sm-model_trainer" rel="noopener">GitHub repository</a>. The ModelTrainer class is available from SageMaker SDK v2.x onwards, at no additional charge. In Part 2 of this series, we show you how to build a model and deploy it to a SageMaker endpoint using the improved ModelBuilder class.
About the authors
Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.
Shweta Singh is a Senior Product Manager on the Amazon SageMaker Machine Learning (ML) platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles at Amazon for over 5 years. She has a bachelor's degree in Computer Engineering and a Master of Science in Financial Engineering, both from New York University.