Amazon SageMaker has redesigned its Python SDK to provide a unified object-oriented interface that makes it straightforward to interact with SageMaker services. The new SDK is designed with a tiered user experience in mind: the new lower-level SDK (SageMaker Core) provides access to the full range of SageMaker features and settings, allowing greater flexibility and control for ML engineers, while the top-level abstraction layer is designed for data scientists with limited AWS experience and offers a simplified interface that hides complex infrastructure details.
In this two-part series, we introduce the SageMaker Python SDK abstract layer that allows you to train and deploy machine learning (ML) models using the new ModelTrainer and improved ModelBuilder classes.
In this post, we focus on the ModelTrainer class, which simplifies the training experience. The ModelTrainer class provides significant improvements over the current Estimator class, which are discussed in detail in this post. We show you how to use the ModelTrainer class to train your ML models, including running distributed training using a custom script or container. In Part 2, we show you how to build a model and deploy it to a SageMaker endpoint using the improved ModelBuilder class.
Benefits of the ModelTrainer class
The new ModelTrainer class has been designed to address the usability challenges associated with the Estimator class. Moving forward, ModelTrainer will be the preferred approach for model training, bringing significant improvements that greatly enhance the user experience. This evolution marks a step toward achieving a best-in-class developer experience for model training. The following are the key benefits:
- Improved intuitiveness – The ModelTrainer class reduces complexity by consolidating configurations into just a few core parameters. This streamlining minimizes cognitive overload, allowing users to focus on model training rather than configuration intricacies. Additionally, it employs intuitive config classes for straightforward interactions with the platform.
- Simplified script mode and BYOC – The transition from local development to cloud training is now seamless. ModelTrainer automatically maps source code, data paths, and parameter specifications to the remote execution environment, eliminating the need for special handshakes or complex configuration processes.
- Distributed training made easy – The ModelTrainer class provides improved flexibility for users to specify custom commands and distributed training strategies, allowing you to directly supply the exact command you want to run in your container through the command parameter in the SourceCode class. This approach decouples distributed training strategies from the training toolkit and framework-specific estimators.
- Improved hyperparameter contracts – The ModelTrainer class passes the training job's hyperparameters as a single environment variable, allowing you to load the hyperparameters using a single SM_HPS variable.
To explain each of these benefits in more detail, we demonstrate with examples in the following sections, and finally show you how to set up and run distributed training for the Meta Llama 3.1 8B model using the new ModelTrainer class.
Launch a training job using the ModelTrainer class
The ModelTrainer class simplifies the experience by letting you customize the training job, including providing a custom script, directly providing a command to run the training job, supporting local mode, and much more. At a minimum, you can start a SageMaker training job in script mode by providing just two parameters: SourceCode and the URI of the training image.
The following example illustrates how you can start a training job with your own custom script by providing just the script, the URI of the training image (in this case, PyTorch), and an optional requirements file. Additional parameters, such as the instance type and instance size, are automatically set by the SDK to preset defaults, and parameters such as the AWS Identity and Access Management (IAM) role and SageMaker session are automatically detected from the current session and user credentials. Administrators and users can also override the defaults using the SDK defaults configuration file. For the detailed list of preset values, see the SDK documentation.
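A minimal sketch of such a job (module paths follow the new SDK's ModelTrainer interface; the image URI, directory, and script names are placeholders you would replace with your own):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

# Placeholder training image URI (use a PyTorch training image valid for your Region)
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-cpu-py310"

# Point the job at your local script directory and entry point
source_code = SourceCode(
    source_dir="basic-script-mode",    # placeholder directory containing your code
    requirements="requirements.txt",   # optional requirements file
    entry_script="custom_script.py",   # placeholder training script
)

# Instance type/count fall back to SDK defaults; the IAM role and
# SageMaker session are detected from the current credentials
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
)

model_trainer.train(wait=False)
```

Because SourceCode is a standalone config object, the same object can be reused across several ModelTrainer instances.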
With specifically designed configurations, you can now reuse these objects to create multiple training jobs with different hyperparameters, for example, without having to redefine all the parameters.
Run the job locally for experimentation
To run the preceding training job locally, you can simply set the training_mode parameter as shown in the following code:
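A sketch of what that looks like (the Mode enum's import path is an assumption based on the SDK's module layout; image URI and script names are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode
from sagemaker.modules.train.model_trainer import Mode  # assumed import path

source_code = SourceCode(
    source_dir="basic-script-mode",   # placeholder directory
    entry_script="custom_script.py",  # placeholder training script
)

# LOCAL_CONTAINER runs the training image on your machine instead of on SageMaker
model_trainer = ModelTrainer(
    training_image="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-cpu-py310",
    source_code=source_code,
    training_mode=Mode.LOCAL_CONTAINER,
)

model_trainer.train()
```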
The training job runs locally because training_mode is set to Mode.LOCAL_CONTAINER. If not explicitly set, ModelTrainer runs a remote SageMaker training job by default; you can also request this behavior explicitly by setting the value to Mode.SAGEMAKER_TRAINING_JOB. For a complete list of available configurations, including compute and networking, see the SDK documentation.
Read the hyperparameters in your custom script
ModelTrainer supports several ways to read the hyperparameters passed to a training job. In addition to the existing support for reading hyperparameters as command line arguments in your custom script, ModelTrainer also supports reading hyperparameters as individual environment variables, prefixed with SM_HP_, or as a single environment variable dictionary, SM_HPS.
Suppose the following hyperparameters are passed to the training job:
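For example (illustrative names and values):

```python
# Hyperparameters passed when constructing the trainer, for example:
# ModelTrainer(..., hyperparameters=hyperparameters)
hyperparameters = {
    "learning_rate": 3e-5,
    "epochs": 10,
    "batch_size": 32,
}
```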
You have the following options:
- Option 1 – Load the hyperparameters into a single JSON dictionary using the SM_HPS environment variable in your custom script:
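A sketch of this option; the first lines only simulate the variable SageMaker would inject, so the snippet runs outside a training job:

```python
import json
import os

# Simulation only: inside a real SageMaker training job, SM_HPS is already set
os.environ.setdefault(
    "SM_HPS", '{"learning_rate": "3e-5", "epochs": "10", "batch_size": "32"}'
)

# Load every hyperparameter from the single SM_HPS JSON dictionary
hps = json.loads(os.environ["SM_HPS"])
learning_rate = float(hps["learning_rate"])
epochs = int(hps["epochs"])
batch_size = int(hps["batch_size"])
```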
- Option 2 – Read the hyperparameters as individual environment variables, prefixed with SM_HP_, as shown in the following code (you need to explicitly specify the correct input type for these variables):
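A sketch of this option; the setdefault calls only simulate the variables SageMaker would inject:

```python
import os

# Simulation only: inside a real training job these variables are already set
os.environ.setdefault("SM_HP_LEARNING_RATE", "3e-5")
os.environ.setdefault("SM_HP_EPOCHS", "10")
os.environ.setdefault("SM_HP_BATCH_SIZE", "32")

# Each value arrives as a string, so cast to the type you expect
learning_rate = float(os.environ["SM_HP_LEARNING_RATE"])
epochs = int(os.environ["SM_HP_EPOCHS"])
batch_size = int(os.environ["SM_HP_BATCH_SIZE"])
```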
- Option 3 – Read the hyperparameters as command line arguments using parse_args():
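A sketch using the standard library's argparse; in a real job SageMaker supplies the arguments, so here they are passed explicitly for illustration:

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed to the script as command line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=32)
    return parser.parse_args(argv)


# Simulated invocation; SageMaker would supply these arguments to your script
args = parse_args(["--learning_rate", "3e-5", "--epochs", "10", "--batch_size", "32"])
```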
Run distributed training jobs
SageMaker supports distributed training for deep learning tasks, such as natural language processing and computer vision, so you can run scalable data parallel and model parallel jobs. This is usually achieved by providing the correct set of parameters when using an Estimator. For example, to use torchrun, you would define the distribution parameter in the PyTorch Estimator and set it to "torch_distributed": {"enabled": True}.
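With the existing Estimator, that looks roughly like the following (the entry point, role ARN, versions, and instance settings are placeholders):

```python
from sagemaker.pytorch import PyTorch

# Classic Estimator approach: distribution is a framework-specific dictionary
estimator = PyTorch(
    entry_point="train.py",                                # placeholder script
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder role ARN
    framework_version="2.2.0",                             # placeholder version
    py_version="py310",
    instance_type="ml.p4d.24xlarge",                       # placeholder instance
    instance_count=2,
    distribution={"torch_distributed": {"enabled": True}},
)

estimator.fit(wait=False)
```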
The ModelTrainer class provides improved flexibility for users to specify custom commands directly through the command parameter in the SourceCode class, and supports torchrun, torchrun smp, and mpirun strategies. This capability is particularly useful when you need to launch a job with a custom launcher command that is not supported by the training toolkit.
In the following example, we show how to fine-tune the latest Meta Llama 3.1 8B model using the default launch script with torchrun on a custom dataset that has been preprocessed and saved to an Amazon Simple Storage Service (Amazon S3) location:
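A sketch of such a job (the Torchrun config class and the distributed and compute parameter names are assumptions based on the SDK's config-class pattern; directories, scripts, image URI, bucket, and instance settings are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, InputData, SourceCode
from sagemaker.modules.distributed import Torchrun  # assumed import path

source_code = SourceCode(
    source_dir="llama-finetune",       # placeholder directory with the launch script
    requirements="requirements.txt",
    entry_script="fine_tune.py",       # placeholder entry point
)

model_trainer = ModelTrainer(
    training_image="<your-pytorch-training-image-uri>",
    source_code=source_code,
    distributed=Torchrun(),            # let the SDK build the torchrun launch command
    compute=Compute(instance_type="ml.g5.48xlarge", instance_count=2),
)

# Preprocessed dataset already staged in Amazon S3
train_data = InputData(
    channel_name="train",
    data_source="s3://<your-bucket>/llama-3-1-8b/train/",
)
model_trainer.train(input_data_config=[train_data], wait=False)
```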
If you would like to customize your torchrun launch script, you can also directly provide the commands using the command parameter:
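For instance (a sketch; the command string, directory, script name, and image URI are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

# Supply the exact launch command to run inside the container
source_code = SourceCode(
    source_dir="llama-finetune",  # placeholder directory
    command="torchrun --nnodes 2 --nproc_per_node 8 fine_tune.py",  # placeholder
)

model_trainer = ModelTrainer(
    training_image="<your-pytorch-training-image-uri>",
    source_code=source_code,
)

model_trainer.train(wait=False)
```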
For more examples and end-to-end ML workflows using SageMaker ModelTrainer, see the <a target="_blank" href="https://github.com/aws/amazon-sagemaker-examples/tree/default/build_and_train_models/sm-model_trainer" rel="noopener">GitHub repository</a>.
Conclusion
The newly released SageMaker ModelTrainer class simplifies the user experience by reducing the number of parameters, introducing intuitive configurations, and supporting complex configurations such as bringing your own container and running distributed training. Data scientists can also seamlessly transition from local training to remote training to multi-node training using ModelTrainer.
We recommend that you try out the ModelTrainer class by referring to the SDK documentation and sample notebooks in the <a target="_blank" href="https://github.com/aws/amazon-sagemaker-examples/tree/default/build_and_train_models/sm-model_trainer" rel="noopener">GitHub repository</a>. The ModelTrainer class is available from SageMaker SDK v2.x onwards, at no additional charge. In Part 2 of this series, we show you how to build a model and deploy it to a SageMaker endpoint using the improved ModelBuilder class.
About the authors
Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.
Shweta Singh is a Senior Product Manager on the Amazon SageMaker Machine Learning (ML) platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles at Amazon for over 5 years. She has a bachelor's degree in Computer Engineering and a Master of Science in Financial Engineering, both from New York University.