To remain competitive, companies across industries are using foundation models (FMs) to transform their applications. Although FMs offer impressive out-of-the-box capabilities, achieving a true competitive advantage often requires deep model customization through pre-training or fine-tuning. However, these approaches demand advanced AI expertise, high-performance compute, and fast access to storage, and they can be prohibitively expensive for many organizations.
In this post, we explore how organizations can address these challenges and cost-effectively customize and scale FMs using AWS managed services such as Amazon SageMaker Training Jobs and Amazon SageMaker HyperPod. We discuss how these tools help organizations optimize compute resources and reduce the complexity of model training and fine-tuning, and how you can make an informed decision about which Amazon SageMaker service best fits your business needs and requirements.
Business challenge
Today, businesses face numerous challenges in effectively implementing and managing machine learning (ML) initiatives. These challenges include scaling operations to handle rapidly growing data and models, accelerating the development of ML solutions, and managing complex infrastructure without diverting focus from core business objectives. Additionally, organizations must navigate cost optimization, maintain data security and compliance, and democratize access to ML tools across teams.
Customers have built their own ML architectures on commodity machines using open source solutions such as Kubernetes and Slurm. Although this approach provides control over the infrastructure, the effort required to manage and maintain it over time (for example, handling hardware failures) can be substantial. Organizations often underestimate the complexity of integrating these components, maintaining security and compliance, and keeping the system up to date and optimized for performance.
As a result, many companies struggle to harness the full potential of ML while maintaining efficiency and innovation in a competitive landscape.
How Amazon SageMaker can help
Amazon SageMaker addresses these challenges by providing a fully managed service that streamlines and accelerates the entire ML lifecycle. You can use the full set of SageMaker tools to build and train your models at scale while offloading the management and maintenance of the underlying infrastructure to SageMaker.
You can use SageMaker to scale your training cluster to thousands of accelerators, with your choice of compute, and optimize your workloads for performance with the SageMaker distributed training libraries. For cluster resiliency, SageMaker offers self-healing capabilities that automatically detect and recover from failures, enabling continuous FM training for months with little to no interruption and reducing training time by up to 40%. SageMaker also supports popular ML frameworks, such as TensorFlow and PyTorch, through managed pre-built containers. For those who need more customization, SageMaker also lets users bring their own libraries or containers.
To address various technical and business use cases, Amazon SageMaker offers two options for distributed pre-training and fine-tuning: SageMaker Training Jobs and SageMaker HyperPod.
SageMaker Training Jobs
SageMaker Training Jobs offer a managed user experience for large, distributed FM training, eliminating undifferentiated heavy lifting around infrastructure management and cluster resiliency while offering pay-as-you-go pricing. SageMaker Training Jobs automatically spin up a resilient distributed training cluster, provide managed orchestration, monitor the infrastructure, and automatically recover from failures for a seamless training experience. After training is complete, SageMaker spins down the cluster, and you are billed for the net training time in seconds. FM builders can further optimize this experience using SageMaker Managed Warm Pools, which let you retain and reuse provisioned infrastructure after a training job completes, reducing latency and speeding up iteration between ML experiments.
With SageMaker Training Jobs, FM builders have the flexibility to choose the instance type that best suits their workload to further optimize their training budget. For example, you can pre-train a large language model (LLM) on a P5 cluster or fine-tune an open source LLM on p4d instances. This allows companies to deliver a consistent training experience across ML teams with varying levels of technical expertise and different types of workloads.
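As a rough sketch of how such a job could be launched programmatically, the snippet below assembles a request for the boto3 `create_training_job` API, including a warm-pool keep-alive period. The job name, role ARN, container URI, and S3 paths are hypothetical placeholders, not values from this post.

```python
# Minimal sketch of a SageMaker training job request for boto3's
# create_training_job call. All names, ARNs, and URIs below are
# placeholders -- substitute values from your own AWS account.

def build_training_job_request(job_name, role_arn, image_uri, output_s3_uri,
                               instance_type="ml.p4d.24xlarge",
                               instance_count=2,
                               warm_pool_seconds=3600):
    """Assemble the keyword arguments for create_training_job."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,     # a managed framework container or your own
            "TrainingInputMode": "File",
        },
        "ResourceConfig": {
            "InstanceType": instance_type,  # e.g. p4d for fine-tuning, p5 for pre-training
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 200,
            # Managed Warm Pools: retain the provisioned cluster between jobs
            "KeepAlivePeriodInSeconds": warm_pool_seconds,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 3600},
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
    }

request = build_training_job_request(
    job_name="llm-finetune-demo",
    role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    image_uri="<framework-container-uri>",
    output_s3_uri="s3://my-bucket/training-output/",
)
# A real launch would then be:
#   import boto3
#   bototo3.client("sagemaker").create_training_job(**request)
```

Because the cluster is fully managed, this request is all the infrastructure definition a team needs to write; SageMaker handles provisioning, recovery, and teardown.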
Additionally, Amazon SageMaker Training Jobs integrate with tools such as SageMaker Profiler for profiling training jobs, Amazon SageMaker with MLflow for managing ML experiments, Amazon CloudWatch for monitoring and alerting, and TensorBoard for debugging and analyzing training jobs. Together, these tools improve model development by providing performance insights, tracking experiments, and enabling proactive management of training processes.
AI21 Labs, Technology Innovation Institute, Upstage, and Bria AI chose SageMaker Training Jobs to train and fine-tune their FMs, reducing total cost of ownership by offloading workload orchestration and underlying compute management to SageMaker. They achieved faster results by focusing their resources on model development and experimentation while SageMaker handled the provisioning, creation, and termination of their compute clusters.
The following demo provides a high-level, step-by-step guide to using Amazon SageMaker Training Jobs.
SageMaker HyperPod
SageMaker HyperPod offers persistent clusters with deep infrastructure control, which developers can use to connect over Secure Shell (SSH) to Amazon Elastic Compute Cloud (Amazon EC2) instances for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost to the customer), minimizing downtime during critical node replacements. Customers can use popular orchestration tools, such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS), and libraries built on top of these tools for flexible job scheduling and compute sharing. Additionally, SageMaker HyperPod cluster orchestration with Slurm enables NVIDIA Enroot and Pyxis integration to quickly schedule containers as high-performance, unprivileged sandboxes. The operating system and software stack are based on the AWS Deep Learning AMIs, which are preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. HyperPod also includes the SageMaker distributed training libraries, which are optimized for AWS infrastructure, so users can automatically split training workloads across thousands of accelerators for efficient parallel training.
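As a rough illustration of the Slurm-based workflow described above, a job script submitted to a HyperPod cluster might look like the following. The node count, GPU count, and training script names are hypothetical placeholders for a PyTorch distributed run, not HyperPod defaults.

```shell
#!/bin/bash
# Hypothetical Slurm job script for a HyperPod cluster. Resource
# requests and script names are placeholders -- adjust for your cluster.
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --exclusive

# Use the first allocated node as the rendezvous endpoint.
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Launch one torchrun per node; Slurm supplies node list and ranks.
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${HEAD_NODE}:29500" \
    train.py --config config.yaml
```

Because the cluster is persistent and self-healing, the same script can be resubmitted across long experiment campaigns while HyperPod replaces faulty nodes underneath Slurm.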
FM builders can use the ML tools built into HyperPod to improve model performance, such as Amazon SageMaker with TensorBoard to visualize a model architecture and address convergence issues, while Amazon SageMaker Debugger captures metrics and training profiles in real time. Additionally, integration with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provides deeper insights into cluster performance, health, and utilization, saving valuable development time.
This high-performance, self-healing environment, trusted by customers such as Articul8 AI, IBM, Perplexity AI, Hugging Face, Luma AI, and Thomson Reuters, supports advanced ML workflows and internal optimizations.
The following demo provides a high-level, step-by-step guide to using Amazon SageMaker HyperPod.
Choosing the right option
For organizations that require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. HyperPod offers custom network configurations, flexible parallelism strategies, and support for custom orchestration techniques. It integrates seamlessly with tools such as Slurm, Amazon EKS, and NVIDIA Enroot and Pyxis, and provides SSH access for in-depth debugging and custom configuration.
SageMaker Training Jobs are designed for organizations that want to focus on model development rather than infrastructure management and prefer the ease of use of a managed experience. SageMaker Training Jobs feature an easy-to-use interface, simplified configuration and scaling, automatic handling of distributed training tasks, and built-in synchronization, checkpointing, and fault tolerance, abstracting away infrastructure complexity.
When choosing between SageMaker HyperPod and SageMaker Training Jobs, organizations should align their decision with their specific training needs, workflow preferences, and desired level of control over the training infrastructure. HyperPod is the preferred choice for those seeking deep technical control and extensive customization, whereas Training Jobs are ideal for organizations that prefer a streamlined, fully managed solution.
Conclusion
Learn more about Amazon SageMaker and large-scale distributed training on AWS by visiting Getting Started with Amazon SageMaker, viewing the Generative AI in Amazon SageMaker Deep Dive Series, and exploring the awsome-distributed-training and amazon-sagemaker-examples GitHub repositories.
About the authors
Trevor Harvey is a Principal Generative AI Specialist at Amazon Web Services and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Kanwaljit Khurmi is a Principal Generative AI/ML Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions on AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Miron Perel is a senior director of machine learning business development at Amazon Web Services. Miron advises generative AI companies on building their next-generation models.
Guillaume Mangeot is a Senior WW GenAI Specialist Solutions Architect at Amazon Web Services with over a decade of experience in high performance computing (HPC). With multidisciplinary expertise in applied mathematics, he leads the design of highly scalable architectures in cutting-edge fields such as generative AI, ML, HPC, and storage, across verticals including oil and gas, research, life sciences, and insurance.