Today, we are pleased to announce that Meta's Llama 3.3 70B is available in Amazon SageMaker JumpStart. Llama 3.3 70B marks an exciting advance in the development of large language models (LLMs), offering performance comparable to larger Llama versions with fewer computational resources.
In this post, we explore how to deploy this model efficiently on Amazon SageMaker AI, using advanced SageMaker AI features for optimal performance and cost management.
Llama 3.3 70B Model Overview
Llama 3.3 70B represents a significant advancement in model efficiency and performance optimization. The new model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency gain translates into nearly five times more cost-effective inference operations, making it an attractive option for production deployments.
The model's sophisticated architecture is based on Meta's optimized version of the transformer design and features an improved attention mechanism that can help substantially reduce inference costs. During its development, Meta's engineering team trained the model on an extensive dataset of approximately 15 trillion tokens, incorporating web-sourced content and over 25 million synthetic examples created specifically for LLM development. This comprehensive training approach results in strong generation and understanding capabilities across a wide range of tasks.
What sets Llama 3.3 70B apart is its refined training methodology. The model underwent an extensive supervised fine-tuning process, complemented by reinforcement learning from human feedback (RLHF). This dual-focus training strategy helps align model outputs more closely with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B demonstrated remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 of 10 standard AI benchmarks and actually outperforming it in three categories. This performance profile makes it an ideal candidate for organizations seeking to balance model capabilities with operational efficiency.
The following figure summarizes the benchmark comparison results (source: Meta).
Getting started with SageMaker JumpStart
SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pretrained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them to production using either the user interface or the SDK.
Deploying Llama 3.3 70B through SageMaker JumpStart offers two convenient approaches: using SageMaker JumpStart's intuitive user interface or deploying programmatically through the SageMaker Python SDK. Let's explore both methods to help you choose the approach that best suits your needs.
Deploy Llama 3.3 70B via the SageMaker JumpStart UI
You can access the SageMaker JumpStart UI through Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, complete the following steps:
- In SageMaker Unified Studio, in the Build menu, choose JumpStart Models.
Alternatively, in the SageMaker Studio console, choose JumpStart in the navigation pane.
- Search for Meta Llama 3.3 70B.
- Choose the Meta Llama 3.3 70B model.
- Choose Deploy.
- Accept the end user license agreement (EULA).
- For Instance type, choose an instance type (ml.g5.48xlarge or ml.p4d.24xlarge).
- Choose Deploy.
Wait until the endpoint status shows as InService. You can then run inference with the model, as illustrated in the example that follows.
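Once the endpoint is InService, you can invoke it with the AWS SDK for Python (Boto3). The following is a minimal sketch: the endpoint name is a placeholder for whatever name SageMaker assigned to your deployment, and the request schema mirrors other Llama text-generation models in JumpStart, so verify both against the sample payloads on the model card.

```python
import json

import boto3

# Placeholder endpoint name; replace with the name SageMaker assigned to your deployment.
endpoint_name = "jumpstart-dft-meta-textgeneration-llama-3-3-70b-instruct"

runtime = boto3.client("sagemaker-runtime")

# Request schema assumed from other Llama text-generation models in JumpStart.
payload = {
    "inputs": "Explain the difference between supervised and unsupervised learning.",
    "parameters": {"max_new_tokens": 256, "temperature": 0.6, "top_p": 0.9},
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

print(json.loads(response["Body"].read()))
```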
Deploy Llama 3.3 70B using the SageMaker Python SDK
For teams looking to automate deployment or integrate with existing MLOps pipelines, you can use the following code to deploy the model using the SageMaker Python SDK:
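The snippet below is a minimal sketch using the SageMaker Python SDK's JumpStartModel class. The model ID shown is an assumption based on the JumpStart naming convention for Meta Llama models; confirm the exact ID on the model card in SageMaker JumpStart.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Assumed JumpStart model ID; verify it on the model card in SageMaker JumpStart.
model_id = "meta-textgeneration-llama-3-3-70b-instruct"

model = JumpStartModel(
    model_id=model_id,
    instance_type="ml.g5.48xlarge",  # or ml.p4d.24xlarge
)

# Deploying Llama models requires explicitly accepting Meta's EULA.
predictor = model.deploy(accept_eula=True)

# Quick smoke test once the endpoint is in service.
response = predictor.predict({
    "inputs": "Summarize the benefits of Llama 3.3 70B in one sentence.",
    "parameters": {"max_new_tokens": 128},
})
print(response)
```

The deploy call provisions a real-time endpoint and returns a predictor you can call directly, so no separate endpoint lookup is needed.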
Set up auto scaling and scale down to zero
Optionally, you can configure auto scaling to scale your endpoint down to zero after deployment, as shown in the sketch that follows. For more information, see Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
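As a rough sketch of what that configuration can look like, the following uses Application Auto Scaling to let an inference component scale its copy count down to zero. It assumes the model was deployed as an inference component; the component name, capacity limits, and metric choice are illustrative, and the linked post covers the complete setup, including scaling back out from zero.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Illustrative inference component name; use the one created by your deployment.
resource_id = "inference-component/llama-3-3-70b-ic"
dimension = "sagemaker:inference-component:DesiredCopyCount"

# Allow the component to scale all the way down to zero copies when idle.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    MinCapacity=0,
    MaxCapacity=2,
)

# Target tracking policy that adds copies as traffic grows and removes them when it stops.
aas.put_scaling_policy(
    PolicyName="llama-3-3-70b-scale-to-zero",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```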
Optimize deployment with SageMaker AI
SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost-effectiveness. With the advanced capabilities of SageMaker AI, organizations can deploy and manage LLMs in production environments, taking full advantage of Llama 3.3 70B's efficiency while benefiting from SageMaker AI's streamlined deployment process and optimization tools. The default deployment through SageMaker JumpStart uses accelerated deployment, which relies on speculative decoding to improve performance. To learn more about how speculative decoding works with SageMaker AI, see Amazon SageMaker Releases Updated Inference Optimization Toolkit for Generative AI.
First, Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time needed to prepare the model for inference. Instead of the traditional approach of loading the entire model into memory before beginning operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster startup and scaling times.
One SageMaker inference capability is container caching, which transforms how model containers are managed during scaling operations. This feature removes one of the major obstacles to scaling your deployment by pre-caching container images, eliminating the need for time-consuming downloads when adding new instances. For large models like Llama 3.3 70B, where container images can be considerable in size, this optimization significantly reduces scaling latency and improves overall system responsiveness.
Another key capability is Scale to Zero, which introduces intelligent resource management that automatically adjusts computing capacity based on actual usage patterns. This feature represents a paradigm shift in cost optimization for model deployments, allowing endpoints to scale down completely during periods of inactivity while retaining the ability to scale up quickly when demand returns. This capability is particularly valuable for organizations that run multiple models or handle variable workload patterns.
Together, these features create a powerful deployment environment that maximizes the benefits of Llama 3.3 70B's efficient architecture while providing robust tools to manage operational costs and performance.
Conclusion
The combination of Llama 3.3 70B with the advanced inference capabilities of SageMaker AI provides an optimal solution for production deployments. By using Fast Model Loader, Container Caching, and Scale to Zero, organizations can achieve both high performance and cost-effectiveness in their LLM deployments.
We encourage you to try this implementation and share your experiences.
About the authors
Marc Karp is a machine learning architect on the Amazon SageMaker service team. He focuses on helping customers design, deploy, and manage machine learning workloads at scale. In his free time, he enjoys traveling and exploring new places.
Saurabh Trikande is a senior product manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on the key challenges related to deploying complex AI applications, inference with multi-tenant models, optimizing costs, and making the deployment of generative AI models more accessible. In his free time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Melanie Li, PhD, is a Senior Solutions Architect specializing in generative AI at AWS, based in Sydney, Australia, where she focuses on working with customers to build solutions that leverage next-generation AI and machine learning tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of large language models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Adriana Simmons is a senior product marketing manager at AWS.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in machine learning optimization, model acceleration, and AI security. He focuses on improving efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge machine learning accessible and impactful across industries.
Yotam Moss is a software development manager for Inference at AWS AI.