NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances hosted by SageMaker.
NIM, part of the NVIDIA AI Enterprise software platform listed on AWS Marketplace, is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications, providing natural language processing (NLP) and understanding capabilities, whether you're developing chatbots, summarizing documents, or building other NLP-powered applications. You can use pre-built NVIDIA containers to host popular LLMs that are optimized for specific NVIDIA GPUs for quick deployment, or use NIM tools to create your own containers.
In this post, we provide a high-level introduction to NIM and show how you can use it with SageMaker.
An introduction to NVIDIA NIM
NIM provides pre-built, performance-optimized engines for a variety of popular models for inference. These microservices support a range of LLMs, such as Llama 2 (7B, 13B, and 70B), Mistral-7B-Instruct, Mixtral-8x7B, NVIDIA Nemotron-3 22B Persona, and Code Llama 70B, out of the box, using pre-built NVIDIA TensorRT engines tailored to specific NVIDIA GPUs for maximum performance and utilization. These models are curated with the optimal hyperparameters for model hosting performance so you can deploy applications with ease.
If your model is not in NVIDIA's curated model set, NIM offers essential utilities such as the Model Repo Generator, which makes it easy to create a TensorRT-LLM-accelerated engine and a model directory in NIM format from a simple YAML file. Additionally, an integrated vLLM community backend provides support for cutting-edge models and emerging features that may not yet be integrated into the TensorRT-LLM optimized stack.
In addition to creating optimized LLM inference engines, NIM provides advanced hosting technologies, such as optimized scheduling techniques like in-flight batching, which splits the overall text generation process for an LLM into multiple iterations on the model. With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the NIM runtime immediately evicts finished sequences from the batch and begins executing new requests while other requests are still in flight, making better use of its compute instances and GPUs.
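The following is a minimal, self-contained sketch of the in-flight (continuous) batching idea described above. It is a toy scheduler for illustration only and does not represent NIM's actual runtime or APIs; each request is modeled simply as a number of remaining decode steps.

```python
from collections import deque

def run_inflight_batching(requests, max_batch_size=4):
    """Toy in-flight (continuous) batching loop.

    requests: list of (request_id, decode_steps) pairs, where decode_steps
    stands in for the number of tokens a request still needs to generate.
    """
    pending = deque(requests)
    in_flight = {}   # request_id -> remaining decode steps
    finished = []

    while pending or in_flight:
        # Admit new requests the moment slots free up, instead of
        # waiting for the whole batch to drain.
        while pending and len(in_flight) < max_batch_size:
            request_id, steps = pending.popleft()
            in_flight[request_id] = steps

        # One decode iteration across the current batch.
        for request_id in list(in_flight):
            in_flight[request_id] -= 1
            if in_flight[request_id] == 0:
                # Finished sequences are evicted immediately,
                # freeing their slot for the next queued request.
                finished.append(request_id)
                del in_flight[request_id]

    return finished

print(run_inflight_batching(
    [("req-1", 3), ("req-2", 1), ("req-3", 5), ("req-4", 2), ("req-5", 2)]
))
```

Note how short requests (such as req-2) exit the batch early and their slots are reused right away, which is what keeps GPU utilization high compared to static batching.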
Implementing NIM in SageMaker
NIM integrates with SageMaker, allowing you to host your LLMs with performance and cost optimization while benefiting from SageMaker's capabilities. When you use NIM on SageMaker, you can take advantage of capabilities such as scaling out the number of instances hosting your model, performing blue/green deployments, and evaluating workloads using shadow testing, all with best-in-class observability and monitoring through Amazon CloudWatch.
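As a concrete illustration, the following sketch uses the SageMaker Python SDK to host a NIM container on a real-time endpoint and invoke it. The container image URI, environment, instance type, and request payload shown here are placeholders and assumptions, not values from this post; refer to the NIM and SageMaker documentation for the exact image and request schema.

```python
import json

import boto3
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes you run this from a SageMaker notebook or Studio

# Placeholder image URI: in practice this would be the NIM container
# obtained through AWS Marketplace / NVIDIA NGC.
nim_image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/<nim-llm-container>:latest"

model = Model(
    image_uri=nim_image_uri,
    role=role,
    sagemaker_session=session,
)

# The GPU instance type is illustrative; choose one suited to your model size.
model.deploy(
    endpoint_name="nim-llm-endpoint",
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)

# Invoke the endpoint; the payload schema below is a hypothetical example.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="nim-llm-endpoint",
    ContentType="application/json",
    Body=json.dumps({"prompt": "Summarize the following document: ...", "max_tokens": 256}),
)
print(response["Body"].read().decode("utf-8"))
```

Once deployed this way, the endpoint behaves like any other SageMaker endpoint, so features such as auto scaling, blue/green deployments, and CloudWatch monitoring apply without additional work.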
Conclusion
Using NIM to deploy optimized LLMs can be a great option for both performance and cost. It also helps make deploying LLMs effortless. In the future, NIM will also enable parameter-efficient fine-tuning (PEFT) customization methods, such as LoRA and P-tuning. NIM also plans to broaden its LLM support by supporting Triton Inference Server, TensorRT-LLM, and vLLM backends.
We encourage you to learn more about NVIDIA microservices and how to deploy your LLMs using SageMaker, and to try out the benefits available to you. NIM is available as a paid offering as part of the NVIDIA AI Enterprise software subscription available on AWS Marketplace.
In the near future, we will publish a detailed guide to NIM in SageMaker.
About the authors
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures and new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex machine learning applications, multi-tenant machine learning models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Qing Lan is a Software Development Engineer at AWS. He has been working on several challenging products at Amazon, including high-performance machine learning inference solutions and high-performance logging systems. Qing's team successfully launched the first billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.
Nikhil Kulkarni is an AWS Machine Learning software developer focused on making machine learning workloads more performant in the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He is passionate about distributed deep learning systems. Outside of work, he enjoys reading books, playing the guitar, and making pizza.
Harish Tummalacherla is a Software Engineer on the Deep Learning Performance team at SageMaker. He works on performance engineering to serve large language models efficiently on SageMaker. In his spare time, he enjoys running, cycling, and ski mountaineering.
Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon's AI MLOps and DevOps teams, data scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models, spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. Additionally, Eliuth is an avid mountain biker, skier, and tennis and poker player.
Jia Hong Liu is a Solutions Architect on NVIDIA's Cloud Service Provider team. He helps customers adopt AI and machine learning solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his spare time, he enjoys origami, DIY projects, and playing basketball.
Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.