We’re excited to announce that Amazon SageMaker JumpStart can now stream large language model (LLM) inference responses. Token streaming allows you to view the model response output as it is generated rather than waiting for LLMs to finish generating the response before it is available for use or viewing. SageMaker JumpStart’s streaming capability can help you create applications with a better user experience by creating a perception of low latency for the end user.
In this post, we walk through how to deploy a Falcon 7B Instruct model endpoint and stream the response from it.
As of this writing, the following LLMs available in SageMaker JumpStart support streaming:
- Mistral AI 7B, Mistral AI 7B Instruct
- Falcon 180B, Falcon 180B Chat
- Falcon 40B, Falcon 40B Instruct
- Falcon 7B, Falcon 7B Instruct
- Rinna Japanese GPT NeoX 4B Instruction PPO
- Rinna Japanese GPT NeoX 3.6B Instruction PPO
To check for updates to the list of models that support streaming in SageMaker JumpStart, search for “huggingface-llm” in the Built-in Algorithms with pre-trained Model Table.
Note that you can use the streaming feature of Amazon SageMaker hosting out of the box for any model deployed using the SageMaker TGI Deep Learning Container (DLC), as described in Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker.
Foundation models in SageMaker
SageMaker JumpStart provides access to a range of models from popular model hubs, including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained on billions of parameters and are adaptable to a broad category of use cases, such as text summarization, digital art generation, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.
You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try out these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using pre-built notebooks from model providers. Because the models are hosted and deployed on AWS, you can rest assured that your data, whether used for evaluating the model or using it at scale, is never shared with third parties.
Token streaming
Token streaming allows the inference response to be returned as it is being generated by the model. This way, you can see the response generated incrementally rather than waiting for the model to finish before the full response is available. Streaming can help enable a better user experience because it decreases the perception of latency for the end user. You can start seeing the output as it is generated, and can therefore stop generation early if the output doesn’t look useful for your purposes. Streaming can make a big difference, especially for long-running queries, because you can start seeing outputs as they are generated, which can create a perception of lower latency even though the end-to-end latency remains the same.
As of this writing, you can use streaming in SageMaker JumpStart for models that use the Hugging Face LLM Text Generation Inference (TGI) DLC.
[Comparison: response without streaming vs. response with streaming]
Solution overview
For this post, we use the Falcon 7B Instruct model to show the streaming capabilities of SageMaker JumpStart.
You can use the following code to find other models in SageMaker JumpStart that support streaming:
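(A minimal sketch using the SageMaker Python SDK; the filter values are an assumption based on how JumpStart tags the Hugging Face LLM models served by the TGI container.)

```python
from sagemaker.jumpstart.filters import And
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Hugging Face LLM models in JumpStart are served by the TGI container,
# so filtering on these tags surfaces the models that support streaming
filter_value = And("task == llm", "framework == huggingface")
model_ids = list_jumpstart_models(filter=filter_value)
print(model_ids)
```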
The call returns the model IDs that currently support streaming, including the Falcon 7B Instruct model used in this post.
Prerequisites
Before running the notebook, some initial setup is required. Run the following commands:
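(A minimal sketch, assuming a Jupyter notebook with an appropriate AWS role already configured; your environment may need additional packages.)

```python
# Upgrade the SageMaker Python SDK and boto3 to versions that
# include SageMaker JumpStart and response streaming support
%pip install --upgrade sagemaker boto3
```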
Deploy the model
As a first step, use SageMaker JumpStart to deploy a Falcon 7B Instruct model. For full instructions, see Falcon 180B foundation model from TII is now available via Amazon SageMaker JumpStart. Use the following code:
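(A minimal deployment sketch with the SageMaker Python SDK; the model ID huggingface-llm-falcon-7b-instruct-bf16 is the JumpStart identifier for Falcon 7B Instruct, and deploy() picks up the default instance type from the model’s metadata.)

```python
from sagemaker.jumpstart.model import JumpStartModel

# Create the JumpStart model object and deploy it to a real-time
# endpoint; deployment can take several minutes
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = model.deploy()
endpoint_name = predictor.endpoint_name
```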
Query the endpoint and stream the response
Next, create a payload to invoke your deployed endpoint. Importantly, the payload must contain the key/value pair "stream": True. This tells the TGI server to generate a streaming response.
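For example (the prompt and generation parameters below are placeholders; adjust them for your use case):

```python
payload = {
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 256},
    "stream": True,  # ask the TGI server to stream tokens as they are generated
}
```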
Before querying the endpoint, you must create an iterator that can parse the byte stream response from the endpoint. Data for each token is provided on a separate line in the response, so this iterator returns a token each time a new line is identified in the streaming buffer. This iterator is minimally designed, and you might want to adjust its behavior for your use case; for example, while this iterator returns token strings, the line data contains other information, such as token log probabilities, that could be of interest.
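One possible implementation is sketched below. It assumes the TGI container’s server-sent-event format, where each complete line looks like data:{...} and carries a JSON object with a token field:

```python
import io
import json


class TokenIterator:
    """Buffer the endpoint's byte stream and yield one token string per complete line."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                # A complete line is buffered: advance past it and parse it
                self.read_pos += len(line)
                full_line = line[:-1].decode("utf-8")
                if not full_line.startswith("data:"):
                    continue  # skip blank separator lines between events
                # The parsed object also carries fields such as token log
                # probabilities, which you could surface here if needed
                line_data = json.loads(full_line[len("data:"):])
                return line_data["token"]["text"]
            # No complete line yet: pull the next chunk from the event stream
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
```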
Now you can use the Boto3 invoke_endpoint_with_response_stream API on the endpoint you created and enable streaming by iterating over a TokenIterator instance.
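A minimal sketch, assuming the payload, endpoint_name, and TokenIterator defined earlier:

```python
import json

import boto3

client = boto3.client("sagemaker-runtime")
response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

# Print each token as soon as it arrives; end="" keeps the
# tokens on one continuously growing line
for token in TokenIterator(response["Body"]):
    print(token, end="")
```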
Specifying an empty end parameter to the print function enables a visual stream without new line characters inserted, so the response appears to build up in place as each token arrives.
You can use this code in a notebook or with other applications, such as Streamlit or Gradio, to see streaming in action and the experience it can provide for your customers.
Clean up
Finally, remember to clean up your deployed model and endpoint to avoid incurring additional costs:
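(Assuming the predictor object from the deployment step, a minimal cleanup looks like this.)

```python
# Delete the model artifacts and the hosted endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()
```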
Conclusion
In this post, we showed you how to use the newly launched streaming feature in SageMaker JumpStart. We hope you use the token streaming capability to build interactive applications that require low latency for a better user experience.
About the authors
Rachna Chadha is a Principal AI/ML Solutions Architect in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna enjoys spending time with her family, hiking, and listening to music.
Dr. Kyle Ulrich is an Applied Scientist on the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He earned his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers at NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.