We’re excited to announce that Amazon SageMaker JumpStart can now stream large language model (LLM) inference responses. Token streaming allows you to view the model response output as it is generated rather than waiting for LLMs to finish generating the response before it is available for use or viewing. SageMaker JumpStart’s streaming capability can help you create applications with a better user experience by creating a perception of low latency for the end user.
In this post, we walk through how to deploy a Falcon 7B Instruct model endpoint and stream the response from it.
As of this writing, the following LLMs available in SageMaker JumpStart support streaming:
- Mistral AI 7B, Mistral AI 7B Instruct
- Falcon 180B, Falcon 180B Chat
- Falcon 40B, Falcon 40B Instruct
- Falcon 7B, Falcon 7B Instruct
- Rinna Japanese GPT NeoX 4B Instruction PPO
- Rinna Japanese GPT NeoX 3.6B Instruction PPO
To check for updates to the list of models that support streaming in SageMaker JumpStart, search for “huggingface-llm” in the Built-in Algorithms with pre-trained Model Table.
Note that you can use the streaming feature of Amazon SageMaker hosting out of the box for any model deployed using the SageMaker TGI Deep Learning Container (DLC), as described in Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker.
Foundation models in SageMaker
SageMaker JumpStart provides access to a range of models from popular model hubs, including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained on billions of parameters and are adaptable to a broad category of use cases, such as text summarization, digital art generation, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.
You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try out these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using pre-built notebooks from model providers. Because the models are hosted and deployed on AWS, you can rest assured that your data, whether used for evaluating the model or using it at scale, is never shared with third parties.
Token streaming
Token streaming allows the inference response to be returned as it is being generated by the model. This way, you can see the response generated incrementally rather than waiting for the model to finish before the full response is available. Streaming can help enable a better user experience because it decreases the perception of latency for the end user. You can start seeing the output as it is generated, and can therefore stop generation early if the output doesn’t look useful for your purposes. Streaming can make a big difference, especially for long-running queries, because you can start seeing outputs as they are generated, which can create a perception of lower latency even though the end-to-end latency remains the same.
As of this writing, you can use streaming in SageMaker JumpStart for models that use the Hugging Face LLM Text Generation Inference (TGI) DLC.
[Comparison: response without streaming vs. response with streaming]
Solution overview
For this post, we use the Falcon 7B Instruct model to show the streaming capabilities of SageMaker JumpStart.
You can use the following code to find other models in SageMaker JumpStart that support streaming:
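(A minimal sketch using the SageMaker Python SDK; the filter values are an assumption based on how JumpStart tags the Hugging Face LLM models served by the TGI container.)

```python
from sagemaker.jumpstart.filters import And
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Hugging Face LLM models in JumpStart are served by the TGI container,
# so filtering on these tags surfaces the models that support streaming
filter_value = And("task == llm", "framework == huggingface")
model_ids = list_jumpstart_models(filter=filter_value)
print(model_ids)
```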
The call returns the model IDs that currently support streaming, including the Falcon 7B Instruct model used in this post.
Prerequisites
Before running the notebook, some initial setup is required. Run the following commands:
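(A minimal sketch, assuming a Jupyter notebook with an appropriate AWS role already configured; your environment may need additional packages.)

```python
# Upgrade the SageMaker Python SDK and boto3 to versions that
# include SageMaker JumpStart and response streaming support
%pip install --upgrade sagemaker boto3
```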
Deploy the model
As a first step, use SageMaker JumpStart to deploy a Falcon 7B Instruct model. For full instructions, see Falcon 180B foundation model from TII is now available via Amazon SageMaker JumpStart. Use the following code:
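(A minimal deployment sketch with the SageMaker Python SDK; the model ID huggingface-llm-falcon-7b-instruct-bf16 is the JumpStart identifier for Falcon 7B Instruct, and deploy() picks up the default instance type from the model’s metadata.)

```python
from sagemaker.jumpstart.model import JumpStartModel

# Create the JumpStart model object and deploy it to a real-time
# endpoint; deployment can take several minutes
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = model.deploy()
endpoint_name = predictor.endpoint_name
```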
Query the endpoint and stream the response
Next, create a payload to invoke your deployed endpoint. Importantly, the payload must contain the key/value pair "stream": True. This tells the TGI server to generate a streaming response.
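For example (the prompt and generation parameters below are placeholders; adjust them for your use case):

```python
payload = {
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 256},
    "stream": True,  # ask the TGI server to stream tokens as they are generated
}
```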
Before querying the endpoint, you must create an iterator that can parse the byte stream response from the endpoint. Data for each token is provided on a separate line in the response, so this iterator returns a token each time a new line is identified in the streaming buffer. This iterator is minimally designed, and you might want to adjust its behavior for your use case; for example, while this iterator returns token strings, the line data contains other information, such as token log probabilities, that could be of interest.
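One possible implementation is sketched below. It assumes the TGI container’s server-sent-event format, where each complete line looks like data:{...} and carries a JSON object with a token field:

```python
import io
import json


class TokenIterator:
    """Buffer the endpoint's byte stream and yield one token string per complete line."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                # A complete line is buffered: advance past it and parse it
                self.read_pos += len(line)
                full_line = line[:-1].decode("utf-8")
                if not full_line.startswith("data:"):
                    continue  # skip blank separator lines between events
                # The parsed object also carries fields such as token log
                # probabilities, which you could surface here if needed
                line_data = json.loads(full_line[len("data:"):])
                return line_data["token"]["text"]
            # No complete line yet: pull the next chunk from the event stream
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
```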
Now you can use the Boto3 invoke_endpoint_with_response_stream API on the endpoint you created and enable streaming by iterating over a TokenIterator instance.
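A minimal sketch, assuming the payload, endpoint_name, and TokenIterator defined earlier:

```python
import json

import boto3

client = boto3.client("sagemaker-runtime")
response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

# Print each token as soon as it arrives; end="" keeps the
# tokens on one continuously growing line
for token in TokenIterator(response["Body"]):
    print(token, end="")
```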
Specifying an empty end parameter to the print function enables a visual stream without new line characters inserted, so the response appears to build up in place as each token arrives.
You can use this code in a notebook or with other applications, such as Streamlit or Gradio, to see streaming in action and the experience it can provide for your customers.
Clean up
Finally, remember to clean up your deployed model and endpoint to avoid incurring additional costs:
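(Assuming the predictor object from the deployment step, a minimal cleanup looks like this.)

```python
# Delete the model artifacts and the hosted endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()
```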
Conclusion
In this post, we showed you how to use the newly launched streaming feature in SageMaker JumpStart. We hope you use the token streaming capability to build interactive applications that require low latency for a better user experience.
About the authors
Rachna Chadha is a Principal AI/ML Solutions Architect in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna enjoys spending time with her family, hiking, and listening to music.
Dr. Kyle Ulrich is an Applied Scientist on the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He earned his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers at NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.