Create a serverless meeting summary backend with large language models in Amazon SageMaker JumpStart

AWS offers services that meet customers’ artificial intelligence (AI) and machine learning (ML) needs, ranging from custom hardware like AWS Trainium and AWS Inferentia to generative AI foundation models (FMs) on Amazon Bedrock. In February 2023, AWS and Hugging Face announced a collaboration to make generative AI more accessible and cost-effective.

Generative AI has grown at a rapid pace: the largest pretrained model in 2019 had 330 million parameters, whereas today's largest models have over 500 billion. Model performance and quality have also improved dramatically as parameter counts have grown. These models cover tasks like text-to-text, text-to-image, text-to-embedding, and more. Large language models (LLMs) in particular are well suited to tasks such as summarization, metadata extraction, and question answering.

Amazon SageMaker JumpStart is an ML hub that can help you accelerate your ML journey. With JumpStart, you can access pretrained models and foundation models from the Foundation Model Hub to perform tasks such as article summarization and image generation. Pretrained models are fully customizable for your use cases and can be easily deployed to production with the UI or SDK. Most importantly, none of your data is used to train the underlying models. Because all data is encrypted and doesn’t leave your virtual private cloud (VPC), you can trust that your data will stay private and confidential.

This post focuses on building a serverless meeting summarization backend that uses Amazon Transcribe to transcribe meeting audio and the Flan T5 XL model from Hugging Face (available on JumpStart) to summarize the transcript.

Solution Overview

The Meeting Notes Generator Solution creates an automated serverless pipeline that uses AWS Lambda to transcribe and summarize audio and video recordings of meetings. The solution can also be implemented with other FMs available in JumpStart.

The solution includes the following components:

A shell script to create a custom Lambda layer
A configurable AWS CloudFormation template to implement the solution
Lambda function code to start Amazon Transcribe transcription jobs
Lambda function code to invoke a SageMaker real-time endpoint hosting the Flan T5 XL model

The following diagram illustrates this architecture.

As shown in the architecture diagram, meeting recordings, transcripts, and notes are stored in their respective Amazon Simple Storage Service (Amazon S3) buckets. The solution takes an event-driven approach to transcription and summarization: S3 upload events trigger Lambda functions, which make API calls to Amazon Transcribe and invoke the real-time endpoint that hosts the Flan T5 XL model.
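To make the event-driven flow concrete, the following is a minimal sketch of a Lambda handler that reacts to an S3 upload event and starts a transcription job. It is not the function from the GitHub repository: the bucket name, the transcripts prefix, the job-naming rule, and the choice of automatic language identification are all assumptions made for illustration. Only the S3 event layout and the StartTranscriptionJob parameters come from the AWS APIs.

```python
import re
import urllib.parse

# Hypothetical values for illustration; the deployed stack supplies its own.
OUTPUT_BUCKET = "meeting-note-generator-demo-bucket-123456789012"
TRANSCRIPTS_PREFIX = "transcripts"


def build_transcription_request(event, output_bucket=OUTPUT_BUCKET,
                                output_prefix=TRANSCRIPTS_PREFIX):
    """Turn an S3 upload event into keyword arguments for StartTranscriptionJob."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    file_name = key.rsplit("/", 1)[-1]
    # Job names may only contain letters, digits, '.', '_' and '-'.
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "-", file_name)[:200]
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "IdentifyLanguage": True,  # assumption: let Transcribe detect the language
        "OutputBucketName": output_bucket,
        "OutputKey": f"{output_prefix}/{job_name}.json",
    }


def lambda_handler(event, context, transcribe_client=None):
    """Entry point invoked by the S3 upload event."""
    if transcribe_client is None:
        import boto3  # imported lazily so the helper above works without AWS access
        transcribe_client = boto3.client("transcribe")
    request = build_transcription_request(event)
    transcribe_client.start_transcription_job(**request)
    return {"job_name": request["TranscriptionJobName"]}
```

Keeping the request construction in a pure function makes the naming and path logic easy to check locally before wiring the handler to a real S3 trigger.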

The CloudFormation template and instructions for implementing the solution can be found in the GitHub repository.

Real-time inference with SageMaker

Real-time inference in SageMaker is designed for workloads with low latency requirements. SageMaker endpoints are fully managed and support multiple hosting and autoscaling options. After an endpoint is created, you can invoke it with the InvokeEndpoint API. The provided CloudFormation template creates a real-time endpoint with a default instance count of 1, but you can adjust this based on the expected load on the endpoint, within the service quota for the instance type. You can request service quota increases on the Service Quotas page of the AWS Management Console.

The following CloudFormation template snippet defines the SageMaker model, endpoint configuration, and endpoint using the ModelData and ImageURI of the Flan T5 XL model from JumpStart. You can explore more FMs in Getting Started with Amazon SageMaker JumpStart. To implement the solution with a different model, replace the ModelData and ImageURI parameters in the CloudFormation template with the S3 model artifact and container image URI of the desired model, respectively. Review the sample notebook on GitHub for sample code on how to retrieve the latest JumpStart model artifact in Amazon S3 and the corresponding public container image provided by SageMaker.

  # SageMaker Model
  SageMakerModel:
    Type: AWS::SageMaker::Model
    Properties:
      ModelName: !Sub ${AWS::StackName}-SageMakerModel
      Containers:
        - Image: !Ref ImageURI
          ModelDataUrl: !Ref ModelData
          Mode: SingleModel
          Environment: {
            "MODEL_CACHE_ROOT": "/opt/ml/model",
            "SAGEMAKER_ENV": "1",
            "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",
            "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code/",
            "TS_DEFAULT_WORKERS_PER_MODEL": "1"
          }
      EnableNetworkIsolation: true
      ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn

  # SageMaker Endpoint Config
  SageMakerEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      EndpointConfigName: !Sub ${AWS::StackName}-SageMakerEndpointConfig
      ProductionVariants:
        - ModelName: !GetAtt SageMakerModel.ModelName
          VariantName: !Sub ${SageMakerModel.ModelName}-1
          InitialInstanceCount: !Ref InstanceCount
          InstanceType: !Ref InstanceType
          InitialVariantWeight: 1.0
          VolumeSizeInGB: 40

  # SageMaker Endpoint
  SageMakerEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointName: !Sub ${AWS::StackName}-SageMakerEndpoint
      EndpointConfigName: !GetAtt SageMakerEndpointConfig.EndpointConfigName

Deploy the solution

For detailed steps on implementing the solution, follow the Implementation with CloudFormation section of the GitHub repository.

If you want to use a different instance type or more instances for your endpoint, submit a quota increase request for the desired instance type on the Service Quotas page of the AWS Management Console.

To use a different FM for the endpoint, replace the ImageURI and ModelData parameters in the CloudFormation template with the values for the corresponding FM.

Test the solution

After deploying the solution using the Lambda layer creation script and the CloudFormation template, you can test the architecture by uploading an audio or video meeting recording in any of the media formats supported by Amazon Transcribe. Complete the following steps:

In the Amazon S3 console, choose Buckets in the navigation pane.
From the list of S3 buckets, choose the S3 bucket created by the CloudFormation template named meeting-note-generator-demo-bucket-<aws-account-id>.
Choose Create folder.
For Folder name, enter the S3 prefix specified in the S3RecordingsPrefix CloudFormation template parameter (recordings by default).
Choose Create folder.
In the newly created folder, choose Upload.
Choose Add files and choose the meeting recording file to upload.
Choose Upload.
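The same upload can also be scripted. The following is a minimal sketch using the boto3 S3 client, assuming the bucket naming pattern and the default recordings prefix described in the preceding steps. The function names are hypothetical and are not part of the solution's repository.

```python
import os

# Default value of the S3RecordingsPrefix template parameter, per the steps above.
DEFAULT_RECORDINGS_PREFIX = "recordings"


def recording_destination(account_id, file_path, prefix=DEFAULT_RECORDINGS_PREFIX):
    """Return the (bucket, key) pair that a local recording should be uploaded to."""
    bucket = f"meeting-note-generator-demo-bucket-{account_id}"
    key = f"{prefix.strip('/')}/{os.path.basename(file_path)}"
    return bucket, key


def upload_recording(account_id, file_path, prefix=DEFAULT_RECORDINGS_PREFIX,
                     s3_client=None):
    """Upload a recording so that the S3 event starts the pipeline."""
    if s3_client is None:
        import boto3  # imported lazily; requires AWS credentials at call time
        s3_client = boto3.client("s3")
    bucket, key = recording_destination(account_id, file_path, prefix)
    s3_client.upload_file(file_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

For example, upload_recording("123456789012", "/tmp/standup.mp3") would place the file under the recordings/ prefix of the demo bucket for that (made-up) account ID.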

Now we can check if the transcription was successful.

In the Amazon Transcribe console, choose Transcription jobs in the navigation pane.
Verify that a transcription job with a name corresponding to the uploaded meeting recording has the status In progress or Complete.
When the status is Complete, go back to the Amazon S3 console and open the demo bucket.
In the S3 bucket, open the transcripts/ folder.
Download the generated text file to view the transcript.
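If you prefer to check the job status programmatically rather than in the console, the following sketch polls the Amazon Transcribe GetTranscriptionJob API until the job finishes. The function name, polling interval, and retry limit are assumptions chosen for illustration.

```python
import time


def wait_for_transcription(job_name, transcribe_client=None,
                           poll_seconds=10, max_polls=60):
    """Poll a transcription job until it reaches COMPLETED or FAILED."""
    if transcribe_client is None:
        import boto3  # imported lazily; requires AWS credentials at call time
        transcribe_client = boto3.client("transcribe")
    for _ in range(max_polls):
        response = transcribe_client.get_transcription_job(
            TranscriptionJobName=job_name
        )
        status = response["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return status
        # The job is still QUEUED or IN_PROGRESS; wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Transcription job {job_name} did not finish in time")
```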

We can also check the generated summary.

In the S3 bucket, open the notes/ folder.
Download the generated text file to view the generated summary.

Prompt engineering

Although LLMs have improved in recent years, the models can only accept inputs of limited length; therefore, passing in the full transcript of a meeting may exceed the model's limit and cause the invocation to fail. To design around this challenge, we can break the context into manageable pieces by limiting the number of tokens in each invocation context. In this sample solution, the transcript is split into smaller chunks with a maximum limit on the number of tokens per chunk. Each transcript chunk is then summarized using the Flan T5 XL model. Finally, the chunk summaries are combined to form the context for the final combined summary, as shown in the following diagram.

The following code from the GenerateMeetingNotes Lambda function uses the Natural Language Toolkit (NLTK) library to tokenize the transcript and then split it into chunks, each containing up to a certain number of tokens:

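import math

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Split the transcript into chunks of at most CHUNK_LENGTH tokens.
# `contents` (the parsed Amazon Transcribe output) and CHUNK_LENGTH
# are defined elsewhere in the Lambda function.
transcript = contents['results']['transcripts'][0]['transcript']
transcript_tokens = word_tokenize(transcript)

num_chunks = math.ceil(len(transcript_tokens) / CHUNK_LENGTH)
detokenizer = TreebankWordDetokenizer()
transcript_chunks = []
for i in range(num_chunks):
    # Slicing past the end of the list is safe, so the last chunk needs no special case.
    chunk_tokens = transcript_tokens[CHUNK_LENGTH * i:CHUNK_LENGTH * (i + 1)]
    transcript_chunks.append(detokenizer.detokenize(chunk_tokens))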

After the transcript is split into chunks, the following code invokes the SageMaker real-time inference endpoint to summarize each chunk:

# Summarize each chunk.
# `instruction` (the summarization prompt) and the two helper functions
# are defined elsewhere in the Lambda function.
chunk_summaries = []
for chunk in transcript_chunks:
    # The prompt is the chunk text followed by the instruction.
    text_input = "{}\n{}".format(chunk, instruction)
    payload = {
        "text_inputs": text_input,
        "max_length": 100,
        "num_return_sequences": 1,
        "top_k": 50,
        "top_p": 0.95,
        "do_sample": True
    }
    query_response = query_endpoint_with_json_payload(json.dumps(payload).encode('utf-8'))
    generated_texts = parse_response_multiple_texts(query_response)
    chunk_summaries.append(generated_texts[0])
    print(generated_texts[0])
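The two helper functions called in the preceding code are not shown in this post. The following is a plausible sketch of what they might look like, not the repository's implementation: it assumes the endpoint is called through the boto3 SageMaker runtime client's invoke_endpoint operation, that the endpoint name is available as a constant (the value here is hypothetical), and that the model returns a JSON body with a generated_texts list.

```python
import json

# Hypothetical endpoint name; the deployed stack defines the real one.
ENDPOINT_NAME = "my-stack-SageMakerEndpoint"


def query_endpoint_with_json_payload(encoded_json, endpoint_name=ENDPOINT_NAME,
                                     runtime_client=None):
    """Send a UTF-8 encoded JSON payload to a SageMaker real-time endpoint."""
    if runtime_client is None:
        import boto3  # imported lazily; requires AWS credentials at call time
        runtime_client = boto3.client("sagemaker-runtime")
    return runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=encoded_json,
    )


def parse_response_multiple_texts(query_response):
    """Extract the list of generated texts from an InvokeEndpoint response."""
    # Assumption: the model container responds with {"generated_texts": [...]}.
    model_predictions = json.loads(query_response["Body"].read())
    return model_predictions["generated_texts"]
```

Accepting the client as an optional argument is a design choice for this sketch: it lets you exercise the parsing logic with a stub before pointing the code at a live endpoint.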

Finally, the following code snippet combines the chunk summaries into the context used to generate a final summary:

# Create a combined summary
text_input="{}\n{}".format(' '.join(chunk_summaries), instruction)
payload = {
    "text_inputs": text_input,
    "max_length": 100,
    "num_return_sequences": 1,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True
}
query_response = query_endpoint_with_json_payload(json.dumps(payload).encode('utf-8'))
generated_texts = parse_response_multiple_texts(query_response)

results = {
    "summary": generated_texts,
    "chunk_summaries": chunk_summaries
}
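Taken together, the three steps above (chunk, summarize each chunk, summarize the summaries) can be tried out without an endpoint. The following self-contained sketch shows the same pattern with a pluggable summarize function. Note that it substitutes plain whitespace tokenization for NLTK, and that the function names and the default chunk length of 400 tokens are assumptions for illustration only.

```python
def split_into_chunks(text, chunk_length):
    """Split text into chunks of at most chunk_length whitespace-separated tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_length])
        for start in range(0, len(tokens), chunk_length)
    ]


def summarize_transcript(transcript, summarize, chunk_length=400):
    """Summarize each chunk, then summarize the concatenated chunk summaries.

    `summarize` is any callable that maps a string to a shorter string, for
    example a wrapper around a model endpoint.
    """
    chunks = split_into_chunks(transcript, chunk_length)
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    combined_summary = summarize(" ".join(chunk_summaries))
    return {"summary": combined_summary, "chunk_summaries": chunk_summaries}


if __name__ == "__main__":
    # Stand-in summarizer that keeps the first five words of its input.
    def first_five_words(text):
        return " ".join(text.split()[:5])

    demo_transcript = "word " * 1000
    notes = summarize_transcript(demo_transcript, first_five_words)
    print(len(notes["chunk_summaries"]), "chunk summaries")
```

One trade-off of this pattern is that the final pass only sees the chunk summaries, so a detail dropped from a chunk summary cannot reappear in the combined summary.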

The full GenerateMeetingNotes Lambda function can be found in the GitHub repository.

Clean up

To clean up the solution, complete the following steps:

Delete all objects in the demo S3 bucket and the log S3 bucket.
Delete the CloudFormation stack.
Delete the Lambda layer.

Conclusion

This post demonstrated how to use FMs in JumpStart to quickly build a serverless meeting notes generator architecture with AWS CloudFormation. Combined with AWS AI services like Amazon Transcribe and serverless technologies like Lambda, you can use FMs on JumpStart and Amazon Bedrock to build applications for various generative AI use cases.

For additional posts on ML on AWS, visit the AWS ML Blog.

About the Author

Eric Kim is a Solutions Architect (SA) at Amazon Web Services. He works with game developers and publishers to build scalable games and supporting services on AWS. He primarily focuses on applications of artificial intelligence and machine learning.