Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that makes it easy for you to add speech-to-text capabilities to your applications. Today, we are pleased to announce a next-generation system, powered by a multi-billion parameter speech foundation model, that extends automatic speech recognition to more than 100 languages. In this post, we look at some of the benefits of this system, how businesses use it, and how to get started. We also provide an example of the transcription output below.
Transcribe’s speech foundation model is trained using best-in-class self-supervised algorithms to learn the inherent universal patterns of human speech across languages and accents. It is trained on millions of hours of unlabeled audio data from over 100 languages. The training recipes are optimized through intelligent data sampling to balance the training data across languages, ensuring that traditionally underrepresented languages also reach high levels of accuracy.
Carbyne is a software company that develops cloud-based, mission-critical contact center solutions for emergency call response services. Carbyne’s mission is to help emergency services save lives, and language barriers can’t be allowed to stand in the way of that mission. Here’s how they use Amazon Transcribe to accomplish it:
“Carbyne Live Audio Translation, powered by AI, is directly aimed at helping improve emergency response for the 68 million Americans who speak a language other than English at home, in addition to the up to 79 million foreign visitors to the country annually. By leveraging Amazon Transcribe’s new multilingual foundation model-powered ASR, Carbyne will be even better positioned to democratize life-saving emergency services, because Every. Person. Counts.”
– Alex Dizengof, co-founder and CTO of Carbyne.
By leveraging the speech foundation model, Amazon Transcribe delivers a significant accuracy improvement of between 20% and 50% across most languages. For telephony speech, a challenging and data-scarce domain, the accuracy improvement is between 30% and 70%. In addition to the substantial gain in accuracy, the all-new ASR system also delivers readability improvements, with more accurate punctuation and capitalization. With the advent of generative AI, thousands of businesses are using Amazon Transcribe to unlock valuable insights from their audio content. With significantly improved accuracy and support for over 100 languages, Amazon Transcribe will positively impact all of these use cases. All new and existing customers using Amazon Transcribe in batch mode can access speech foundation model-powered speech recognition without requiring any changes to the API endpoint or input parameters.
The new ASR system offers several key features across all of the more than 100 languages related to ease of use, customization, user safety, and privacy. These include automatic punctuation, custom vocabulary, automatic language identification, speaker diarization, word-level confidence scores, and custom vocabulary filters. The system’s expanded support for different accents, noise environments, and acoustic conditions produces more accurate results, helping you integrate voice technologies into your applications more effectively.
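As a minimal sketch of how several of these features can be enabled on a batch job (assuming the boto3 SDK; the bucket, key, and job names below are hypothetical placeholders):

```python
import boto3

# Minimal sketch: enable automatic language identification, speaker
# diarization, and alternative results on a batch transcription job.
# Bucket, key, and job names are hypothetical placeholders.
transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="demo-feature-job",
    Media={"MediaFileUri": "s3://amzn-s3-demo-bucket/interview.mp3"},
    IdentifyLanguage=True,          # automatic language identification
    Settings={
        "ShowSpeakerLabels": True,  # speaker diarization
        "MaxSpeakerLabels": 2,
        "ShowAlternatives": True,   # alternative results (segments view)
        "MaxAlternatives": 2,
    },
)
```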
Thanks to Amazon Transcribe’s high accuracy across different accents and noise conditions, its support for a large number of languages, and its breadth of value-added features, thousands of businesses will be able to unlock valuable insights from their audio content, as well as increase the accessibility and discoverability of their audio and video content across multiple domains. For example, contact centers transcribe and analyze customer calls to identify insights and subsequently improve customer experience and agent productivity. Content producers and media distributors automatically generate captions using Amazon Transcribe to improve content accessibility.
Get started with Amazon Transcribe
You can use the AWS Command Line Interface (AWS CLI), the AWS Management Console, and various AWS SDKs for batch transcriptions, and continue to use the same StartTranscriptionJob API to gain the performance benefits of the enhanced ASR model without requiring any code or parameter changes on your part. For more information about using the AWS CLI and the console, see Transcription with the AWS CLI and Transcription with the AWS Management Console, respectively.
The first step is to upload your media files to an Amazon Simple Storage Service (Amazon S3) bucket, an object storage service built to store and retrieve any amount of data from anywhere. Amazon S3 offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at a very low cost. You can choose to save your transcription to your own S3 bucket or have Amazon Transcribe use a secure default bucket. For more information about using S3 buckets, see Creating, configuring, and working with Amazon S3 buckets.
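Putting these steps together, here is a minimal end-to-end sketch using boto3 (assuming default credentials; the local file, bucket, key, and job name are hypothetical placeholders):

```python
import time
import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")

# Upload the media file to an S3 bucket first.
s3.upload_file("meeting.wav", "amzn-s3-demo-bucket", "audio/meeting.wav")

# Start the batch job; the same call now runs on the enhanced ASR system.
transcribe.start_transcription_job(
    TranscriptionJobName="my-first-transcription-job",
    Media={"MediaFileUri": "s3://amzn-s3-demo-bucket/audio/meeting.wav"},
    LanguageCode="en-US",
    OutputBucketName="amzn-s3-demo-bucket",  # omit to use the secure default bucket
)

# Poll until the job finishes, then print the transcript location.
while True:
    job = transcribe.get_transcription_job(
        TranscriptionJobName="my-first-transcription-job"
    )
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```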
Transcription output
Amazon Transcribe uses a JSON representation for its output. It provides the transcription result in two different formats: a text format and a detailed format. Nothing changes with regard to the API endpoint or input parameters.
The text format provides the transcript as a single block of text, whereas the detailed format provides the transcript as a series of time-ordered transcribed items, along with additional metadata per item. Both formats exist in parallel in the output file.
Depending on the features you select when creating the transcription job, Amazon Transcribe creates additional, rich views of the transcription output. See the following example code:
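The following is an illustrative sketch of the output shape, written as a Python dict with placeholder values (the exact fields vary with the job settings and are not actual service output):

```python
# Illustrative sketch of a Transcribe batch output; all values are
# placeholders, and the exact fields depend on the features selected.
example_output = {
    "jobName": "my-first-transcription-job",
    "accountId": "111122223333",  # placeholder account ID
    "status": "COMPLETED",
    "results": {
        # Text format: the full transcript as a single block.
        "transcripts": [{"transcript": "Hello and welcome to the demo."}],
        # Detailed format: time-ordered items with per-item metadata.
        "items": [
            {
                "type": "pronunciation",
                "start_time": "0.04",
                "end_time": "0.52",
                "alternatives": [{"content": "Hello", "confidence": "0.998"}],
            },
            # ... more items ...
        ],
    },
}
```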
The views are as follows:
- Transcripts – Represented by the transcripts element, this view contains only the text format of the transcript. In multi-speaker, multi-channel scenarios, the concatenation of all transcripts is provided as a single block.
- Speakers – Represented by the speaker_labels element, this view contains the text and detailed formats of the transcript, grouped by speaker. It is available only when the speaker diarization feature is enabled.
- Channels – Represented by the channel_labels element, this view contains the text and detailed formats of the transcript, grouped by channel. It is available only when the channel identification feature is enabled.
- Items – Represented by the items element, this view contains only the detailed format of the transcript. In multi-speaker, multi-channel scenarios, items are enriched with additional properties indicating the speaker and the channel.
- Segments – Represented by the segments element, this view contains the text and detailed formats of the transcript, grouped by alternative transcription. It is available only when the alternative results feature is enabled.
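As a minimal sketch of consuming these views (assuming a completed job; the transcript URI is a hypothetical placeholder you would replace with your job’s TranscriptFileUri):

```python
import json
import urllib.request

# Placeholder URI; use the TranscriptFileUri returned by your job.
transcript_file_uri = (
    "https://s3.amazonaws.com/amzn-s3-demo-bucket/my-first-transcription-job.json"
)

with urllib.request.urlopen(transcript_file_uri) as resp:
    results = json.load(resp)["results"]

# Transcripts view: the full text as a single block.
print(results["transcripts"][0]["transcript"])

# Items view: time-ordered items with word-level confidence scores.
for item in results["items"]:
    if item["type"] == "pronunciation":
        best = item["alternatives"][0]
        print(item["start_time"], best["content"], best["confidence"])
```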
Conclusion
At AWS, we constantly innovate on behalf of our customers. By expanding language support in Amazon Transcribe to more than 100 languages, we enable our customers to serve users from diverse linguistic backgrounds. This not only improves accessibility, but also opens new avenues of communication and information exchange on a global scale. For more information about the capabilities discussed in this post, see the Amazon Transcribe features page and the What’s New post.
About the authors
Sumit Kumar is a Senior Product Manager, Technical, on the AWS AI Language Services team. He has 10 years of product management experience across a variety of domains and is passionate about AI/ML. Outside of work, Sumit loves to travel and enjoys playing cricket and tennis.
Vivek Singh is a Senior Manager of Product Management on the AWS AI Language Services team. He leads the Amazon Transcribe product team. Before joining AWS, he held product management roles in other Amazon organizations, including consumer payments and retail. Vivek lives in Seattle, WA, and enjoys running and hiking.