Podcasting has become a popular and powerful medium for storytelling, news and entertainment. Without transcriptions, podcasts may be inaccessible to people who are hard of hearing, deaf, or deafblind. However, ensuring that automatically generated podcast transcripts are readable and accurate is a challenge. The text should accurately reflect the meaning of what was discussed and be easy to read. The Apple Podcasts catalog contains millions of podcast episodes, which we transcribe using automatic speech recognition (ASR) models. To assess the quality of our ASR output, we compared a small number of human-generated or reference transcripts to corresponding ASR transcripts.
The industry standard for measuring transcription accuracy, word error rate (WER), lacks nuance. It equally penalizes all errors in ASR text (insertions, deletions, and substitutions) regardless of their impact on readability. Furthermore, reference text is subjective: it is based on what the human transcriber discerns while listening to the audio.
Leveraging existing research on improved readability metrics, we set ourselves the challenge of developing a more nuanced quantitative assessment of the readability of ASR passages. As shown in Figure 1, our solution is the Human Evaluation Word Error Rate (HEWER) metric. HEWER focuses on major errors, those that negatively affect readability, such as misspelled proper nouns, capitalization errors, and certain punctuation errors. HEWER ignores minor errors, such as filler words (“um,” “yeah,” “like”) or alternative spellings (“ok” versus “okay”). We found that for an 800-segment American English test set taken from 61 podcast episodes, with an average ASR transcript WER of 9.2%, the HEWER was only 1.4%, indicating that the ASR transcripts were of higher quality and more readable than WER might suggest.
Our findings provide data-driven insights that we hope have laid the foundation for improving the accessibility of Apple Podcasts for millions of users. Additionally, Apple's engineering and product teams can use these insights to help connect audiences with more of the content they're looking for.
Selecting sample podcast segments
We worked with human annotators to identify and classify errors in 800 American English podcast segments extracted from manually transcribed episodes with a WER of less than 15%. We chose this maximum WER to ensure that the ASR transcripts in our evaluation samples:
- Met the quality threshold we expect for any transcript shown to an Apple Podcasts audience.
- Required our annotators to spend no more than 5 minutes classifying errors as major or minor.
Of the 66 podcast episodes in our initial data set, 61 met this criterion, representing 32 unique podcast shows. Figure 2 shows the selection process.
For example, an episode in our initial data set from the podcast show Is This Racist?, titled “Cody's Marvel dot Ziglar (with Cody Ziglar),” had a WER of 19.2% and was excluded from our evaluation. But we included an episode from the same show titled “I'm Not Trying to Blow Up the Plantation, But…,” which had a WER of 14.5%.
Segments with relatively higher episode WER were given greater weight in the selection process, because such episodes can provide more information than episodes whose ASR transcripts are nearly perfect. The average episode WER across all segments was 7.5%, while the selected segments' average WER was 9.2%. Each audio segment was approximately 30 seconds long, which provided enough context for the annotators to understand the segment without making the task too tiring. Additionally, our goal was to select segments that began and ended at a phrase boundary, such as a sentence break or a long pause.
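The exact weighting scheme is not spelled out here; as a rough illustration only, the Python sketch below (with a hypothetical `segments` list and `episode_wer` field) draws candidate segments with probability proportional to their episode's WER, so that noisier episodes contribute more evaluation examples.

```python
import random

# Hypothetical candidate records: each 30-second segment carries its episode-level WER.
segments = [
    {"episode": "ep-01", "start_sec": 120, "episode_wer": 0.030},
    {"episode": "ep-02", "start_sec": 310, "episode_wer": 0.145},
    {"episode": "ep-03", "start_sec": 45,  "episode_wer": 0.090},
]

def sample_segments(candidates, k, seed=0):
    """Draw k segments, weighting each candidate by its episode WER so that
    episodes with more ASR errors are more likely to be inspected."""
    rng = random.Random(seed)
    weights = [max(c["episode_wer"], 1e-6) for c in candidates]  # avoid zero weights
    return rng.choices(candidates, weights=weights, k=k)

picked = sample_segments(segments, k=2)  # sampling is with replacement; dedupe if needed
```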
Evaluating major and minor errors in transcript samples
WER is a widely used measure of the performance of speech recognition and machine translation systems. It divides the total number of errors in the automatically generated text by the total number of words in the human-generated (reference) text. Unfortunately, the WER score gives equal weight to all ASR errors (insertions, substitutions, and deletions), which can be misleading. For example, a passage with a high WER may still be readable or even indistinguishable in semantic content from the reference transcription, depending on the types of errors. Previous research on readability has focused on subjective and imprecise metrics. For example, in their paper “A Metric for Evaluating Speech Recognizer Output Based on Human Perception Model,” Nobuyasu Itoh and his team proposed a scoring rubric on a scale of 0 to 5, with 0 being the highest quality. Participants in their experiment were first presented with automatically generated text without the corresponding audio and were asked to rate the transcripts based on how easy they were to understand. They then listened to the audio and rated the transcriptions for perceived accuracy.
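For readers unfamiliar with the calculation, here is a minimal WER sketch based on word-level edit distance; it is illustrative only and omits the text normalization (case folding, punctuation stripping) that a production scorer would apply.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()

    # Standard Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quik brown fox jumps"))  # 1 substitution + 1 insertion over 4 words = 0.5
```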
Other research on readability, such as “The Future of Word Error Rate,” has, to our knowledge, not been applied to any dataset at scale. To address these limitations, our researchers developed a new metric for measuring readability, HEWER, which builds on the WER scoring system.
The HEWER score provides human-centered information by taking readability nuances into account. Figure 3 shows three versions of a 30-second sample segment from the transcripts of “The Pack,” the April 23, 2021 episode of the podcast show This American Life.
Our data set comprised 30-second audio segments from a superset of 66 podcast episodes, along with each segment's corresponding reference and model-generated transcripts. The human annotators began by identifying spelling, punctuation, or transcription errors and classifying as “major errors” only those that (see the illustrative sketch after this list):
- Changed the meaning of the text.
- Affected the readability of the text.
- Misspelled proper nouns.
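As a rough illustration only (the real decisions were made by human annotators), the sketch below encodes a few of these rules as heuristics; the `FILLERS` set and the `ALT_SPELLINGS` table are hypothetical stand-ins rather than the actual annotation guidelines.

```python
FILLERS = {"um", "uh", "yeah", "like"}   # hypothetical filler-word list
ALT_SPELLINGS = {"okay": {"ok"}}         # hypothetical table of accepted spelling variants

def classify_error(ref_token, hyp_token):
    """Label one aligned reference/hypothesis difference as 'major' or 'minor'.
    None means the token is absent on that side (an insertion or deletion)."""
    # Dropped or inserted filler words do not hurt readability.
    if (hyp_token is None and ref_token and ref_token.lower() in FILLERS) or \
       (ref_token is None and hyp_token and hyp_token.lower() in FILLERS):
        return "minor"
    # Comma-placement differences are treated as minor in this sketch.
    if ref_token in (",", None) and hyp_token in (",", None):
        return "minor"
    if ref_token is not None and hyp_token is not None:
        # Same word, different casing: capitalization errors hurt readability.
        if ref_token.lower() == hyp_token.lower():
            return "major"
        # Accepted alternative spellings of the same word are tolerated.
        if hyp_token.lower() in ALT_SPELLINGS.get(ref_token.lower(), set()):
            return "minor"
        # A missing hyphen that leaves the words intact is minor.
        if ref_token.replace("-", "") == hyp_token.replace("-", ""):
            return "minor"
    # Everything else (misspelled proper nouns, meaning-changing words, etc.) is major.
    return "major"
```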
WER and HEWER are calculated from an alignment of the reference and model-generated text. Figure 3 shows each metric's score for the same output. WER counts as errors all words that differ between the reference text and the model-generated text, but ignores case and punctuation. HEWER, on the other hand, takes both case and punctuation into account; the total number of tokens, shown in the denominator, is therefore larger, because each punctuation mark counts as one token.
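Continuing the sketch above, and assuming the alignment produces (reference token, hypothesis token) pairs in which punctuation marks are separate tokens and `None` marks an insertion or deletion, HEWER could then be computed by counting only the major errors over all reference tokens. This is an illustration, not the production scoring code.

```python
def hewer(aligned_pairs):
    """aligned_pairs: list of (ref_token, hyp_token) tuples over case-sensitive
    tokens that include punctuation marks as tokens in their own right."""
    total_tokens = sum(1 for ref, _ in aligned_pairs if ref is not None)
    major_errors = sum(
        1
        for ref, hyp in aligned_pairs
        if ref != hyp and classify_error(ref, hyp) == "major"
    )
    return major_errors / max(total_tokens, 1)
```

Because only the major errors count toward the numerator while the denominator grows to include punctuation tokens, a score computed this way comes out much lower than WER for the same segment.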
Unlike WER, HEWER ignores minor errors, such as filler words like “uh,” which are present only in the reference transcript, or an alternative spelling of “until” in the model-generated text. Additionally, HEWER ignores differences in comma placement that do not affect readability or meaning, as well as missing hyphens. The only major errors in the Figure 3 HEWER sample are the misspellings of “quarantine” and “antivirals.”
In this case, the WER is somewhat high, at 9.4%. However, that value gives a false impression of the quality of the model-generated transcript, which is actually quite readable. The HEWER value of 2.2% better reflects the human experience of reading the transcript.
Conclusion
Given the rigidity and limitations of WER, the established industry standard for measuring ASR accuracy, we built on existing research to create HEWER, a more nuanced quantitative assessment of the readability of ASR passages. We applied this new metric to a data set of sample segments of automatically generated transcripts of podcast episodes to gain insight into the readability of the transcripts and to help ensure the greatest accessibility and best possible experience for all Apple Podcasts audiences and creators.
Acknowledgments
Many people contributed to this research, including Nilab Hessabi, Sol Kim, Filipe Minho, Issey Masuda Mora, Samir Patel, Alejandro Woodward Riquelme, João Pinto Carrilho Do Rosario, Clara Bonnin Rossello, Tal Singer, Eda Wang, Anne Wootton, Regan Xu, and Phil Zepeda.
Apple Resources
Apple Newsroom. 2024. “Apple Introduces Transcripts for Apple Podcasts.” (link.)
Apple Podcasts. n.d. “Endless topics. Endlessly engaging.” (link.)
External references
Glass, Ira, host. 2021. “The Pack.” This American Life. Podcast 736, April 23, 58:56. (link.)
Hughes, John. 2022. “The Future of Word Error Rate (WER).” Speech. (link.)
Itoh, Nobuyasu, Gakuto Kurata, Ryuki Tachibana, and Masafumi Nishimura. 2015. “A Metric for Evaluating Speech Recognizer Output Based on Human Perception Model.” 16th Annual Conference of the International Speech Communication Association (Interspeech 2015): Speech Beyond Speech: Towards a Better Understanding of the Most Important Biosignal, 1285–88. (link.)