In 2019, we launched Recorder, an audio recording app for Pixel phones that helps users create, manage, and edit audio recordings. It takes advantage of recent developments in on-device machine learning to transcribe speech, recognize audio events, suggest tags for titles, and help users navigate transcripts.
However, some Recorder users found it difficult to navigate long recordings that have multiple speakers because it's not clear who said what. At this year's Made By Google event, we announced the "speaker labels" feature for the Recorder app. This optional feature annotates a recording transcript with unique, anonymous labels for each speaker (e.g., "Speaker 1", "Speaker 2") in real time during the recording, which significantly improves the readability and usability of recording transcripts. This feature is powered by Google's new speaker diarization system named Turn-to-Diarize, which was first presented at ICASSP 2022.
Left: Recorder transcript without speaker labels. Right: Recorder transcript with speaker labels. |
System architecture
Our speaker diarization system leverages several highly optimized machine learning models and algorithms to enable diarizing hours of audio in a real-time streaming fashion with limited computational resources on mobile devices. The system mainly consists of three components: a speaker turn detection model that detects a change of speaker in the input speech, a speaker encoder model that extracts the voice characteristics of each speaker turn, and a multi-stage clustering algorithm that annotates speaker labels for each speaker turn in a highly efficient way. All components run fully on the device.
Turn-to-Diarize system architecture. |
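The sketch below illustrates how these three stages could be composed into a single pipeline. The class and method names (detect_turns, embed, cluster) are hypothetical and only show the data flow between the on-device components, not the actual implementation.

```python
# A minimal sketch of the Turn-to-Diarize pipeline described above.
# Component interfaces are hypothetical and only illustrate the data flow.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeakerTurn:
    text: str                              # words spoken within the turn
    embedding: list = field(default_factory=list)  # d-vector for the turn
    speaker: int = -1                      # speaker label assigned by clustering

class TurnToDiarizePipeline:
    def __init__(self, turn_detector, speaker_encoder, clusterer):
        self.turn_detector = turn_detector      # Transformer Transducer with <st> tokens
        self.speaker_encoder = speaker_encoder  # per-turn d-vector model
        self.clusterer = clusterer              # multi-stage clustering

    def process(self, audio_frames) -> List[SpeakerTurn]:
        # 1. Segment the audio into homogeneous speaker turns.
        turns = self.turn_detector.detect_turns(audio_frames)
        # 2. Extract one embedding (d-vector) per speaker turn.
        for turn in turns:
            turn.embedding = self.speaker_encoder.embed(turn)
        # 3. Cluster the embeddings and assign anonymous speaker labels.
        labels = self.clusterer.cluster([t.embedding for t in turns])
        for turn, label in zip(turns, labels):
            turn.speaker = label
        return turns
```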
Speaker turn detection
The first component of our system is a speaker turn detection model based on a Transformer Transducer (T-T), which converts acoustic features into text transcripts augmented with a special token <st> representing a speaker turn. Unlike previous customized systems that use role-specific tokens (e.g., <doctor> and <patient>) for conversations, this model is more generic and can be trained on and deployed to different application domains.
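To illustrate the augmented transcript format, the short sketch below splits a transcript containing <st> tokens into individual speaker turns; the helper function and example transcript are hypothetical.

```python
# A minimal sketch of how an <st>-augmented transcript can be split into
# speaker turns. The transcript and helper are illustrative only.

def split_speaker_turns(augmented_transcript: str, turn_token: str = "<st>"):
    """Split a transcript augmented with speaker-turn tokens into turns."""
    turns = [t.strip() for t in augmented_transcript.split(turn_token)]
    return [t for t in turns if t]  # drop empty segments

# Example usage with a hypothetical transcript:
transcript = "hello how are you <st> good thanks and you <st> doing well"
print(split_speaker_turns(transcript))
# ['hello how are you', 'good thanks and you', 'doing well']
```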
In most applications, the output of a diarization system is not directly shown to users, but is instead combined with a separate Automatic Speech Recognition (ASR) system that is trained to have smaller word errors. Therefore, for the diarization system, we are relatively more tolerant of word token errors than of <st> token errors. Based on this intuition, we propose a new token-level loss function that allows us to train a small speaker turn detection model with high precision on the predicted <st> tokens. Combined with edit-based minimum Bayes risk (EMBR) training, this new loss function significantly improved the interval-based F1 score on seven evaluation datasets.
Speech Feature Extraction
Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector (i.e., d-vector) to represent the voice characteristics of each speaker turn. This approach has several advantages over prior work, which extracts embedding vectors from small fixed-length segments. First, it avoids extracting an embedding from a segment that contains speech from multiple speakers. At the same time, each embedding covers a relatively large time range that contains sufficient signal from the speaker. It also reduces the total number of embeddings to be clustered, which makes the clustering step less expensive. These embeddings are processed entirely on the device until the speaker labeling of the transcript is completed, and are then deleted.
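As a rough illustration, the sketch below pools hypothetical frame-level embeddings into one L2-normalized d-vector per speaker turn; a real speaker encoder model would instead run directly on the audio of each turn.

```python
import numpy as np

# A minimal sketch of producing one embedding (d-vector) per speaker turn.
# frame_embeddings and turn_boundaries are hypothetical inputs for illustration.

def turn_embeddings(frame_embeddings: np.ndarray, turn_boundaries):
    """frame_embeddings: [num_frames, dim]; turn_boundaries: list of (start, end) frame indices."""
    d_vectors = []
    for start, end in turn_boundaries:
        segment = frame_embeddings[start:end]
        d_vector = segment.mean(axis=0)                # pool frames within the turn
        d_vector /= np.linalg.norm(d_vector) + 1e-8    # L2-normalize the d-vector
        d_vectors.append(d_vector)
    return np.stack(d_vectors)
```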
Multi-Stage Clustering
After the audio recording is represented by a sequence of embedding vectors, the last step is to cluster these embedding vectors and assign a speaker label to each of them. However, since audio recordings in the Recorder app can be as short as a few seconds or as long as 18 hours, it is critical that the clustering algorithm handle sequences of drastically different lengths.
For this we propose a multi-stage clustering strategy to leverage the benefits of different clustering algorithms. First, we use the speaker turn detection outputs to determine whether there are at least two different speakers in the recording. For short sequences, we use agglomerative hierarchical clustering (AHC) as the fallback algorithm. For medium-length sequences, we use spectral clustering as our main algorithm, and we use the eigengap criterion to accurately estimate the number of speakers. For long sequences, we reduce the computational cost by using AHC to pre-cluster the sequence before sending it to the main algorithm. During streaming, we maintain a dynamic cache of previous AHC cluster centroids that can be reused for future clustering calls. This mechanism allows us to enforce an upper bound on the computational cost of the entire system, with constant time and space complexity.
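A minimal sketch of this dispatch logic is shown below, using off-the-shelf AHC and spectral clustering implementations. The length thresholds are hypothetical, the eigengap-based speaker counting is not shown, and the centroid cache used during streaming is omitted for brevity.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import SpectralClustering

# Sketch of the multi-stage strategy: AHC as fallback for short sequences,
# spectral clustering for medium sequences, and AHC pre-clustering to cap the
# cost of long sequences. All thresholds are hypothetical.

SHORT_MAX = 20     # up to this many embeddings: fall back to AHC
LONG_MIN = 500     # beyond this many embeddings: pre-cluster with AHC first

def cluster_embeddings(embeddings: np.ndarray, num_speakers: int) -> np.ndarray:
    n = len(embeddings)
    if n <= SHORT_MAX:
        # Fallback: agglomerative hierarchical clustering (AHC).
        tree = linkage(embeddings, method="average", metric="cosine")
        return fcluster(tree, t=num_speakers, criterion="maxclust")
    if n > LONG_MIN:
        # Pre-cluster with AHC to reduce the number of points, then cluster
        # the resulting centroids with the main algorithm.
        tree = linkage(embeddings, method="average", metric="cosine")
        pre_labels = fcluster(tree, t=LONG_MIN, criterion="maxclust")
        centroids = np.stack(
            [embeddings[pre_labels == c].mean(axis=0) for c in np.unique(pre_labels)]
        )
        centroid_labels = cluster_embeddings(centroids, num_speakers)
        return centroid_labels[pre_labels - 1]
    # Main algorithm: spectral clustering (the speaker count would come from
    # the eigengap criterion; here it is passed in for simplicity).
    return SpectralClustering(
        n_clusters=num_speakers, affinity="nearest_neighbors"
    ).fit_predict(embeddings)
```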
This multi-stage clustering strategy is a critical optimization for on-device applications, where CPU, memory, and battery budgets are very limited; it allows the system to run in a low-power mode even after diarizing hours of audio. As a tradeoff between quality and efficiency, the upper bound of the computational cost can be flexibly configured for devices with different computational resources.
Diagram of the multi-stage clustering strategy. |
Correction and Personalization
In our real-time streaming speaker diarization system, as the model consumes more audio input, it builds up confidence in the predicted speaker labels and may occasionally make corrections to previously predicted low-confidence speaker labels. The Recorder app automatically updates the speaker labels on the screen during recording to reflect the latest and most accurate predictions.
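This correction behavior can be illustrated with a small hypothetical sketch: given the labels currently shown on screen, the latest predictions, and per-turn confidence scores, only low-confidence earlier labels are overwritten. The data structures and threshold are assumptions for illustration, not the Recorder app's actual implementation.

```python
# Illustrative sketch of correcting previously displayed low-confidence labels
# when a newer prediction disagrees. Threshold and inputs are hypothetical.

def update_displayed_labels(displayed, latest, confidences, threshold=0.8):
    """Overwrite earlier labels only where the old prediction had low confidence."""
    updated = list(displayed)
    for i, (old, new) in enumerate(zip(displayed, latest)):
        if old != new and confidences[i] < threshold:
            updated[i] = new  # correct a low-confidence earlier prediction
    return updated
```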
At the same time, the Recorder app UI allows the user to rename the anonymous speaker labels (e.g., "Speaker 2") to customized labels (e.g., "car dealer") for better readability and easier memorization within each recording.
Recorder allows the user to rename speaker labels for better readability. |
Future work
Currently, our diarization system runs mostly on the CPU block of Google Tensor, Google's custom chip that powers the latest Pixel phones. We are working on delegating more computations to the TPU block, which will further reduce the overall power consumption of the diarization system. Another direction of future work is to leverage the multilingual capabilities of the speaker encoder and speech recognition models to extend this feature to more languages.
Acknowledgments
The work described in this post represents the joint efforts of multiple teams within Google. Contributors include Quan Wang, Yiling Huang, Evan Clark, Qi Cao, Han Lu, Guanlong Zhao, Wei Xia, Hasim Sak, Alvin Zhou, Jason Pelecanos, Luiza Timariu, Allen Su, Fan Zhang, Hugh Love, Kristi Bradford, Vincent Peng, Raff Tsai, Richard Chou, Yitong Lin, Ann Lu, Kelly Tsai, Hannah Bowman, Tracy Wu, Taral Joglekar, Dharmesh Mokani, Ajay Dudani, Ignacio Lopez Moreno, Diego Melendo Casado, Nino Tasca, Alex Gruenstein.