Audio deepfakes have had a recent bout of bad press after an AI-generated robocall purporting to be Joe Biden's voice hit New Hampshire residents, urging them not to cast ballots. Meanwhile, spear-phishers (phishing campaigns that target a specific person or group, especially using information known to be of interest to the target) go fishing for money, and actors aim to preserve their audio likeness.
What gets less press, however, are some of the uses of audio deepfakes that could actually benefit society. In this Q&A prepared for MIT News, postdoc Nauman Dawalatabad addresses the concerns and potential benefits of the emerging technology. A more complete version of this interview can be seen in the video below.
Q: What ethical considerations justify concealing the identity of the source speaker in audio deepfakes, especially when this technology is used to create innovative content?
A: The question of why it is important to obscure the identity of the source speaker, despite the primary use of generative models for audio creation being in entertainment, for example, does raise ethical considerations. Speech contains information not only about “who are you?” (identity) or “what are you talking about?” (content); it encapsulates a wealth of sensitive information, including age, gender, accent, current health, and even cues about future health conditions. For example, our recent research article on “Detection of dementia from long neuropsychological interviews” demonstrates the feasibility of detecting dementia from speech with considerably high accuracy. Additionally, multiple models can infer gender, accent, age, and other attributes from speech with very high accuracy. There is a need for technology that safeguards against the involuntary disclosure of such private data. The effort to anonymize the source speaker's identity is not simply a technical challenge but a moral obligation to preserve individual privacy in the digital age.
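To make concrete how readily such attributes can be inferred, here is a minimal, hypothetical sketch of an attribute classifier built on MFCC features; the synthetic tones and binary labels below are stand-ins for real speech corpora, and the sketch does not reflect the specific models cited above.

```python
# Minimal sketch: inferring a speaker attribute from audio features.
# Synthetic tones stand in for real speech; real systems use far richer
# features and models than mean MFCC vectors plus logistic regression.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mfcc_embedding(y, sr=16000):
    # Summarize a waveform as its mean MFCC vector.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
# Two toy "speaker groups": low-pitched vs. high-pitched signals.
lows = [np.sin(2 * np.pi * (100 + 10 * i) * t).astype(np.float32) for i in range(5)]
highs = [np.sin(2 * np.pi * (220 + 10 * i) * t).astype(np.float32) for i in range(5)]

X = np.stack([mfcc_embedding(y) for y in lows + highs])
labels = [0] * 5 + [1] * 5
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([mfcc_embedding(lows[0])]))  # expected: [0]
```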
Q: How can we effectively overcome the challenges posed by audio spoofing in phishing attacks, taking into account the associated risks, the development of countermeasures, and the advancement of detection techniques?
A: The use of audio deepfakes in phishing attacks introduces multiple risks, including the spread of misinformation and fake news, identity theft, privacy violations, and malicious alteration of content. The recent circulation of scam robocalls in Massachusetts exemplifies the detrimental impact of this technology. We also recently spoke with The Boston Globe about how easy and economical it is to generate such deepfake audio.
Anyone without significant technical training can easily generate such audio with the many tools available online. Fake news produced by deepfake generators can disrupt financial markets and even election outcomes. Voice theft used to access voice-operated bank accounts, and the unauthorized use of a voice identity for financial gain, are reminders of the urgent need for robust countermeasures. Other risks include privacy violation, where an attacker uses a victim's audio without permission or consent, and content alteration, where an attacker modifies the original audio, which can have serious consequences.
Two prominent directions have emerged in designing systems to detect fake audio: artifact detection and liveness detection. When audio is generated by a generative model, the model introduces artifacts into the signal, and researchers design algorithms and models to detect them. This approach faces challenges, however, as deepfake audio generators grow more sophisticated; in the future we may see models that leave very few or almost no artifacts. Liveness detection, by contrast, leverages inherent qualities of natural speech, such as breathing patterns, intonation, and rhythm, which are difficult for AI models to replicate accurately. Companies such as Pindrop are developing solutions of this kind to detect audio fakes.
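As a rough illustration of the artifact-detection direction, here is a minimal sketch, not any particular production system, of a binary classifier that takes log-mel spectrograms and outputs bona fide versus deepfake logits; the architecture and sizes are illustrative assumptions.

```python
# Minimal sketch of artifact detection: a small CNN classifies log-mel
# spectrograms as bona fide or deepfake. Layer sizes are illustrative;
# real detectors are trained on large corpora of genuine and generated audio.
import torch
import torch.nn as nn

class ArtifactDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 2)  # logits: [bona fide, deepfake]

    def forward(self, log_mel):  # shape: (batch, 1, n_mels, time)
        return self.head(self.conv(log_mel).flatten(1))

model = ArtifactDetector()
dummy = torch.randn(4, 1, 80, 200)  # batch of 4 log-mel spectrograms
print(model(dummy).shape)  # torch.Size([4, 2])
```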
Additionally, strategies such as audio watermarking serve as proactive defenses, embedding encrypted identifiers in the original audio to trace its origin and deter tampering. Despite remaining vulnerabilities, such as the risk of replay attacks, ongoing research and development in this area offers promising ways to mitigate the threats posed by audio deepfakes.
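To show the watermarking idea at its simplest, here is a toy sketch that hides an identifier in the least significant bits of 16-bit PCM samples; real watermarking schemes are perceptually shaped, encrypted, and robust to compression and replay, which this deliberately fragile version is not.

```python
# Toy illustration of audio watermarking: hide an identifier in the least
# significant bits of 16-bit PCM samples. Deliberately simplistic; real
# schemes embed robust, encrypted marks that survive compression and replay.
import numpy as np

def embed(samples: np.ndarray, ident: bytes) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(ident, dtype=np.uint8))
    out = samples.copy()
    out[: len(bits)] = (out[: len(bits)] & ~1) | bits  # overwrite LSBs
    return out

def extract(samples: np.ndarray, n_bytes: int) -> bytes:
    bits = (samples[: n_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()

audio = np.random.randint(-32768, 32767, 48000, dtype=np.int16)  # 3 s at 16 kHz
marked = embed(audio, b"ID42")
assert extract(marked, 4) == b"ID42"
```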
Q: Despite its potential for misuse, what are some of the positive aspects and benefits of audio deepfake technology? How do you envision the future relationship between AI and our experience of audio perception evolving?
A: Contrary to the predominant focus on the nefarious applications of audio deepfakes, the technology holds immense potential for positive impact across sectors. Beyond the realm of creativity, where voice-conversion technologies enable unprecedented flexibility in entertainment and media, audio deepfakes hold transformative promise in healthcare and education. My current work on anonymizing patient and clinician voices in cognitive healthcare interviews, for example, facilitates the global sharing of medical data crucial to research while ensuring privacy, and sharing this data among researchers fosters progress in cognitive healthcare. The application of this technology in voice restoration offers hope to people with speech impairments, for example from ALS or dysarthric speech, improving communication abilities and quality of life.
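As a crude sketch of this anonymization idea, the snippet below shifts a recording's pitch to mask obvious speaker-identity cues; the file name is hypothetical, and serious anonymization pipelines, including those used in clinical work, rely on learned voice conversion rather than simple pitch shifting.

```python
# Crude voice-anonymization sketch: pitch-shift a recording to mask obvious
# speaker-identity cues. Illustrative only; clinical-grade anonymization uses
# learned voice conversion, not a fixed pitch shift.
import librosa
import soundfile as sf

y, sr = librosa.load("interview.wav", sr=16000)  # hypothetical input file
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)  # up 4 semitones
sf.write("interview_anonymized.wav", shifted, sr)
```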
I am very positive about the future impact of audio generative AI models. The future interaction between AI and audio perception is poised for revolutionary advances, particularly through the lens of psychoacoustics, the study of how humans perceive sound. Innovations in augmented and virtual reality, exemplified by devices like the Apple Vision Pro and others, are pushing the boundaries of audio experience toward unparalleled realism. We have recently seen an exponential increase in the number of sophisticated models, with new ones appearing almost every month. This rapid pace of research and development promises not only to refine these technologies but also to expand their applications in ways that profoundly benefit society. Despite the inherent risks, the potential for audio generative AI models to revolutionize healthcare, entertainment, education, and more is a testament to the positive trajectory of this field of research.