Whether you're describing the sound of your car's engine breaking down or meowing like your neighbor's cat, imitating sounds with your voice can be a useful way to convey a concept when words don't work.
Vocal imitation is the sonic equivalent of scribbling a quick image to communicate something you saw, except instead of using a pencil to illustrate an image, you use your vocal tract to express a sound. This may seem difficult, but it's something we all do intuitively: to experience it for yourself, try using your voice to reflect the sound of an ambulance siren, a crow, or the ringing of a bell.
Inspired by the cognitive science of how we communicate, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system that can produce human-like vocal imitations with no training, and without ever having “heard” a human vocal impression before.
To achieve this, the researchers designed their system to produce and interpret sounds much like we do. They began by building a model of the human vocal tract that simulates how vibrations from the larynx are shaped by the throat, tongue, and lips. They then used a cognitively inspired AI algorithm to control this vocal tract model and make it produce imitations, taking into account the context-specific ways humans choose to communicate sound.
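As a rough illustration of that first step, the sketch below builds a toy source-filter voice: a pulse train standing in for larynx vibration, shaped by resonant filters standing in for the throat, tongue, and lips. The function names and parameter values are hypothetical; the CSAIL model simulates the vocal tract in far more detail, but this conveys the kind of control knobs an imitation algorithm would adjust.

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate (Hz)

def glottal_source(f0, duration, sr=SR):
    """Impulse train at the larynx's fundamental frequency f0 (Hz)."""
    n = int(duration * sr)
    source = np.zeros(n)
    period = int(sr / f0)
    source[::period] = 1.0
    return source

def formant_filter(signal, formants, bandwidths, sr=SR):
    """Shape the source with resonances standing in for throat/tongue/lip configuration."""
    out = signal
    for f, bw in zip(formants, bandwidths):
        # Second-order resonator (two-pole filter) centered at formant frequency f.
        r = np.exp(-np.pi * bw / sr)
        theta = 2 * np.pi * f / sr
        a = [1.0, -2 * r * np.cos(theta), r ** 2]
        out = lfilter([1.0 - r], a, out)
    return out

# Hypothetical control parameters an imitation algorithm might adjust:
# larynx pitch plus two formant frequencies.
voice = formant_filter(glottal_source(f0=120, duration=0.5),
                       formants=[700, 1200], bandwidths=[80, 100])
```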
The model can effectively take many sounds from the world and generate a human imitation of them, including noises such as rustling leaves, the hissing of a snake, and the siren of an approaching ambulance. Their model can also be run in reverse to guess real-world sounds from human vocal imitations, similar to how some computer vision systems can recover high-quality images based on sketches. For example, the model can correctly distinguish the sound of a human imitating a cat's “meow” versus its “hiss.”
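The reverse direction can be pictured as a matching procedure: summarize the imitation's coarse spectral features and return the candidate real-world sound whose features lie closest. This is only a minimal sketch under that assumption; the feature summary and the `candidate_sounds` dictionary are illustrative stand-ins, not the paper's actual inference method.

```python
import numpy as np

def spectral_features(audio, n_bands=32):
    """Coarse log-spectrum summary of a sound (a stand-in for richer auditory features)."""
    spectrum = np.abs(np.fft.rfft(audio))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([b.mean() for b in bands]))

def guess_source(imitation, candidate_sounds):
    """Pick the real-world sound whose features best match the vocal imitation."""
    imit_feat = spectral_features(imitation)
    scores = {name: -np.linalg.norm(imit_feat - spectral_features(audio))
              for name, audio in candidate_sounds.items()}
    return max(scores, key=scores.get)  # e.g. "cat_meow" vs. "cat_hiss"
```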
In the future, this model could lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn new languages.
Co-lead authors Kartik Chandra SM '23 and Karima Ma, both MIT CSAIL doctoral students, and undergraduate researcher Matthew Caren note that computer graphics researchers have long recognized that realism is rarely the end goal of visual expression. For example, an abstract painting or a child's crayon doodle can be as expressive as a photograph.
“In recent decades, advances in drawing algorithms have led to new tools for artists, advances in artificial intelligence and computer vision, and even a deeper understanding of human cognition,” notes Chandra. “In the same way that a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-phono-realistic ways humans express the sounds they hear. This teaches us about the process of auditory abstraction.”
“The goal of this project has been to understand and computationally model vocal imitation, which we consider to be a kind of auditory equivalent of drawing in the visual domain,” Caren says.
The art of imitation, in three parts
The team developed three increasingly nuanced versions of the model to compare with human vocal imitations. First, they created a baseline model that simply aimed to generate imitations as similar as possible to real-world sounds, but this model did not match human behavior very well.
The researchers then designed a second, “communicative” model. According to Caren, this model considers what is distinctive about a sound to the listener. For example, you would probably imitate the sound of a motorboat by imitating the noise of its engine, since that is its most distinctive auditory characteristic, even if it is not the loudest aspect of the sound (compared to, say, the lapping of water). This second model created better imitations than the baseline, but the team wanted to improve it even further.
To take their method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different depending on the amount of effort you put into them. It takes time and energy to produce sounds that are perfectly accurate,” says Chandra. The researchers' full model takes this into account by trying to avoid expressions that are too rapid, loud, or high- or low-pitched, which people are less likely to use in conversation. The result: more human-like imitations that closely match many of the decisions humans make when imitating the same sounds.
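One way to picture how these three layers fit together is as a single score for a candidate imitation: reward acoustic fidelity to the target, reward what distinguishes the target from confusable sounds, and penalize articulatory effort. The sketch below is a deliberately simplified assumption; `similarity`, `articulation_cost`, and the weights are placeholders rather than the researchers' actual formulation.

```python
import numpy as np

def similarity(a, b):
    """Placeholder acoustic similarity: negative distance between log spectra."""
    fa, fb = np.log1p(np.abs(np.fft.rfft(a))), np.log1p(np.abs(np.fft.rfft(b)))
    n = min(len(fa), len(fb))
    return -np.linalg.norm(fa[:n] - fb[:n])

def articulation_cost(imitation, sr=16000):
    """Placeholder effort term: louder and higher-pitched signals cost more."""
    spectrum = np.abs(np.fft.rfft(imitation))
    loudness = np.sqrt(np.mean(imitation ** 2))
    centroid = (spectrum * np.fft.rfftfreq(len(imitation), 1 / sr)).sum() / (spectrum.sum() + 1e-9)
    return loudness + centroid / sr

def imitation_score(imitation, target, distractors, w_comm=1.0, w_effort=0.5):
    """Higher is better: sound like the target, stand out from confusable sounds, stay easy to say."""
    fidelity = similarity(imitation, target)                 # layer 1: baseline fidelity
    distinctiveness = fidelity - max(similarity(imitation, d)
                                     for d in distractors)   # layer 2: communicative salience
    effort = articulation_cost(imitation)                    # layer 3: speaker effort
    return fidelity + w_comm * distinctiveness - w_effort * effort
```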
After building this model, the team conducted a behavioral experiment to see whether AI- or human-generated vocal imitations were judged better by human listeners. Notably, participants favored the AI model 25 percent of the time overall, as much as 75 percent for an imitation of a motorboat, and 50 percent for an imitation of a gunshot.
Towards more expressive sound technology
Passionate about technology for music and art, Caren imagines this model could help artists better communicate sounds to computer systems and help filmmakers and other content creators generate AI sounds that are more nuanced for a specific context. It could also allow a musician to quickly search a database of sounds by imitating a noise that is difficult to describe in, say, a text prompt.
Meanwhile, Caren, Chandra and Ma are looking at the implications of their model in other areas, including language development, how babies learn to talk, and even imitation behaviors in birds like parrots and songbirds.
The team still has work to do on the current iteration of their model: it has trouble with some consonants, such as “z,” which led to inaccurate impressions of some sounds, like the buzzing of bees. It also cannot yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Stanford University linguistics professor Robert Hawkins says language is full of onomatopoeia and words that imitate but don't fully replicate the things they describe, like the word “meow,” which only loosely approximates the sound cats make. “The processes that take us from the sound of a real cat to a word like 'meow' reveal a lot about the intricate interplay between physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who was not involved in the CSAIL research. “This model presents an exciting step toward formalizing and testing theories of those processes, demonstrating that both the physical limitations of the human vocal tract and the social pressures of communication are necessary to explain the distribution of vocal imitations.”
Caren, Chandra, and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, an associate professor in the MIT Department of Electrical Engineering and Computer Science, and Joshua Tenenbaum, a professor of Brain and Cognitive Sciences at MIT and a member of the Center for Brains, Minds and Machines. Their work was supported, in part, by the Hertz Foundation and the National Science Foundation. It was presented at SIGGRAPH Asia in early December.