Researchers at MIT’s Computer Science and Artificial Intelligence Lab are teaching computers about the relationship between sound and vision. The team has created an artificial intelligence system that can not only predict which sounds go with certain images but also reproduce those sounds itself. Popular Science reports that the deep-learning algorithm is so skilled at re-creating sounds that it can even trick humans, a kind of "Turing Test for sound," as the researchers describe it.
To teach the computer about sound, researchers recorded 1,000 videos of a drumstick hitting, scraping, and tapping different surfaces. All in all, the videos captured some 46,000 sounds. Using those videos, the computer taught itself which sounds matched specific images, learning, for instance, to tell apart the sounds a drumstick makes when hitting a surface, splashing water, rustling leaves, or tapping metal.
To test just how much the computer had learned, researchers presented it with a series of new videos, also of a drumstick tapping different surfaces, with the sound removed. Using the existing dataset of sounds, which researchers dubbed their ‘Greatest Hits,’ the computer created new sounds for the new videos. The computer took tiny sound clips from the original videos and stitched them together to create totally new sound combinations.
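The stitching step can be illustrated with a small sketch. This is not the researchers' code: it assumes, for illustration only, that each stored clip is tagged with a short list of numbers describing what it sounds like, and that the computer has already predicted a similar list for each strike in the silent video. The names `nearest_clip`, `stitch`, `library`, and `predicted`, along with all the numbers, are hypothetical.

```python
import math

def nearest_clip(query, library):
    """Return the waveform of the stored clip whose features are closest to `query`."""
    # Each library item is a (feature_vector, waveform) pair.
    best = min(library, key=lambda item: math.dist(item[0], query))
    return best[1]

def stitch(predicted_features, library):
    """Concatenate the best-matching stored clip for each predicted strike, in order."""
    track = []
    for query in predicted_features:
        track.extend(nearest_clip(query, library))
    return track

# Hypothetical toy library of (feature vector, short waveform) pairs.
library = [
    ((0.0, 0.0), [0.1, 0.2]),       # stand-in for a dull thud
    ((1.0, 1.0), [0.9, -0.9]),      # stand-in for a metallic tap
    ((0.0, 1.0), [0.3, 0.3, 0.3]),  # stand-in for rustling leaves
]

# Hypothetical features predicted for three strikes in a silent video.
predicted = [(0.9, 1.1), (0.1, -0.1), (0.2, 0.8)]

soundtrack = stitch(predicted, library)
print(soundtrack)
```

Every sample in the resulting soundtrack comes from an existing recording; only the selection and ordering of clips is new, which is the sense in which the combinations are "totally new."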
When researchers played the computer-generated sounds for human volunteers, the volunteers were, for the most part, unable to distinguish them from real sounds. In some cases, participants were even more likely to choose the computer’s fake sounds over real ones.
Researchers believe that the technology they’ve created could one day be used to automatically generate sound effects for movies and TV. They also say it can help robots better understand the physical world, learning to distinguish between objects that are soft and hard, or rough and smooth, by the sounds they make.
“A robot could look at a sidewalk and instinctively know that the cement is hard and the grass is soft, and therefore know what would happen if they stepped on either of them,” researcher Andrew Owens explains. “Being able to predict sound is an important first step toward being able to predict the consequences of physical interactions with the world.”