Adapted from a story by Sarah McQuate | UW News
Anyone who’s been to a concert knows that something magical happens between the performers and their instruments. It transforms music from being just “notes on a page” to a satisfying experience.
A team at the University of Washington Department of Electrical & Computer Engineering (UW ECE) wondered if artificial intelligence could recreate that delight using only visual cues: a silent, top-down video of someone playing the piano. The researchers used machine learning to build a system, called Audeo, that generates audio from silent videos of piano performances. When the group tested Audeo’s output with music-recognition apps, such as SoundHound, the apps correctly identified the piece about 86% of the time. For comparison, these apps identified the piece in the audio tracks from the source videos 93% of the time.
The researchers presented Audeo on December 8 at the NeurIPS 2020 conference.
“To create music that sounds like it could be played in a musical performance was previously believed to be impossible,” said senior author Eli Shlizerman, the Washington Research Foundation Assistant Professor in both the electrical and computer engineering and the applied mathematics departments. “An algorithm needs to figure out the cues, or ‘features,’ in the video frames that are related to generating music, and it needs to ‘imagine’ the sound that’s happening in between the video frames. It requires a system that is both precise and imaginative. The fact that we achieved music that sounded pretty good was a surprise.”
Audeo uses a series of steps to decode what’s happening in the video and then translate it into music. First, it has to detect which keys are pressed in each video frame to create a diagram over time (piano roll). Then it needs to translate that diagram into something that a music synthesizer would actually recognize as a sound a piano would make (MIDI roll). This second step cleans up the data and adds in more information, such as how strongly each key is pressed (note velocity) and for how long (note duration and decay).
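To make the idea of a piano roll more concrete, here is a minimal sketch of how a frame-by-frame record of pressed keys could be collapsed into note events with an onset and a duration. It is an illustration only, not Audeo's method: Audeo's second step is a learned model that also estimates velocity and decay, whereas this toy uses a simple deterministic scan. The function name `piano_roll_to_notes`, the tuple layout, and the tiny three-key roll are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical note event layout: (key_index, onset_frame, duration_in_frames)
Note = Tuple[int, int, int]


def piano_roll_to_notes(roll: List[List[int]]) -> List[Note]:
    """Collapse a binary piano roll (frames x keys) into note events.

    roll[f][k] is 1 if key k is pressed in video frame f, else 0.
    """
    if not roll:
        return []
    num_keys = len(roll[0])
    notes: List[Note] = []
    for key in range(num_keys):
        onset = None  # frame at which the current press of this key began
        for frame, row in enumerate(roll):
            pressed = bool(row[key])
            if pressed and onset is None:
                onset = frame  # key went down
            elif not pressed and onset is not None:
                notes.append((key, onset, frame - onset))  # key came up
                onset = None
        if onset is not None:
            # Key still held when the clip ends: close the note at the last frame.
            notes.append((key, onset, len(roll) - onset))
    # Order by onset time, then by key, as a score would be read.
    notes.sort(key=lambda n: (n[1], n[0]))
    return notes


# Toy roll: 5 frames, 3 keys (a real piano has 88).
roll = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
print(piano_roll_to_notes(roll))
```

A raw scan like this records only which keys were down and for how long. It says nothing about how hard a key was struck, which is part of why the article describes a second step that adds information before synthesis.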
“If we attempt to synthesize music from the first step alone, we would find the quality of the music to be unsatisfactory,” Shlizerman said. “The second step is like how a teacher goes over a student composer’s music and helps enhance it.”
The researchers trained and tested the system using YouTube videos of the pianist Paul Barton. The training set consisted of about 172,000 video frames of Barton playing music from well-known classical composers, such as Bach and Mozart. Then they tested Audeo with almost 19,000 frames of Barton playing different music from these composers and others, such as Scott Joplin.
Once Audeo has generated a transcript of the music, it’s time to give it to a synthesizer that can translate it into sound. Every synthesizer will make the music sound a little different — this is similar to changing the “instrument” or timbre setting on an electric keyboard. For this study, the researchers used two different synthesizers.
“Fluidsynth makes synthesizer piano sounds that we are familiar with. These are somewhat mechanical-sounding but pretty accurate,” Shlizerman said. “We also used PerfNet, a new AI synthesizer that generates richer and more expressive music. But it also generates more noise.”
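As a rough illustration of what a synthesizer does with a note transcript, the sketch below renders note events as decaying sine tones. It is a toy stand-in and does not reflect how Fluidsynth or PerfNet work internally. The function names, the tuple layout, the sample rate, and the decay constant are assumptions chosen for the example. The only standard ingredient is the conversion from MIDI note number to frequency, in which note 69 corresponds to A at 440 Hz.

```python
import math
from typing import List, Tuple

SAMPLE_RATE = 8000  # samples per second; an arbitrary low rate for the sketch

# Hypothetical note layout: (midi_pitch, start_seconds, duration_seconds, velocity 0-127)
NoteEvent = Tuple[int, float, float, int]


def midi_to_hz(pitch: int) -> float:
    """Equal-temperament frequency of a MIDI note number (69 -> 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)


def render(notes: List[NoteEvent], total_seconds: float) -> List[float]:
    """Mix each note into a mono buffer as an exponentially decaying sine tone."""
    samples = [0.0] * int(total_seconds * SAMPLE_RATE)
    for pitch, start, duration, velocity in notes:
        freq = midi_to_hz(pitch)
        amplitude = velocity / 127.0  # louder for harder key presses
        first = int(start * SAMPLE_RATE)
        last = min(len(samples), int((start + duration) * SAMPLE_RATE))
        for i in range(first, last):
            t = (i - first) / SAMPLE_RATE  # time since this note began
            envelope = math.exp(-3.0 * t)  # assumed decay rate, not measured
            samples[i] += amplitude * envelope * math.sin(2 * math.pi * freq * t)
    return samples


# One A4 note held for half a second inside a one-second buffer.
audio = render([(69, 0.0, 0.5, 100)], total_seconds=1.0)
print(len(audio), midi_to_hz(69))
```

Changing the waveform or the envelope here changes the timbre while the notes stay the same, which is the sense in which every synthesizer makes the same transcript sound a little different.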
Audeo was trained and tested only on Paul Barton’s piano videos. Future research is needed to see how well it could transcribe music for any musician or piano, Shlizerman said.
“The goal of this study was to see if artificial intelligence could generate music that was played by a pianist in a video recording — though we were not aiming to replicate Paul Barton because he is such a virtuoso,” Shlizerman said. “We hope that our study enables novel ways to interact with music. For example, one future application is that Audeo can be extended to a virtual piano with a camera recording just a person’s hands. Also, by placing a camera on top of a real piano, Audeo could potentially assist in new ways of teaching students how to play.”
Kun Su and Xiulong Liu, both doctoral students in electrical and computer engineering, are co-authors on this paper. This research was funded by the Washington Research Foundation Innovation Fund as well as the electrical and computer engineering and applied mathematics departments.