This paper, from Google Research (now Google AI, I guess), is well-written and clear, so I’ll just confine my notes to giving the context for the problem and what is unique and interesting about the paper’s solution.
This work fits into the category of using deep learning to enhance speech, with possible future applications to hearing assistance. People are generally pretty good at focusing on a single voice in a crowded room where there is background noise and many people talking at once. However, for people with hearing loss this ability rapidly diminishes, so there has long been interest in developing technology to compensate.
The challenge was dubbed “the cocktail-party problem” in 1953 and despite many attempts to solve it since then, success has been limited. Audio enhancements such as amplification and frequency adjustments aren’t effective, and computer algorithms struggle with it.
Some promising results have been achieved in the modern era of deep learning, beginning a couple of years ago with a technique called Deep Clustering[3]. Mixtures of voices could be separated into distinct audio streams, one for each voice. Follow-on work provided improvements, but several limitations remained:
the algorithms operate on clean speech samples that are well recorded with close-up mics
in other words, there is basically no background noise
you sort of have to know how many speakers are in the mixture
the quality of the separated speech is not spectacular
there’s no inherent way to indicate which speaker you want to listen to
This is far from the real-world problem we’d like to solve. Despite impressive progress, it is beginning to feel like some new ingredients are needed. Maybe hardware, maybe a language model that can infer from context like humans do, or maybe a new data source, like video. The video idea seems to have been in the air: three groups working on it independently published papers within a few days of each other [2][1][4].
Looking to Listen
This paper takes a new approach to combining video and audio to advance the state of the art on the various subproblems of the cocktail-party problem:
separating speech from simultaneous talkers (source separation)
suppressing background noise (speech enhancement)
designating which speaker you want to focus on (attention)
The algorithm is not doing lip reading, at least not explicitly. (Other groups are working on this though.) The authors just feed both audio and video data into a network and let it figure out what it needs to do.
It makes sense that video should help. I saw an estimate once that 30% of speech intelligibility comes from visual information. A convincing demonstration of this is the McGurk effect.
Training Data: AVSpeech
Where do you get training data for such a network? This is one of the major contributions of the paper: scraping YouTube and other online sources, filtering it down to usable samples, and extracting features. The result is a data set called AVSpeech that Google promises to release to other researchers.
Although the authors have some nice demos on real-world data, the training data is synthesized from separate, clean sources which also serve as ground truth. The sources are mixed together and noise is added to create a training sample. This solves the problem of acquiring enough labeled training data.
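The mixing recipe can be sketched in a few lines. This is a minimal illustration of the general idea (sum clean sources, add scaled noise); the paper's exact mixing procedure and SNR choices are not reproduced here, and the function name and SNR parameter are my own.

```python
import numpy as np

def make_training_mixture(clean_sources, noise, snr_db=10.0):
    """Synthesize a noisy mixture from clean speech sources.

    clean_sources: array of shape (num_speakers, num_samples)
    noise: array of shape (num_samples,)
    The clean sources double as ground truth for training.
    """
    mix = np.sum(clean_sources, axis=0)
    # Scale the noise so the mixture has (roughly) the target SNR.
    sig_power = np.mean(mix ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return mix + scale * noise
```

Because each sample is constructed from known clean sources, the network can be trained with a simple supervised loss against those sources.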
The video features are interesting. They first pick out all faces in a frame, then use face recognition to pick out one face, and create an embedding based on that face data.
The audio features are standard for speech processing work: a spectrogram, the collection of time-frequency (T-F) bins obtained by segmenting the audio into overlapping time frames and taking a Fourier transform in each frame. A spectrogram by its nature consists of complex numbers. Most algorithms take the magnitude and discard the phase; in this paper, however, the authors keep the full complex values, which gives the best results.
The output of the network is a mask: a set of numbers, one for each T-F bin in the spectrogram, that is applied to the spectrogram by pointwise multiplication. The resulting spectrogram is inverted to produce an audio (time-domain) signal.
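The whole spectrogram-mask-invert pipeline can be sketched with SciPy's STFT routines. This is my stand-in for the paper's front end, not its actual implementation: the toy signal, frame sizes, and the dummy all-pass mask (which stands in for the network's output) are all assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

# A toy one-second "mixture" at 16 kHz (stand-in for real mixed speech).
fs = 16000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 220 * t)

# Complex spectrogram: overlapping frames, Fourier transform per frame.
# spec[i, j] is the complex value of frequency bin i in time frame j.
freqs, times, spec = stft(mixture, fs=fs, nperseg=512, noverlap=384)

# A mask has one (here real) value per T-F bin; an all-pass mask of ones
# stands in for the network's predicted mask.
mask = np.ones(spec.shape)

# Apply the mask pointwise, then invert back to a time-domain signal.
_, separated = istft(mask * spec, fs=fs, nperseg=512, noverlap=384)
```

With an all-pass mask the inversion reconstructs the input almost exactly; a real mask would suppress the T-F bins dominated by interfering speakers or noise.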
The Network
The network is basically an RNN with some bidirectional LSTM (BLSTM) and fully connected layers. The loss function is the squared error between the clean and masked spectrograms.
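The loss is simple to state in code. A minimal sketch, assuming plain squared error on raw complex spectrograms (the paper may apply compression or weighting that I'm not reproducing; the function name is mine):

```python
import numpy as np

def spectrogram_loss(predicted_mask, mixture_spec, clean_spec):
    """Mean squared error between the masked mixture spectrogram and the
    clean ground-truth spectrogram. Spectrograms are complex arrays of
    T-F bins; the mask is applied pointwise."""
    masked = predicted_mask * mixture_spec
    return np.mean(np.abs(masked - clean_spec) ** 2)
```

Because the training mixtures are synthesized from known clean sources, `clean_spec` is always available as ground truth.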
They trained a separate network for each possible number of speakers. This seems like an unfortunate limitation, although they hint that knowing the number of speakers in advance may not ultimately be necessary.
It is trained for 5 million (!) steps, using Adam with a learning rate schedule.
The results are exciting (to me, anyway). Especially impressive are the applications to real-world situations, something not often seen in papers like these. The processing is not real-time, but presumably can get better with further work.
I found it unsurprising, but sad, that they had to code up their own benchmark algorithms <rant>because this area of machine learning is not as open and sharing as it should be</rant>. They beat the benchmark substantially for speech separation tasks.
They didn’t see a lot of improvement on the straight speech enhancement task. This is surprising to me, but they explain it by saying that the frequency content of noise is sufficiently different from speech that video doesn’t add that much new information. I’m not convinced—it depends on the noise, which after all could be a babble of voices.
The speech separation is quite effective on same-gender mixtures, which is noteworthy. Such data is generally quite challenging. The “double Brady” video is particularly interesting because not only is it an M-M mixture, it’s the same male voice that is being separated from itself!
The authors seem a little disappointed in the audio quality. I’m not. The outputs are fairly intelligible (by people with normal hearing). And, as shown by the video transcription demo, the artifacts don’t bother the speech recognition algorithms too much. It’s a good start.
Finally, it is worth noting, as the authors emphasize, that including video in the mix suggests a natural user interface for indicating what speaker you want to attend to: point.
References
[1] Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman: “The Conversation: Deep Audio-Visual Speech Enhancement”, arXiv preprint arXiv:1804.04121, 2018.
[2] Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, Michael Rubinstein: “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation”, arXiv preprint arXiv:1804.03619, 2018.
[3] John R Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe: “Deep clustering: Discriminative embeddings for segmentation and separation”, Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 31–35, 2016.
[4] Andrew Owens, Alexei A Efros: “Audio-Visual Scene Analysis with Self-Supervised Multisensory Features”, arXiv preprint arXiv:1804.03641, 2018.