Abstract

Source permutation, i.e., the assignment of separated signal snippets to the wrong sources over time, is a major issue in state-of-the-art speaker-independent speech source separation methods. In addition to auditory cues, humans also leverage visual cues to solve this problem at cocktail parties: matching lip movements with voice fluctuations helps listeners attend to the speaker of interest. In this letter, we propose an audio–visual matching network to learn the correspondence between voice fluctuations and lip movements. We then propose a framework that applies this network to address the source permutation problem and improve over audio-only speech separation methods. The modular design of this framework makes it easy to apply the matching network to any audio-only speech separation method. Experiments on two-talker mixtures show that the proposed approach significantly improves separation quality over the state-of-the-art audio-only method. This improvement is especially pronounced on mixtures on which the audio-only method fails, where the speakers often have similar voice characteristics.
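
To make the permutation-fixing idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes per-block embeddings already produced by some audio encoder and some lip-movement encoder (the function and variable names, the cosine-similarity score, and the random toy inputs are all assumptions for illustration), and it swaps the two separated outputs in a block whenever the crossed audio–visual assignment matches better than the direct one.

```python
import numpy as np


def cosine(a, b):
    # Cosine similarity used here as a stand-in matching score.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def fix_permutation(audio_emb, visual_emb):
    """Decide, block by block, whether the two separated outputs are swapped.

    audio_emb:  (blocks, 2, dim) embeddings of the two separated snippets.
    visual_emb: (blocks, 2, dim) lip-movement embeddings of the two speakers.
    Returns a list of flags: 1 means the outputs in that block should be swapped.
    """
    swaps = []
    for a, v in zip(audio_emb, visual_emb):
        keep = cosine(a[0], v[0]) + cosine(a[1], v[1])
        swap = cosine(a[0], v[1]) + cosine(a[1], v[0])
        swaps.append(int(swap > keep))
    return swaps


# Toy usage: random vectors stand in for real learned embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 2, 16))
video = rng.normal(size=(4, 2, 16))
print(fix_permutation(audio, video))
```

Because the decision is made purely from a learned audio–visual correspondence score, a module of this kind can be attached after any audio-only separator, which is the modularity the abstract highlights.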
