Abstract

Source permutation, i.e., the assignment of separated signal snippets to the wrong sources over time, is a major issue in state-of-the-art speaker-independent speech source separation methods. In addition to auditory cues, humans also leverage visual cues to solve this problem at cocktail parties: matching lip movements with voice fluctuations helps listeners attend to the speaker of interest. In this letter, we propose an audio–visual matching network to learn the correspondence between voice fluctuations and lip movements. We then propose a framework that applies this network to address the source permutation problem and improve over audio-only speech separation methods. The modular design of this framework makes it easy to apply the matching network to any audio-only speech separation method. Experiments on two-talker mixtures show that the proposed approach significantly improves separation quality over the state-of-the-art audio-only method. This improvement is especially pronounced on mixtures on which the audio-only method fails, where the speakers often have similar voice characteristics.
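
To make the permutation-fixing idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes per-block embeddings already produced by some audio encoder and some lip-movement encoder (the function and variable names, the cosine-similarity score, and the random toy inputs are all assumptions for illustration), and it swaps the two separated outputs in a block whenever the crossed audio–visual assignment matches better than the direct one.

```python
import numpy as np


def cosine(a, b):
    # Cosine similarity used here as a stand-in matching score.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def fix_permutation(audio_emb, visual_emb):
    """Decide, block by block, whether the two separated outputs are swapped.

    audio_emb:  (blocks, 2, dim) embeddings of the two separated snippets.
    visual_emb: (blocks, 2, dim) lip-movement embeddings of the two speakers.
    Returns a list of flags: 1 means the outputs in that block should be swapped.
    """
    swaps = []
    for a, v in zip(audio_emb, visual_emb):
        keep = cosine(a[0], v[0]) + cosine(a[1], v[1])
        swap = cosine(a[0], v[1]) + cosine(a[1], v[0])
        swaps.append(int(swap > keep))
    return swaps


# Toy usage: random vectors stand in for real learned embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 2, 16))
video = rng.normal(size=(4, 2, 16))
print(fix_permutation(audio, video))
```

Because the decision is made purely from a learned audio–visual correspondence score, a module of this kind can be attached after any audio-only separator, which is the modularity the abstract highlights.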
