Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers. Audio-only approaches show unsatisfactory performance when the speakers are of the same gender or share similar voice characteristics. This is due to challenges in learning appropriate feature representations both for separating voices within single frames and for streaming voices across time. Visual signals of speech (e.g., lip movements), if available, can be leveraged to learn better feature representations for separation. In this paper, we propose a novel audio–visual deep clustering model (AVDC) that integrates visual information into the process of learning better feature representations (embeddings) for Time–Frequency (T–F) bin clustering. It employs a two-stage audio–visual fusion strategy: in the first stage, speaker-wise audio–visual T–F embeddings are computed to model the audio–visual correspondence for each speaker; in the second stage, the audio–visual embeddings of all speakers are concatenated with audio embeddings calculated by deep clustering from the audio mixture to form the final T–F embedding for clustering. Through a series of experiments, the proposed AVDC model is shown to outperform the audio-only deep clustering and utterance-level permutation invariant training baselines as well as three other state-of-the-art audio–visual approaches. Further analyses show that the AVDC model learns a better T–F embedding for alleviating the source permutation problem across frames. Additional experiments show that the AVDC model generalizes to mismatched numbers of speakers between training and testing and exhibits some robustness when visual information is partially missing.
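To make the second-stage fusion concrete, the following is a minimal sketch of the embedding concatenation and T–F bin clustering step. It is not the paper's implementation: the array shapes, the random placeholder embeddings standing in for network outputs, and the use of K-means from scikit-learn are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical shapes: T frames, F frequency bins, D embedding dims, S speakers.
T, F, D, S = 100, 129, 20, 2

# Placeholder embeddings standing in for network outputs (not the paper's model):
# an audio-only deep-clustering embedding of the mixture (one D-dim vector per T-F bin)
# and one speaker-wise audio-visual embedding per speaker from the first-stage fusion.
audio_emb = np.random.randn(T, F, D)                    # (T, F, D)
av_embs = [np.random.randn(T, F, D) for _ in range(S)]  # S x (T, F, D)

# Second-stage fusion as described in the abstract: concatenate all speaker-wise
# audio-visual embeddings with the audio embedding to form the final T-F embedding.
final_emb = np.concatenate([audio_emb] + av_embs, axis=-1)  # (T, F, (S + 1) * D)

# Cluster T-F bins into S groups; each cluster yields a binary mask for one speaker,
# which can be applied to the mixture spectrogram to recover that speaker's voice.
points = final_emb.reshape(T * F, -1)
labels = KMeans(n_clusters=S, n_init=10).fit_predict(points)
masks = [(labels == s).reshape(T, F).astype(float) for s in range(S)]
```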