Abstract

Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers. Audio-only approaches perform poorly when the speakers are of the same gender or share similar voice characteristics, owing to the difficulty of learning feature representations that both separate voices within single frames and stream voices consistently across time. Visual signals of speech (e.g., lip movements), when available, can be leveraged to learn better feature representations for separation. In this paper, we propose a novel audio–visual deep clustering model (AVDC) that integrates visual information into the process of learning better feature representations (embeddings) for Time–Frequency (T–F) bin clustering. AVDC employs a two-stage audio–visual fusion strategy: in the first stage, speaker-wise audio–visual T–F embeddings are computed to model the audio–visual correspondence for each speaker; in the second stage, the audio–visual embeddings of all speakers are concatenated with audio embeddings computed by deep clustering from the audio mixture to form the final T–F embedding for clustering. Through a series of experiments, the proposed AVDC model is shown to outperform audio-only deep clustering and utterance-level permutation invariant training baselines, as well as three other state-of-the-art audio–visual approaches. Further analyses show that AVDC learns a T–F embedding that better alleviates the source permutation problem across frames. Additional experiments show that AVDC generalizes across different numbers of speakers between training and testing and retains some robustness when visual information is partially missing.
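To make the two-stage fusion concrete, the sketch below is a minimal, hypothetical PyTorch rendition of the idea described above: the mixture spectrogram is fused with each speaker's lip-movement features to produce speaker-wise audio–visual T–F embeddings (stage 1), which are then concatenated with an audio-only deep-clustering embedding to form the final T–F embedding (stage 2), followed by K-means clustering of the T–F bins. All layer types, sizes, and module names here are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the two-stage audio-visual fusion; sizes and layers are assumptions.
import torch
import torch.nn as nn


class AVDCSketch(nn.Module):
    def __init__(self, n_freq=129, visual_dim=64, emb_dim=40, hidden=300):
        super().__init__()
        # Audio-only deep-clustering branch: mixture spectrogram -> T-F embeddings.
        self.audio_blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
        self.audio_emb = nn.Linear(2 * hidden, n_freq * emb_dim)
        # Stage-1 fusion: mixture audio + one speaker's visual features
        # -> speaker-wise audio-visual T-F embeddings (weights shared across speakers).
        self.av_blstm = nn.LSTM(n_freq + visual_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        self.av_emb = nn.Linear(2 * hidden, n_freq * emb_dim)

    def forward(self, mix_spec, visual_feats):
        # mix_spec: (B, T, F) log-magnitude spectrogram of the mixture
        # visual_feats: (B, S, T, Dv) lip-movement features for S speakers
        B, T, F = mix_spec.shape
        S = visual_feats.shape[1]
        # Audio-only embedding, as in deep clustering.
        a, _ = self.audio_blstm(mix_spec)
        audio_emb = self.audio_emb(a).view(B, T, F, -1)
        # Stage 1: one audio-visual embedding per speaker.
        av_embs = []
        for s in range(S):
            av_in = torch.cat([mix_spec, visual_feats[:, s]], dim=-1)
            h, _ = self.av_blstm(av_in)
            av_embs.append(self.av_emb(h).view(B, T, F, -1))
        # Stage 2: concatenate all speakers' AV embeddings with the audio embedding
        # to form the final T-F embedding used for clustering.
        final_emb = torch.cat(av_embs + [audio_emb], dim=-1)
        return nn.functional.normalize(final_emb, dim=-1)


# Toy usage with random data: cluster the T-F bins of a 2-speaker mixture.
from sklearn.cluster import KMeans

model = AVDCSketch()
mix = torch.randn(1, 100, 129)        # (batch, frames, frequency bins)
lips = torch.randn(1, 2, 100, 64)     # (batch, speakers, frames, visual dim)
emb = model(mix, lips)                # (1, 100, 129, embedding dim)
flat = emb.reshape(-1, emb.shape[-1]).detach().numpy()
labels = KMeans(n_clusters=2, n_init=10).fit_predict(flat)
masks = labels.reshape(100, 129)      # T-F assignment map -> binary separation masks
```

At inference time, as in audio-only deep clustering, each T–F bin's final embedding would be clustered into one cluster per speaker and the resulting assignments used as binary masks to reconstruct the individual voices.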
