Source Permutation Research Articles

Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers. Audio-only approaches show unsatisfactory performance when the speakers are of the same gender or share similar voice characteristics. This is due to challenges on learning appropriate feature representations for separating voices in single frames and streaming voices across time. Visual signals of speech (e.g., lip movements), if available, can be leveraged to learn better feature representations for separation. In this paper, we propose a novel audio–visual deep clustering model (AVDC) to integrate visual information into the process of learning better feature representations (embeddings) for Time–Frequency (T–F) bin clustering. It employs a two-stage audio–visual fusion strategy where speaker-wise audio–visual T–F embeddings are first computed after the first-stage fusion to model the audio–visual correspondence for each speaker. In the second-stage fusion, audio–visual embeddings of all speakers and audio embeddings calculated by deep clustering from the audio mixture are concatenated to form the final T–F embedding for clustering. Through a series of experiments, the proposed AVDC model is shown to outperform the audio-only deep clustering and utterance-level permutation invariant training baselines and three other state-of-the-art audio–visual approaches. Further analyses show that the AVDC model learns a better T–F embedding for alleviating the source permutation problem across frames. Other experiments show that the AVDC model is able to generalize across different numbers of speakers between training and testing and shows some robustness when visual information is partially missing.

In this paper, we consider an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly optimize the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). Assuming that the subband snapshots are Gaussian-distributed, this MMI criterion reduces to the requirement that the cross-correlation coefficient of the subband outputs of the two GSCs vanishes. We also compare separation performance under the Gaussian assumption with that obtained from several super-Gaussian probability density functions (pdfs), namely, the Laplace and pdfs. Our proposed technique provides effective nulling of the undesired source, but without the signal cancellation problems seen in conventional beamforming. Moreover, our technique does not suffer from the source permutation and scaling ambiguities encountered in conventional blind source separation algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge (SSC). On the SSC development data, the simple delay-and-sum beamformer achieves a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieves a 55.2% WER, which is further reduced to 52.0% with a pdf, whereas the WER for data recorded with a close-talking microphone is 21.6%.

Source Permutation Research Articles

Articles published on Source Permutation

Audio–Visual Deep Clustering for Speech Separation

Listen and Look: Audio–Visual Matching Assisted Speech Source Separation

DEEP ATTRACTOR NETWORK FOR SINGLE-MICROPHONE SPEAKER SEPARATION.

Blind Spatial Subtraction Array for Speech Enhancement in Noisy Environment

Adaptive Beamforming With a Minimum Mutual Information Criterion

Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

On regularity and identifiability of blind source separation under constant-modulus constraints

Blind separation of convolved cyclostationary processes

Optimum quantizers and permutation codes

Permutation codes for sources

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Source Permutation Research Articles

Articles published on Source Permutation

Audio–Visual Deep Clustering for Speech Separation

Listen and Look: Audio–Visual Matching Assisted Speech Source Separation

DEEP ATTRACTOR NETWORK FOR SINGLE-MICROPHONE SPEAKER SEPARATION.

Blind Spatial Subtraction Array for Speech Enhancement in Noisy Environment

Adaptive Beamforming With a Minimum Mutual Information Criterion

Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

On regularity and identifiability of blind source separation under constant-modulus constraints

Blind separation of convolved cyclostationary processes

Optimum quantizers and permutation codes

Permutation codes for sources