Abstract
This article describes a computationally efficient blind source separation (BSS) method based on the independence, low-rankness, and directivity of the sources. A typical approach to BSS is unsupervised learning of a probabilistic model that consists of a source model representing the time-frequency structure of source images and a spatial model representing their inter-channel covariance structure. Building upon the low-rank source model based on nonnegative matrix factorization (NMF), which has been considered effective for inter-frequency source alignment, multichannel NMF (MNMF) assumes source images to follow multivariate complex Gaussian distributions with unconstrained full-rank spatial covariance matrices (SCMs). An effective way of reducing the computational cost and initialization sensitivity of MNMF is to restrict the degrees of freedom of the SCMs. While a variant of MNMF called independent low-rank matrix analysis (ILRMA) severely restricts the SCMs to rank-1 matrices under the idealized condition that only directional and less-echoic sources exist, we restrict the SCMs to jointly diagonalizable yet full-rank matrices in a frequency-wise manner, resulting in FastMNMF1. To help inter-frequency source alignment, we then propose FastMNMF2, which shares the directional feature of each source over all frequency bins. To explicitly consider the directivity or diffuseness of each source, we also propose rank-constrained FastMNMF, which enables us to individually specify the ranks of the SCMs. Our experiments showed the superiority of FastMNMF over MNMF and ILRMA in speech separation and the effectiveness of the rank constraint in speech enhancement.
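To make the parameterization above concrete, the following minimal NumPy sketch builds the jointly diagonalizable SCM structure together with the NMF source model; all shapes and variable names (Q, g, W, H, lam) are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np

# Minimal sketch of the FastMNMF parameterization (shapes and names are
# illustrative assumptions): F frequency bins, T frames, M channels,
# N sources, K NMF bases.
F, T, M, N, K = 64, 100, 4, 2, 8
rng = np.random.default_rng(0)

# Low-rank NMF source model: PSD lam[n, f, t] = (W[n] @ H[n])[f, t].
W = rng.random((N, F, K))                  # spectral bases
H = rng.random((N, K, T))                  # temporal activations
lam = np.einsum('nfk,nkt->nft', W, H)

# FastMNMF1: one diagonalizer Q[f] per frequency shared by all sources,
# so every SCM is jointly diagonalizable yet full rank:
#   G[n, f] = Q[f]^-1 diag(g[n, f]) Q[f]^-H.
Q = rng.standard_normal((F, M, M)) + 1j * rng.standard_normal((F, M, M))
g = rng.random((N, F, M))                  # nonnegative diagonal elements

# FastMNMF2 would instead share each source's directional feature across
# frequencies, i.e. use g[n] in place of g[n, f] (our reading of the abstract).
Qinv = np.linalg.inv(Q)
G = np.einsum('fmi,nfi,fki->nfmk', Qinv, g, Qinv.conj())
```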
Highlights
Multichannel source separation is one of the most fundamental techniques for computational auditory scene analysis, with applications including automatic speech recognition and acoustic event detection [1], [2].
We found no significant difference between RC-FastMNMF2 and FastMNMF2 (4.1 dB) because the rank-1 assumption on the spatial covariance matrices (SCMs) of speech was violated by the heavy reverberation, which was much longer than the short-time Fourier transform (STFT) window size.
Two-step RC-FastMNMF1 with K = 4 (4.3 dB) outperformed RC-FastMNMF2 (4.1 dB), although the diagonalizers Q estimated by independent low-rank matrix analysis (ILRMA) were considered to be sub-optimal, as discussed in Sections V-F and V-G.
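The rank constraint referred to above can be made concrete in the diagonalized domain: since each SCM has the form Q^-1 diag(g) Q^-H, its rank equals the number of nonzero entries of g. The following toy NumPy snippet, whose names and values are purely illustrative assumptions, sketches this idea.

```python
import numpy as np

M = 4                                   # number of channels
rng = np.random.default_rng(0)
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))

def rank_constrained_scm(Q, g):
    """SCM = Q^-1 diag(g) Q^-H; rank(SCM) = number of nonzero entries of g."""
    Qinv = np.linalg.inv(Q)
    return Qinv @ np.diag(g) @ Qinv.conj().T

g_directional = np.array([1.0, 0.0, 0.0, 0.0])   # rank-1: directional, less echoic
g_diffuse = rng.random(M)                        # full rank: diffuse / reverberant

print(np.linalg.matrix_rank(rank_constrained_scm(Q, g_directional)))  # -> 1
print(np.linalg.matrix_rank(rank_constrained_scm(Q, g_diffuse)))      # -> 4
```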
Summary
Multichannel source separation is one of the most fundamental techniques for computational auditory scene analysis, with applications including automatic speech recognition and acoustic event detection [1], [2]. As a front end of these tasks, deep neural networks (DNNs) are often trained by using pairs of mixture and isolated signals [3]–[7]. While such a supervised approach works well in a known environment, it often fails to generalize to an unseen environment [8], [9]. Another approach is blind source separation (BSS), based on unsupervised learning of a probabilistic model that represents a multichannel mixture spectrogram as the sum of multichannel source spectrograms called source images. In a typical spatial model, the time-frequency (TF) bins of each source image are assumed to independently follow multivariate complex Gaussian distributions with spatial covariance matrices (SCMs) [11].
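As a concrete illustration of this local Gaussian model, the following NumPy sketch (function name and array shapes are our assumptions) builds the mixture covariance Y[f, t] = sum_n lam[n, f, t] G[n, f] and recovers each source image with the standard multichannel Wiener filter.

```python
import numpy as np

def wiener_images(x, lam, G):
    """Multichannel Wiener filtering under the local Gaussian model.

    x:   (F, T, M)    observed mixture STFT
    lam: (N, F, T)    source PSDs (e.g., from the NMF source model)
    G:   (N, F, M, M) source SCMs
    """
    # Mixture covariance: Y[f, t] = sum_n lam[n, f, t] * G[n, f].
    Y = np.einsum('nft,nfmk->ftmk', lam, G)
    # Solve Y[f, t]^-1 x[f, t] for every TF bin at once.
    Yinv_x = np.linalg.solve(Y, x[..., None])[..., 0]
    # Posterior mean of each image: lam[n, f, t] * G[n, f] @ Y[f, t]^-1 @ x[f, t].
    return np.einsum('nft,nfmk,ftk->nftm', lam, G, Yinv_x)
```

With the lam and G built as in the earlier sketch, wiener_images(x, lam, G) returns an (N, F, T, M) array of estimated source images whose sum over the source axis reconstructs the observed mixture x exactly, a standard property of the multichannel Wiener filter.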