Abstract

This article describes a computationally efficient blind source separation (BSS) method based on the independence, low-rankness, and directivity of the sources. A typical approach to BSS is unsupervised learning of a probabilistic model that consists of a source model representing the time-frequency structure of source images and a spatial model representing their inter-channel covariance structure. Building upon the low-rank source model based on nonnegative matrix factorization (NMF), which has been considered to be effective for inter-frequency source alignment, multichannel NMF (MNMF) assumes source images to follow multivariate complex Gaussian distributions with unconstrained full-rank spatial covariance matrices (SCMs). An effective way of reducing the computational cost and initialization sensitivity of MNMF is to restrict the degrees of freedom of the SCMs. While a variant of MNMF called independent low-rank matrix analysis (ILRMA) severely restricts SCMs to rank-1 matrices under the idealized condition that only directional and less-echoic sources exist, we restrict SCMs to jointly-diagonalizable yet full-rank matrices in a frequency-wise manner, resulting in FastMNMF1. To help inter-frequency source alignment, we then propose FastMNMF2, which shares the directional feature of each source over all frequency bins. To explicitly consider the directivity or diffuseness of each source, we also propose rank-constrained FastMNMF that enables us to individually specify the ranks of SCMs. Our experiments showed the superiority of FastMNMF over MNMF and ILRMA in speech separation and the effectiveness of the rank constraint in speech enhancement.
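As a rough sketch of the idea (not the authors' reference implementation; all shapes, variable names, and the random initialization below are illustrative assumptions), the jointly-diagonalizable parameterization underlying FastMNMF1 can be written in NumPy as follows, with FastMNMF2 corresponding to tying the diagonal entries across frequency:

```python
import numpy as np

# Hypothetical shapes for illustration only.
M, F, N = 4, 257, 3                      # microphones, frequency bins, sources
rng = np.random.default_rng(0)

# Frequency-wise diagonalizers Q_f (shared by ALL sources at frequency f)
# and nonnegative diagonal entries g_nf (the per-source "directional features").
Q = rng.standard_normal((F, M, M)) + 1j * rng.standard_normal((F, M, M))
g = rng.random((N, F, M)) + 1e-6

def scm(n: int, f: int) -> np.ndarray:
    """FastMNMF1-style SCM: G_nf = Q_f^{-1} diag(g_nf) Q_f^{-H}, which is
    full rank yet jointly diagonalizable with the other sources' SCMs."""
    Qf_inv = np.linalg.inv(Q[f])
    return Qf_inv @ np.diag(g[n, f]) @ Qf_inv.conj().T

# In the transformed domain the mixture covariance sum_n lambda_nf * G_nf is
# diagonal, so per-bin inversions cost O(M) instead of O(M^3).
lam = rng.random((N, F))                  # NMF-modeled source powers (one frame)
Y_diag = np.einsum("nf,nfm->fm", lam, g)  # diag(Q_f Y_f Q_f^H)

# FastMNMF2 would tie g over frequency: a (N, M) array replaces g[n, f].
```

The design point this sketch illustrates is that one shared diagonalizer per frequency keeps every SCM full rank while turning the per-bin covariance inversion into elementwise divisions.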

Highlights

  • Multichannel source separation is one of the most fundamental techniques for computational auditory scene analysis including automatic speech recognition and acoustic event detection [1], [2]

  • We found no significant difference between RC-FastMNMF2 and FastMNMF2 (4.1 dB) because the rank-1 assumption on the spatial covariance matrices (SCMs) of speech was violated by the heavy reverberation, which was much longer than the short-time Fourier transform (STFT) window

  • Although two-step RC-FastMNMF1 with K = 4 (4.3 dB) outperformed RC-FastMNMF2 (4.1 dB), the Q estimated by independent low-rank matrix analysis (ILRMA) was considered to be sub-optimal, as discussed in Sections V-F and V-G

Introduction

Multichannel source separation is one of the most fundamental techniques for computational auditory scene analysis, including automatic speech recognition and acoustic event detection [1], [2]. As a front end for these tasks, deep neural networks (DNNs) are often trained by using pairs of mixture and isolated signals [3]–[7]. While such a supervised approach works well in a known environment, it often fails to generalize to an unseen environment [8], [9]. Another approach is blind source separation (BSS) based on unsupervised learning of a probabilistic model that represents a multichannel mixture spectrogram as the sum of multichannel source spectrograms called source images. In a typical spatial model, the time-frequency (TF) bins of each source image are assumed to independently follow multivariate complex Gaussian distributions with spatial covariance matrices (SCMs) [11].
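To make the spatial model above concrete, here is a minimal, non-authoritative sketch (illustrative shapes and names, assuming the standard local Gaussian model) of the mixture log-likelihood, where each TF bin follows a zero-mean multivariate complex Gaussian whose covariance is the weighted sum of the sources' SCMs:

```python
import numpy as np

def log_likelihood(X, lam, G):
    """Log-likelihood of the local Gaussian spatial model (sketch).

    X:   (F, T, M) multichannel mixture STFT
    lam: (N, F, T) nonnegative source power spectral densities
    G:   (N, F, M, M) Hermitian positive-definite SCMs
    """
    F, T, M = X.shape
    # Mixture covariance of each TF bin: Y_ft = sum_n lam_nft * G_nf.
    Y = np.einsum("nft,nfpq->ftpq", lam, G)
    Y_inv = np.linalg.inv(Y)
    # Quadratic form x_ft^H Y_ft^{-1} x_ft for every TF bin.
    quad = np.einsum("ftp,ftpq,ftq->ft", X.conj(), Y_inv, X).real
    _, logdet = np.linalg.slogdet(Y)  # real for Hermitian PD matrices
    return float(np.sum(-M * np.log(np.pi) - logdet - quad))
```

Maximizing this likelihood over the source and spatial parameters is what MNMF and its FastMNMF variants do; the O(M^3) inversion per TF bin visible here is exactly the cost that joint diagonalization removes.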
