Abstract

In this article, we propose a new blind speech extraction (BSE) method that robustly extracts directional speech from background diffuse noise by combining independent low-rank matrix analysis (ILRMA) and efficient rank-constrained spatial covariance matrix (SCM) estimation. To achieve more accurate BSE than ILRMA, which assumes each source to be a point source (rank-1 spatial model), the proposed method restores the lost spatial basis for the full-rank SCM of diffuse noise. We adopt the multivariate complex generalized Gaussian distribution (GGD) as the statistical generative model to express various types of observed signals. To estimate the model parameters for an arbitrary shape parameter of the multivariate GGD, we derive a new inequality for rank-constrained SCMs. We also propose new acceleration methods that achieve much faster extraction than conventional blind source separation methods. In BSE experiments using simulated and real recorded data, we confirm that the proposed method achieves more accurate and faster speech extraction than conventional methods.

Highlights

  • Blind source separation (BSS) [1] is a technique for separating an observed multichannel signal, which is a mixture of multiple sources, into each source without any prior information about the sources or the mixing system

  • Let us denote a multichannel observed signal that is obtained via a short-time Fourier transform (STFT) as $\boldsymbol{x}_{ij} = (x_{ij,1}, \ldots, x_{ij,M})^{\mathsf{T}} \in \mathbb{C}^{M}$, where i = 1, . . . , I, j = 1, . . . , J, and m = 1, . . . , M are the indices of the frequency bins, time frames, and microphones, respectively, and T denotes the transpose

  • On the basis of the above motivation, we propose the following new estimation method for the full-rank spatial covariance matrix (SCM) of diffuse noise: (a) the rank-1 SCM for the directional target speech and rank-(M −1) SCM for diffuse noise are estimated by independent low-rank matrix analysis (ILRMA) and fixed, (b) the lost spatial basis for diffuse noise is restored to estimate noise components in the direction of the target speech, and (c) a multichannel Wiener filter is applied to suppress the noise components remaining in the separated directional target speech
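Steps (b) and (c) above can be sketched for a single time-frequency bin as follows. This is only an illustrative sketch under stated assumptions, not the authors' algorithm: the names `restore_full_rank_scm`, `multichannel_wiener_filter`, `lam`, `r_s`, and `r_n` are hypothetical, the weight `lam` of the restored basis is simply given here rather than estimated, and the lost basis is taken to be the eigenvector of the rank-(M−1) noise SCM with the smallest eigenvalue.

```python
import numpy as np

def restore_full_rank_scm(R_deficient, lam):
    """Add weight `lam` along the null direction of a rank-(M-1) Hermitian matrix.

    Assumption: the "lost spatial basis" is the eigenvector whose eigenvalue
    is (numerically) zero. `lam` is a hypothetical, user-supplied weight.
    """
    eigvals, eigvecs = np.linalg.eigh(R_deficient)  # eigenvalues in ascending order
    b = eigvecs[:, 0]                               # direction missing from R_deficient
    return R_deficient + lam * np.outer(b, b.conj()), b

def multichannel_wiener_filter(x, a, r_s, R_n, r_n=1.0):
    """Estimate the target source image from one observed vector x.

    Target SCM: r_s * a a^H (rank-1); noise SCM: r_n * R_n (full rank).
    """
    R_s = r_s * np.outer(a, a.conj())
    R_x = R_s + r_n * R_n
    return R_s @ np.linalg.solve(R_x, x)

rng = np.random.default_rng(0)
M = 4  # number of microphones

# Hypothetical steering vector of the directional target speech.
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# Hypothetical rank-(M-1) noise SCM, built from M-1 random vectors.
V = rng.standard_normal((M, M - 1)) + 1j * rng.standard_normal((M, M - 1))
R_def = V @ V.conj().T

R_full, b = restore_full_rank_scm(R_def, lam=0.5)

# One observed time-frequency vector (random, for illustration only).
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
s_hat = multichannel_wiener_filter(x, a, r_s=2.0, R_n=R_full)
```

Because the target SCM is rank-1, the filter output is always a scalar multiple of the steering vector `a`; the restored basis only changes how much noise in that direction is attributed to the noise model.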

Summary

INTRODUCTION

Blind source separation (BSS) [1] is a technique for separating an observed multichannel signal, which is a mixture of multiple sources, into each source without any prior information about the sources or the mixing system. These methods assume a rank-1 spatial model; that is, the frequency-wise acoustic path of each source can be represented by a single time-invariant spatial basis, which is often called a steering vector. Under this assumption, the determined BSS problem reduces to the estimation of a demixing matrix for each frequency.

KUBO et al.: BLIND SPEECH EXTRACTION BASED ON RANK-CONSTRAINED SPATIAL COVARIANCE MATRIX ESTIMATION

diffuse noise can cancel the directional target speech in the BSS methods based on the rank-1 spatial model [17], resulting in the accurate estimation of a rank-(M−1) diffuse noise SCM, where M denotes the number of microphones.
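The rank-1 spatial model and the per-frequency demixing matrix described in the Introduction can be illustrated with a small NumPy sketch. It assumes the determined case (as many sources as microphones) and uses randomly generated mixing matrices; all variable names are hypothetical. In an actual BSS method the demixing matrix is estimated blindly from the observation alone, whereas here the oracle inverse of the mixing matrix is used only to show what the model implies.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, M = 3, 5, 2  # frequency bins, time frames, microphones (= sources)

# Rank-1 spatial model: at frequency i, the columns of A[i] are the
# time-invariant steering vectors of the M sources.
A = rng.standard_normal((I, M, M)) + 1j * rng.standard_normal((I, M, M))

# Source spectrogram components s_{ij} (random, for illustration only).
S = rng.standard_normal((I, J, M)) + 1j * rng.standard_normal((I, J, M))

# Observation: x_{ij} = A_i s_{ij} for every frequency bin and time frame.
X = np.einsum("imn,ijn->ijm", A, S)

# Oracle demixing matrix W_i = A_i^{-1} (a BSS method would estimate W_i blindly).
W = np.linalg.inv(A)

# Separated signals: y_{ij} = W_i x_{ij}.
Y = np.einsum("inm,ijm->ijn", W, X)

# The SCM of one source image, a a^H, has rank 1 by construction.
a = A[0][:, 0]
rank_of_source_scm = np.linalg.matrix_rank(np.outer(a, a.conj()))
```

Under this model, one M×M matrix per frequency bin suffices to separate all time frames, which is why the problem reduces to estimating a demixing matrix for each frequency.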

Definitions
MNMF and FastMNMF
Motivation and Strategy
Model and Speech Extraction
Optimization Framework
Generic Inequality and Identity for Rank-Constrained SCM Estimation
MM-Algorithm-Based and ME-Algorithm-Based Update Rules
Motivation
Key Concept
Second-Stage Acceleration
Advantage of Proposed Accelerated Update Rules
Experimental Condition
Comparison Between MM and ME Algorithms
SDR and SCM Behavior Comparison Between Proposed and Conventional Methods
Computational Time Comparison
BSE EXPERIMENT ON REAL RECORDED DATA
CONCLUSION