Abstract

This paper describes a semi-supervised multichannel speech enhancement method that uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant, independent low-rank matrix analysis (ILRMA), have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace the low-rank speech model with a deep generative speech model, i.e., we formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of the spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and the PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method significantly outperformed MNMF, ILRMA, and the rank-1 version. We also confirmed that the initialization-sensitivity and local-optimum problems of MNMF, which has many spatial parameters, can be solved by incorporating the precise speech model.
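The core of this formulation is a local complex-Gaussian observation model: each multichannel time-frequency observation is zero-mean Gaussian whose covariance is the PSD-weighted sum of the per-source spatial covariance matrices, with the speech PSD supplied by the deep generative model and the noise PSD by NMF. Below is a minimal NumPy sketch of that likelihood under the full-rank spatial model; the array shapes, function name, and loop-based evaluation are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def log_likelihood(X, lambda_s, lambda_n, H_s, H_n):
        """X: (F, T, M) complex STFT of the M-channel noisy mixture.
        lambda_s, lambda_n: (F, T) nonnegative PSDs of speech and noise
        (speech PSD from the DNN decoder, noise PSD from NMF).
        H_s, H_n: (F, M, M) Hermitian full-rank spatial covariance matrices."""
        F_, T_, M = X.shape
        ll = 0.0
        for f in range(F_):
            for t in range(T_):
                # Mixture covariance at bin (f, t): PSD-weighted sum of the
                # per-source spatial covariance matrices.
                Y = lambda_s[f, t] * H_s[f] + lambda_n[f, t] * H_n[f]
                x = X[f, t]
                # Complex Gaussian log-density, dropping the -M*log(pi) constant.
                _, logdet = np.linalg.slogdet(Y)
                ll += -logdet - np.real(x.conj() @ np.linalg.solve(Y, x))
        return ll

    # Toy usage with identity spatial covariances and unit PSDs.
    rng = np.random.default_rng(0)
    F_, T_, M = 4, 5, 2
    X = rng.standard_normal((F_, T_, M)) + 1j * rng.standard_normal((F_, T_, M))
    H = np.broadcast_to(np.eye(M), (F_, M, M)).copy()
    print(log_likelihood(X, np.ones((F_, T_)), np.ones((F_, T_)), H, H))

In the rank-1 case, each spatial covariance matrix is instead constrained to the outer product of a steering vector with its conjugate transpose, which reduces the number of spatial parameters at the cost of modeling flexibility in reverberant conditions.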

Highlights

  • Speech enhancement plays a vital role for automatic speech recognition (ASR) in noisy environments

  • Our model can work even in a single-channel scenario without spatial information [22]. This is a noticeable advantage of the proposed method over multichannel nonnegative matrix factorization (MNMF), which relies heavily on the spatial covariance matrices G for speech enhancement

  • This paper presented a semi-supervised multichannel speech enhancement method that integrates a deep neural network (DNN)-based generative model of speech spectra, a nonnegative matrix factorization (NMF)-based generative model of noise spectra, and a full-rank or rank-1 spatial model in a unified probabilistic model (a minimal sketch of such a DNN speech model follows this list)
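
To make the first component concrete, here is a minimal sketch of the kind of DNN-based generative speech model the highlights describe: a variational autoencoder over clean-speech power spectra whose decoder outputs a speech PSD, trained by maximizing the evidence lower bound (ELBO). The layer sizes, names, and the Itakura-Saito-style reconstruction term (matching a complex-Gaussian observation model) are assumptions for illustration, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class SpeechVAE(nn.Module):
        def __init__(self, n_freq=257, n_latent=16):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(n_freq, 128), nn.Tanh())
            self.mu = nn.Linear(128, n_latent)
            self.logvar = nn.Linear(128, n_latent)
            # The decoder outputs a log-PSD for each frequency bin.
            self.dec = nn.Sequential(nn.Linear(n_latent, 128), nn.Tanh(),
                                     nn.Linear(128, n_freq))

        def forward(self, log_pow):
            h = self.enc(log_pow)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            return self.dec(z), mu, logvar

    def neg_elbo(model, pow_spec):
        """Negative ELBO: Itakura-Saito reconstruction term + KL to N(0, I)."""
        log_lam, mu, logvar = model(torch.log(pow_spec + 1e-8))
        recon = (pow_spec / torch.exp(log_lam) + log_lam).sum(dim=-1)
        kl = 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum(dim=-1)
        return (recon + kl).mean()

    # Toy training step on random stand-ins for clean-speech power spectra.
    model = SpeechVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = neg_elbo(model, torch.rand(8, 257))
    loss.backward()
    opt.step()

At enhancement time, the decoder serves as the speech prior: its latent variables are estimated jointly with the NMF noise parameters and the spatial covariance matrices under the maximum-likelihood criterion described in the abstract.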



Introduction

Speech enhancement plays a vital role for automatic speech recognition (ASR) in noisy environments. Although the performance and robustness of ASR have been drastically improved thanks to the development of deep learning techniques, ASR in unseen noisy environments that are not covered by training data is still an open problem.
