Disentangled Feature Learning for Noise-Invariant Speech Enhancement

Soo Hyun Bae, Nam Soo Kim, Inkyu Choi

doi:10.3390/app9112289

Abstract

Most recently proposed deep learning-based speech enhancement techniques have focused on designing neural network architectures that are treated as a black box. However, it is often beneficial to understand what kinds of hidden representations the model has learned. Since real-world speech data are drawn from a generative process involving multiple entangled factors, disentangling the speech factor can help the trained model achieve better speech enhancement performance. Building on the recent success of neural networks in learning disentangled representations, we explore a framework for disentangling speech and noise, which has not been exploited in conventional speech enhancement algorithms. In this work, we propose a novel noise-invariant speech enhancement method that manipulates the latent features to distinguish between the speech and noise features in the intermediate layers using an adversarial training scheme. To compare the performance of the proposed method with conventional algorithms, we conducted experiments in both matched and mismatched noise conditions using the TIMIT and TSPspeech datasets. Experimental results show that our model successfully disentangles the speech and noise latent features. Consequently, the proposed model not only achieves better enhancement performance but also is more robust and noise-invariant than conventional speech enhancement techniques.

Highlights

Speech enhancement techniques aim to improve the quality and intelligibility of speech degraded by additive background noise
The TIMIT database contains recordings of 630 English speakers, each of whom reads 10 sentences
We proposed a novel speech enhancement method in which speech and noise latent features were disentangled via adversarial learning

Summary

Introduction

Speech enhancement techniques aim to improve the quality and intelligibility of speech degraded by additive background noise. In a variety of applications, speech enhancement is considered an essential pre-processing step. It can be employed directly to improve the quality of mobile communications [1] in noisy environments or to enhance speech signals for hearing aid devices [2,3] before amplification. Speech enhancement has also been widely used as a pre-processing technique in automatic speech recognition (ASR) [4,5] and speaker recognition systems [6] to make their performance more robust. Speech enhancement approaches based on least mean square adaptive filtering (LMSAF) aim to approach the filtering performance of the Wiener filter.
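To make the classical baseline mentioned above concrete, the following is a minimal sketch of least mean square (LMS) adaptive noise cancellation in plain Python. It is not the method proposed in this paper, which relies on adversarial disentanglement of latent features. The sketch assumes that a noise-only reference signal is available, and the function names, the 2-tap noise path, the step size, and the synthetic sinusoid standing in for speech are all hypothetical choices made for illustration.

```python
import math
import random


def lms_noise_canceller(primary, reference, num_taps=4, step_size=0.01):
    """Adaptive noise cancellation with the LMS update rule.

    primary:   noisy signal (clean signal plus noise correlated with `reference`)
    reference: noise-only reference signal (an assumption of this sketch)
    Returns the error signal, used as the enhanced output, and the final weights.
    """
    weights = [0.0] * num_taps
    enhanced = []
    for n in range(len(primary)):
        # Most recent `num_taps` reference samples, zero-padded at the start.
        window = [reference[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        noise_estimate = sum(w * x for w, x in zip(weights, window))
        error = primary[n] - noise_estimate
        # LMS step: move the weights along the instantaneous gradient estimate.
        weights = [w + 2.0 * step_size * error * x for w, x in zip(weights, window)]
        enhanced.append(error)
    return enhanced, weights


def mean_square_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
num_samples = 4000
# Synthetic stand-in for clean speech: a slow sinusoid.
clean = [math.sin(2.0 * math.pi * 0.01 * n) for n in range(num_samples)]
noise = [random.gauss(0.0, 1.0) for _ in range(num_samples)]
# Hypothetical noise path: a 2-tap FIR filter between the noise source and the microphone.
noise_path = [0.8, -0.3]
noisy = [
    clean[n] + noise_path[0] * noise[n] + (noise_path[1] * noise[n - 1] if n >= 1 else 0.0)
    for n in range(num_samples)
]

enhanced, weights = lms_noise_canceller(noisy, noise)

# Compare errors on the last 1000 samples, after the filter has had time to adapt.
mse_before = mean_square_error(noisy[-1000:], clean[-1000:])
mse_after = mean_square_error(enhanced[-1000:], clean[-1000:])
print(f"MSE before: {mse_before:.3f}, MSE after: {mse_after:.3f}")
```

For a stationary input and a sufficiently small step size, the LMS weights converge in the mean toward the Wiener solution, which is why such filters are usually discussed relative to the Wiener filter. The sketch also shows the main limitation of this family: it needs a separate noise reference, which single-channel speech enhancement does not have.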

Methods

Results

Conclusion