Denoising Speech Based on Deep Learning and Wavelet Decomposition

Li Wang, Weiguang Zheng, Xiaojun Ma, Shiming Lin, Yi-Zhang Jiang

doi:10.1155/2021/8677043

Abstract

This work proposes a speech-denoising method based on deep learning. The predictor and target signals of the network are the amplitude spectra of the wavelet-decomposition vectors of the noisy and clean audio signals, respectively, and the network output is the amplitude spectrum of the denoised signal. The regression network is trained on the predictor input to minimize the mean square error between its output and the target. Each denoised wavelet-decomposition vector is transformed back to the time domain using the output amplitude spectrum and the phase of the corresponding noisy wavelet-decomposition vector, and the denoised speech is then obtained by the inverse wavelet transform. The method overcomes the limitation that the time and frequency resolution of the short-time Fourier transform cannot be adjusted. Because the noise energy is reduced gradually during wavelet decomposition, the noise-reduction effect in each frequency band is improved. Experimental results show that the method denoises well across the whole frequency band.
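The processing chain described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: a single-level Haar wavelet stands in for the wavelet decomposition, a plain DFT over each whole sub-band stands in for the spectral analysis, and the hypothetical `predict_magnitude` callback stands in for the trained regression network. All function names are illustrative.

```python
import cmath
import math


def haar_dwt(x):
    """Single-level Haar DWT of an even-length list: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x


def dft(x):
    """Naive O(n^2) discrete Fourier transform (kept simple for clarity)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def denoise(noisy, predict_magnitude):
    """Wavelet-decompose, replace each band's magnitude, keep the noisy phase."""
    cleaned_bands = []
    for band in haar_dwt(noisy):
        spectrum = dft(band)
        magnitude = [abs(c) for c in spectrum]
        phase = [cmath.phase(c) for c in spectrum]
        # Assumption: predict_magnitude plays the role of the trained network.
        new_magnitude = predict_magnitude(magnitude)
        rebuilt = [cmath.rect(m, p) for m, p in zip(new_magnitude, phase)]
        cleaned_bands.append(idft(rebuilt))
    return haar_idwt(cleaned_bands[0], cleaned_bands[1])


def identity_predictor(magnitude):
    """Placeholder 'network' that leaves the magnitude spectrum unchanged."""
    return list(magnitude)
```

With `identity_predictor` the pipeline should return the input signal up to floating-point error, which is a convenient check that the decomposition, magnitude/phase split, and reconstruction are mutually consistent before a real model is substituted.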

Highlights

In real environments, speech signals are inevitably affected by noise from the surrounding environment, transmission media, and electrical noise inside the communication equipment. These interferences greatly degrade the performance of speech processing systems and reduce speech quality.
Several speech-denoising and speech-enhancement methods have been proposed based on the statistical difference between the speech and noise characteristics, including spectral subtraction [1], based estimation [2], Wiener filtering [3], subspace method [4], nonnegative matrix factorization (NMF) [5], and minimum mean square error (MMSE) [6].

Summary

Introduction

Speech signals are inevitably affected by noise from the surrounding environment, transmission media, and electrical noise inside the communication equipment. These interferences greatly degrade the performance of speech processing systems and reduce speech quality. Several speech-denoising and speech-enhancement methods have been proposed based on the statistical difference between the speech and noise characteristics, including spectral subtraction [1], based estimation [2], Wiener filtering [3], subspace method [4], nonnegative matrix factorization (NMF) [5], and minimum mean square error (MMSE) [6]. Because speech and noise are strongly coupled in time and frequency, most of these filtering methods are limited to windowing or masking operations in the frequency domain or time domain, and they struggle to separate signal from noise effectively. Constraints on computing power and on the size of the training data have also led to relatively small neural network implementations, which limits denoising performance.

Methods

Discussion

Conclusion