Ideal ratio mask estimation using supervised DNN approach for target speech signal enhancement

Poovarasan Selvaraj, E. Chandra

doi:10.3233/jifs-211236

Open Access

Abstract

The most challenging task in recent Speech Enhancement (SE) systems is removing non-stationary noise and additive white Gaussian noise in real-time applications. Several previously proposed SE techniques failed to eliminate noise from speech signals in real-time scenarios because of their high resource utilization. To reduce this difficulty, a Sliding Window Empirical Mode Decomposition including a Variant of Variational Model Decomposition and Hurst (SWEMD-VVMDH) technique was developed. However, it is a statistical framework whose computations take a long time. Hence, in this article, the SWEMD-VVMDH technique is extended with a Deep Neural Network (DNN) that efficiently learns the speech signals decomposed via SWEMD-VVMDH to achieve SE. First, the noisy speech signals are decomposed into Intrinsic Mode Functions (IMFs) by the SWEMD Hurst (SWEMDH) technique. Then, Time-Delay Estimation (TDE)-based VVMD is performed on the IMFs to select the most relevant IMFs according to the Hurst exponent and to reduce both the low- and high-frequency noise components in the speech signal. For each signal frame, the target features are chosen and fed to the DNN, which learns these features to estimate the Ideal Ratio Mask (IRM) in a supervised manner. The abilities of the DNN are enhanced with respect to the category of background noise and the Signal-to-Noise Ratio (SNR) of the speech signals. The noise category and SNR dimensions are chosen for training and testing multiple DNNs, since these dimensions are often taken into account in SE systems. Further, the IRMs in each frequency channel for all noisy signal samples are concatenated to reconstruct the noiseless speech signal. Finally, the experimental outcomes exhibit considerable improvement in SE under different categories of noise.

Highlights

In the globalized era, the most essential requirement for Speech Enhancement (SE) is rejecting microphone interference caused by noise in the speech utterances
Time-Delay Estimation (TDE)-based VVMD is performed on the Intrinsic Mode Functions (IMFs) to select the most relevant IMFs according to the Hurst exponent and to reduce both the low- and high-frequency noise components in the speech signal

Summary

INTRODUCTION

The most essential requirement for SE is rejecting microphone interference caused by noise in the speech utterances. The number of filtering iterations can be adjusted based on the sampling rate, the assessed signal, its consistency, and the frequency band. This was done by decomposing the signal windows according to the usual method and calculating the actual number of filtering iterations. The window sliding process was achieved by extracting the features related to the window after obtaining the mode in the current iteration. These features were accumulated in the IMFs array with a suitable time index. These are statistical frameworks whose computations take a long time. Advanced deep learning techniques are therefore needed to reduce the computational difficulty and improve the speech signal quality efficiently.

LITERATURE SURVEY

Speech Signals Decomposition using SWEMD-VVMDH Technique

Feature Extraction and Labeling
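
The supervised training target named in the abstract is the Ideal Ratio Mask (IRM). The text available here does not give its formula, so the following is only a minimal sketch that assumes the commonly used square-root energy-ratio definition, IRM = sqrt(|S|^2 / (|S|^2 + |N|^2)), computed per time-frequency bin. The function names, the frame length and hop size, and the synthetic tone-plus-noise signals are all illustrative assumptions and are not taken from the article.

```python
import numpy as np


def stft_mag(x, frame_len=256, hop=128):
    """Magnitude spectrogram from Hann-windowed frames (frames x bins).

    frame_len and hop are illustrative values, not the article's settings.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def ideal_ratio_mask(clean, noise, eps=1e-12):
    """Assumed IRM definition: sqrt(S^2 / (S^2 + N^2)) per time-frequency bin."""
    s2 = stft_mag(clean) ** 2
    n2 = stft_mag(noise) ** 2
    # eps guards against division by zero in silent bins
    return np.sqrt(s2 / (s2 + n2 + eps))


# Synthetic stand-ins for a clean utterance and background noise (assumption)
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noise = 0.3 * rng.standard_normal(t.size)

# Training label for each frame: one mask value per frequency channel
irm = ideal_ratio_mask(clean, noise)

# Applying the mask to the noisy magnitude attenuates noise-dominated bins
enhanced_mag = irm * stft_mag(clean + noise)
```

In a supervised setup of this kind, each row of `irm` would serve as the label for the corresponding frame's features, and at test time the DNN's estimated mask would replace the ideal one before resynthesis.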

Findings

CONCLUSION