Abstract
Human speech in real-world environments is typically degraded by background noise. The noise harms perceptual speech quality and intelligibility, which degrades the performance of speech-related applications such as hearing aids and automatic speech recognition systems. It also distorts the original phase of the clean speech and introduces perceptual disturbance, further reducing speech quality. Speech enhancement therefore needs careful treatment in everyday listening environments. In this article, speech enhancement is performed using supervised learning of spectral masking. Deep neural networks (DNNs) and recurrent neural networks (RNNs) are trained to learn the spectral mask from the magnitude spectrograms of the degraded speech. An iterative procedure is adopted as a post-processing step to deal with the noisy phase. In addition, an intelligibility improvement filter incorporates the critical band importance function weights, where higher weights contribute more to intelligibility. Systematic experiments demonstrated that the proposed approaches greatly attenuated the background noise. They also led to large improvements in perceived speech quality and intelligibility, as well as in automatic speech recognition. The experiments use the TIMIT database. STOI is improved by 17.6% over the noisy speech, and SDR and PESQ are improved by 5.22 dB and 19%, respectively, over the noisy speech utterances. These comparisons showed that the proposed speech enhancement approaches outperformed the related speech enhancement methods.
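As a rough illustration of the spectral masking targets that such networks are trained to estimate, the sketch below computes three commonly used ideal masks (ratio, binary and amplitude) from magnitude spectrograms. The formulas follow widely used textbook definitions and may differ from the exact ones in the article; the function names, the `lc_db` threshold and the list-of-frames spectrogram layout are assumptions made for this example.

```python
import math


def ideal_masks(clean_mag, noise_mag, noisy_mag, lc_db=0.0, eps=1e-12):
    """Return (ratio, binary, amplitude) masks for magnitude spectrograms.

    Each argument is a list of frames, and each frame is a list of
    per-frequency-bin magnitudes. `lc_db` is the local SNR criterion (in dB)
    used by the binary mask; its default here is an assumption.
    """
    irm, ibm, iam = [], [], []
    for s_frame, n_frame, y_frame in zip(clean_mag, noise_mag, noisy_mag):
        irm_f, ibm_f, iam_f = [], [], []
        for s, n, y in zip(s_frame, n_frame, y_frame):
            s2, n2 = s * s, n * n
            # Ratio mask: square root of speech energy over total energy.
            irm_f.append(math.sqrt(s2 / (s2 + n2 + eps)))
            # Binary mask: 1 where the local SNR exceeds the criterion.
            local_snr_db = 10.0 * math.log10((s2 + eps) / (n2 + eps))
            ibm_f.append(1.0 if local_snr_db > lc_db else 0.0)
            # Amplitude mask: clean over noisy magnitude (may exceed 1).
            iam_f.append(s / (y + eps))
        irm.append(irm_f)
        ibm.append(ibm_f)
        iam.append(iam_f)
    return irm, ibm, iam


def apply_mask(mask, noisy_mag):
    """Multiply a mask with the noisy magnitude, bin by bin."""
    return [[m * y for m, y in zip(m_frame, y_frame)]
            for m_frame, y_frame in zip(mask, noisy_mag)]
```

In a supervised setup of this kind, masks computed this way would typically serve as training targets, and at test time the mask estimated by the network would be applied to the noisy magnitude with something like `apply_mask`.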
Highlights
From the tables, we observed that the spectral masking-based methods with iterative time-domain speech recovery and the intelligibility improvement filter performed better when applied within the recurrent neural network (RNN) and deep neural network (DNN) frameworks
RNN-ideal ratio mask (IRM)-iSR, RNN-ideal binary mask (IBM)-iSR and RNN-ideal amplitude mask (IAM)-iSR improved the PESQ at −3 dB white noise by factors of 0.85, 0.81 and 0.86 over the noisy speech signal, whereas they improved the PESQ by 2%, 2.04% and 1.01%
The overall signal-to-noise ratio (SNR) and segmental SNR (SSNR) for RNN-iSR and DNN-iSR are higher than those of the competing state-of-the-art methods
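For readers unfamiliar with the two measures named in the last highlight, the following is a minimal sketch of overall SNR and segmental SNR between a clean reference and an enhanced signal. It is not the article's evaluation code: the function names, the non-overlapping frames of 256 samples and the clamping of per-frame values to the range from -10 dB to 35 dB are assumptions based on common practice.

```python
import math


def snr_db(clean, enhanced, eps=1e-12):
    """Overall SNR in dB: clean-signal energy over residual-error energy."""
    signal = sum(c * c for c in clean)
    error = sum((c - e) ** 2 for c, e in zip(clean, enhanced))
    return 10.0 * math.log10((signal + eps) / (error + eps))


def segmental_snr_db(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs, each clamped to [lo, hi] dB.

    Frames do not overlap, and a trailing partial frame is ignored.
    The frame length and clamping range are assumed defaults.
    """
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        values.append(min(hi, max(lo, snr_db(c, e))))
    if not values:
        raise ValueError("signal is shorter than one frame")
    return sum(values) / len(values)
```

Averaging clamped per-frame values keeps silent or perfectly reconstructed frames from dominating the score, which is why segmental SNR is often reported alongside the overall figure.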
Summary
The aforementioned speech enhancement methods suit many real-time speech-related applications because of their low computational complexity, but they perform poorly in many real-world acoustic environments, where they fail to track the power spectral density of highly non-stationary background noise. To overcome this issue, supervised learning-based speech enhancement methods have been adopted and trained with large amounts of training data in the presence of different background noises [9], [10]. Regression, spectral-mapping and spectral masking-based deep neural networks are among the most successful methods in single-channel speech enhancement tasks [11]–[15].