Abstract
Due to their simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. To improve the performance of an E2E model, the local and sequential properties of speech should be efficiently taken into account during modeling. However, in most current E2E models for SE, these properties are either not fully considered or too complex to realize. In this letter, we propose an efficient E2E SE model, termed WaveCRN. Compared with models based on convolutional neural networks (CNN) or long short-term memory (LSTM), WaveCRN uses a CNN module to capture the locality features of speech and a stacked simple recurrent units (SRU) module to model the sequential property of those locality features. Unlike conventional recurrent neural networks and LSTM, the calculations of SRU can be efficiently parallelized, and the model requires even fewer parameters. To more effectively suppress the noise components in noisy speech, we derive a novel restricted feature masking approach that performs enhancement on the feature maps in the hidden layers; this differs from the common practice in speech separation of applying an estimated ratio mask to the noisy spectral features. Experimental results on speech denoising and compressed speech restoration tasks confirm that, with the SRU and the restricted feature mask, WaveCRN performs comparably to other state-of-the-art approaches with notably reduced model complexity and inference time.
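To make the described pipeline concrete, below is a minimal PyTorch sketch of a WaveCRN-style enhancer: a 1-D convolutional encoder extracts local feature maps from the raw waveform, a bidirectional recurrent module models their sequential structure, a restricted (bounded) mask is applied to the hidden feature maps, and a transposed-convolution decoder reconstructs the waveform. All names, layer sizes, and the tanh bound on the mask are illustrative assumptions, not the authors' exact configuration; nn.LSTM is used here only as a runnable stand-in for the stacked bidirectional SRU layers the paper actually uses (available, e.g., via the `sru` PyPI package).

```python
import torch
import torch.nn as nn

class WaveCRNSketch(nn.Module):
    """Minimal sketch of a WaveCRN-style E2E enhancer (illustrative only)."""

    def __init__(self, channels=256, kernel_size=64, stride=32, rnn_hidden=256):
        super().__init__()
        # 1-D conv encoder: captures the local (frame-like) structure of the waveform.
        self.encoder = nn.Conv1d(1, channels, kernel_size,
                                 stride=stride, padding=kernel_size // 2)
        # Recurrent module over the feature-map time axis. The paper stacks
        # bidirectional SRU layers here; nn.LSTM is a stand-in for this sketch.
        self.rnn = nn.LSTM(channels, rnn_hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # Projects RNN outputs to a per-channel mask over the feature maps.
        self.mask_proj = nn.Linear(2 * rnn_hidden, channels)
        # Transposed conv decoder: maps masked feature maps back to a waveform.
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel_size,
                                          stride=stride, padding=kernel_size // 2)

    def forward(self, wav):                       # wav: (batch, samples)
        x = self.encoder(wav.unsqueeze(1))        # (batch, channels, frames)
        r, _ = self.rnn(x.transpose(1, 2))        # (batch, frames, 2*rnn_hidden)
        # "Restricted" feature mask: bounded (here via tanh, an assumption) and
        # applied to the hidden feature maps rather than to a spectrogram.
        mask = torch.tanh(self.mask_proj(r)).transpose(1, 2)
        return self.decoder(x * mask).squeeze(1)  # (batch, samples)

# Usage: enhance a batch of 1-second, 16 kHz noisy waveforms.
model = WaveCRNSketch()
noisy = torch.randn(4, 16000)
enhanced = model(noisy)
```

Masking the encoder's feature maps rather than a magnitude spectrogram is what keeps the pipeline fully end-to-end: no STFT analysis/synthesis step is needed, and the mask bound keeps the element-wise product from blowing up waveform-domain features that can be negative.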
Highlights
Speech-related applications, such as automatic speech recognition (ASR), voice communication, and assistive hearing devices, play an important role in modern society.
We propose an E2E waveform-mapping-based speech enhancement (SE) method using an alternative CRN, termed WaveCRN, which combines the advantages of convolutional neural networks (CNN) and recurrent neural networks (RNN).
We aim to show that simple recurrent units (SRU) are superior to long short-term memory (LSTM) in terms of denoising capability and computational efficiency when applied to waveform-based SE.
Summary
Speech-related applications, such as automatic speech recognition (ASR), voice communication, and assistive hearing devices, play an important role in modern society. Most of these applications are not robust in noisy conditions. One class of SE systems carries out enhancement on frequency-domain acoustic features; these are generally called spectral-mapping-based SE approaches. In these approaches, speech signals are analyzed and reconstructed using the short-time Fourier transform (STFT) and inverse STFT, respectively [9]–[13]. Deep learning models, such as the fully connected deep denoising autoencoder [3], convolutional neural networks (CNNs) [14], and recurrent neural networks (RNNs) and long short-term memory (LSTM) [15], [16], are used as a transformation function to convert noisy spectral features into clean ones.
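For contrast with the waveform-domain approach above, the following sketch shows the conventional spectral-mapping pipeline this paragraph describes: analyze the noisy waveform with the STFT, enhance the magnitude spectrogram with some model, and resynthesize with the inverse STFT using the noisy phase. The function name, the window/FFT parameters, and the identity "enhancer" are illustrative assumptions; any of the cited model families (DAE, CNN, RNN/LSTM) could fill the enhancement role.

```python
import torch

def spectral_mapping_enhance(noisy_wav, enhance_mag, n_fft=512, hop=128):
    """Generic spectral-mapping SE pipeline: STFT -> enhance magnitude -> iSTFT.

    enhance_mag: any callable mapping a noisy magnitude spectrogram
    (batch, freq, frames) to an enhanced one, e.g. a DAE, CNN, or LSTM.
    """
    window = torch.hann_window(n_fft)
    # Analysis: complex STFT of the noisy waveform.
    spec = torch.stft(noisy_wav, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    # Enhancement operates on magnitudes only; the noisy phase is reused,
    # the usual (and inherently lossy) choice in spectral-mapping SE.
    enhanced_spec = torch.polar(enhance_mag(mag), phase)
    # Synthesis: inverse STFT back to the time domain.
    return torch.istft(enhanced_spec, n_fft=n_fft, hop_length=hop,
                       window=window, length=noisy_wav.shape[-1])

# Usage with a placeholder identity "enhancer" (assumed for illustration):
noisy = torch.randn(2, 16000)
enhanced = spectral_mapping_enhance(noisy, enhance_mag=lambda m: m)
```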