Abstract

Owing to their simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. To improve the performance of an E2E model, the local and sequential properties of speech should be efficiently taken into account during modeling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to realize. In this letter, we propose an efficient E2E SE model, termed WaveCRN. Compared with models based on convolutional neural networks (CNNs) or long short-term memory (LSTM), WaveCRN uses a CNN module to capture the locality features of speech and a stacked simple recurrent unit (SRU) module to model the sequential property of those locality features. Unlike conventional recurrent neural networks and LSTM, SRU computations can be efficiently parallelized, with even fewer model parameters. To more effectively suppress the noise components in noisy speech, we derive a novel restricted feature masking approach that performs enhancement on the feature maps in the hidden layers; this differs from the common practice in speech separation of applying an estimated ratio mask to the noisy spectral features. Experimental results on speech denoising and compressed speech restoration tasks confirm that, with the SRU and the restricted feature mask, WaveCRN performs comparably to other state-of-the-art approaches with notably reduced model complexity and inference time.
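
To make the pipeline in the abstract concrete, the sketch below wires a 1-D CNN encoder, a stacked bidirectional SRU, a bounded mask applied to the hidden feature map, and a transposed-convolution decoder in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: all layer sizes, the tanh-bounded mask, and the use of the open-source `sru` package are choices made here for clarity.

```python
import torch
import torch.nn as nn
from sru import SRU  # open-source SRU implementation (assumed installed)

class WaveCRNSketch(nn.Module):
    """Minimal sketch of a CNN + SRU waveform enhancement model with a
    restricted (bounded) feature mask. All hyperparameters are illustrative."""

    def __init__(self, channels=256, kernel=96, stride=48, sru_layers=6):
        super().__init__()
        # 1-D convolution maps the raw waveform to a 2-D feature map (locality).
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride,
                                 padding=kernel // 2)
        # Stacked bidirectional SRU models the sequential property of the features.
        self.sru = SRU(channels, channels, num_layers=sru_layers,
                       bidirectional=True)
        self.proj = nn.Linear(2 * channels, channels)
        # Transposed convolution maps the masked feature map back to a waveform.
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride,
                                          padding=kernel // 2)

    def forward(self, wav):                  # wav: (batch, 1, samples)
        feat = self.encoder(wav)             # (batch, channels, frames)
        x = feat.permute(2, 0, 1)            # SRU expects (frames, batch, channels)
        h, _ = self.sru(x)
        mask = torch.tanh(self.proj(h))      # mask bounded to [-1, 1] (assumption)
        mask = mask.permute(1, 2, 0)         # back to (batch, channels, frames)
        return self.decoder(feat * mask)     # mask the hidden feature map, decode
```

A quick shape check: `WaveCRNSketch()(torch.randn(2, 1, 16000))` returns an enhanced waveform tensor whose length matches the input to within a few samples of convolution padding.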

Highlights

  • Speech-related applications, such as automatic speech recognition (ASR), voice communication, and assistive hearing devices, play an important role in modern society

  • We propose an E2E waveform-mapping-based speech enhancement (SE) method using an alternative CRN, termed WaveCRN, which combines the advantages of convolutional neural networks (CNN) and recurrent neural networks (RNN)

  • We aim to show that the simple recurrent unit (SRU) is superior to long short-term memory (LSTM) in terms of denoising capability and computational efficiency when applied to waveform-based SE (see the timing sketch after this list)
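
The efficiency claim in the last highlight can be probed with a rough timing sketch. The comparison below is an illustration only: it assumes the open-source `sru` package is installed, and the measured ratio will vary with hardware, sequence length, and batch size.

```python
import time

import torch
import torch.nn as nn
from sru import SRU  # open-source SRU implementation (assumed installed)

def avg_forward_time(module, x, runs=20):
    """Average forward-pass time in seconds over several runs."""
    with torch.no_grad():
        module(x)  # warm-up run
        start = time.perf_counter()
        for _ in range(runs):
            module(x)
    return (time.perf_counter() - start) / runs

# Both nn.LSTM and SRU accept input shaped (sequence, batch, features).
x = torch.randn(500, 8, 256)
lstm = nn.LSTM(256, 256, num_layers=4)
sru = SRU(256, 256, num_layers=4)

print(f"LSTM: {avg_forward_time(lstm, x):.4f} s")
print(f"SRU : {avg_forward_time(sru, x):.4f} s")
```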

Introduction

Speech-related applications, such as automatic speech recognition (ASR), voice communication, and assistive hearing devices, play an important role in modern society. Most of these applications are not robust in the presence of noise. Speech enhancement (SE), which aims to improve the quality and intelligibility of noisy speech, is therefore widely used as a front-end in such applications. One class of SE systems carries out enhancement on frequency-domain acoustic features; these are generally called spectral-mapping-based SE approaches. In these approaches, speech signals are analyzed and reconstructed using the short-time Fourier transform (STFT) and inverse STFT, respectively [9]–[13]. Deep learning models, such as fully connected deep denoising auto-encoders [3], convolutional neural networks (CNNs) [14], and recurrent neural networks (RNNs) and long short-term memory (LSTM) [15], [16], are used as the transformation function that converts noisy spectral features into clean ones.
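
As a point of reference, a spectral-mapping pipeline of the kind cited above can be sketched in a few lines of PyTorch. This is a generic illustration rather than any specific cited system: the 512-point FFT, log-magnitude features, network size, and reuse of the noisy phase are assumptions made here, and the mapping network would need to be trained on paired noisy/clean spectra before it enhances anything.

```python
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
window = torch.hann_window(N_FFT)

# A small fully connected network as the spectral transformation function
# (stand-in for the DAE/CNN/LSTM models cited above); sizes are illustrative.
mapper = nn.Sequential(
    nn.Linear(N_FFT // 2 + 1, 1024), nn.ReLU(),
    nn.Linear(1024, N_FFT // 2 + 1),
)

def enhance(noisy_wav):                          # noisy_wav: (samples,)
    spec = torch.stft(noisy_wav, N_FFT, hop_length=HOP, window=window,
                      return_complex=True)       # analysis: STFT
    mag, phase = spec.abs(), spec.angle()
    log_mag = torch.log1p(mag)                   # log-compressed magnitude features
    enh_log = mapper(log_mag.T).T                # map noisy features to clean estimates
    enh_mag = torch.expm1(enh_log).clamp(min=0.0)
    enh_spec = torch.polar(enh_mag, phase)       # reuse the noisy phase
    return torch.istft(enh_spec, N_FFT, hop_length=HOP, window=window,
                       length=noisy_wav.shape[-1])  # synthesis: inverse STFT
```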
