Abstract
In recent years, deep learning models have been employed for speech enhancement. Most existing deep-learning-based methods use fully convolutional neural networks (CNNs) to capture the time–frequency information of input features. Compared with CNNs, a Long Short-Term Memory (LSTM) network is better suited to capturing contextual information along the time axis of the features. However, the computational load of a fully LSTM structure is heavy. To balance model complexity against the capability of capturing time–frequency features, we present an LSTM-Convolutional-BLSTM Encoder-Decoder (LCLED) network for speech enhancement. The LCLED additionally incorporates transposed convolutions and skip connections. The key idea is to use two LSTM parts to model contextual information and convolutional layers to model features along the frequency dimension. Furthermore, to achieve higher quality in the enhanced speech, the a priori Signal-to-Noise Ratio (SNR) is used as the learning target of the LCLED, and the Minimum Mean-Square Error (MMSE) approach is used for postprocessing. The results indicate that, compared with a fully LSTM structure, the proposed LCLED not only reduces model complexity and training time but also improves the quality and intelligibility of the enhanced speech.
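To make the described pipeline concrete, the following is a minimal PyTorch sketch of an architecture matching the abstract: an input LSTM for temporal context, a convolutional encoder for frequency structure, a BLSTM bottleneck, a transposed-convolution decoder with a skip connection back to the LSTM output, an a priori SNR estimate as the network output, and a Wiener-style MMSE gain for postprocessing. All layer sizes, the log-domain SNR parameterization, and the simplified Wiener gain (used here in place of the full MMSE short-time spectral amplitude estimator) are illustrative assumptions, not the paper's published configuration.

```python
import torch
import torch.nn as nn

class LCLED(nn.Module):
    """Minimal sketch of an LSTM-Convolutional-BLSTM Encoder-Decoder.

    Layer sizes and the single conv/deconv pair are illustrative
    assumptions; the paper's exact topology may differ.
    """

    def __init__(self, n_freq=161, hidden=256):
        super().__init__()
        # LSTM part: models contextual information along the time axis.
        self.lstm_in = nn.LSTM(n_freq, hidden, batch_first=True)
        # Convolutional encoder: models local structure, downsampling time by 2.
        self.enc = nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1)
        # BLSTM bottleneck: bidirectional context at the reduced rate.
        self.blstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                             bidirectional=True)
        # Transposed-convolution decoder mirrors the encoder's downsampling.
        self.dec = nn.ConvTranspose1d(hidden, hidden, kernel_size=3, stride=2,
                                      padding=1, output_padding=1)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, x):                       # x: (batch, time, freq)
        h, _ = self.lstm_in(x)                  # (B, T, H)
        e = self.enc(h.transpose(1, 2))         # (B, H, T/2)
        b, _ = self.blstm(e.transpose(1, 2))    # (B, T/2, H)
        d = self.dec(b.transpose(1, 2)).transpose(1, 2)  # (B, T, H)
        d = d + h                               # skip connection: encoder -> decoder
        # Output: estimated a priori SNR in the log domain (keeps it unbounded).
        return self.out(d)


def mmse_gain(xi):
    """Wiener-style gain from the estimated a priori SNR xi.

    A simplified stand-in for the MMSE postprocessing step.
    """
    return xi / (1.0 + xi)


# Usage: enhance a noisy magnitude spectrogram (even number of frames assumed).
model = LCLED()
noisy_mag = torch.rand(1, 100, 161)             # (batch, frames, freq bins)
xi = torch.exp(model(noisy_mag))                # back to linear a priori SNR
enhanced_mag = mmse_gain(xi) * noisy_mag
```

This sketch reflects the complexity trade-off the abstract argues for: the recurrent layers run only along time, while the cheap convolution/transposed-convolution pair handles local feature extraction, and the BLSTM operates at half the frame rate.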