Abstract

This paper proposes a novel fully convolutional neural network (FCN), called FLGCNN, for end-to-end speech enhancement in the time domain. The proposed FLGCNN is built mainly on an encoder and a decoder, with an extra convolutional short-time Fourier transform (CSTFT) layer and a convolutional inverse STFT (CISTFT) layer added to emulate the forward and inverse STFT operations. These layers integrate frequency-domain knowledge into the model, since the underlying phonetic information of speech is represented more clearly by time–frequency (T-F) representations. In addition, the encoder and decoder are constructed from gated convolutional layers, so the model can better control the information passed through the hierarchy. Moreover, motivated by the popular temporal convolutional neural network (TCNN), a temporal convolutional module (TCM), which efficiently models the long-term dependencies of the speech signal, is inserted between the encoder and decoder. Because the entire framework realizes end-to-end speech enhancement, we also optimize the proposed model with different utterance-based objective functions to examine the impact of the loss function on performance. Experimental results demonstrate that the proposed model consistently outperforms competitive speech enhancement methods.
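The CSTFT layer described above is typically realized as a strided 1-D convolution whose fixed kernels are the windowed cosine and sine basis functions of the discrete Fourier transform, so the forward STFT becomes an ordinary convolution that can sit inside the network. The following is a minimal plain-Python sketch of that idea; the function names, FFT size, and hop length are illustrative assumptions, not the paper's actual implementation.

```python
import math

def cstft_kernels(n_fft, window):
    # Fixed convolution kernels: windowed cosine/sine Fourier bases.
    # Cross-correlating a signal frame with these kernels yields the
    # real and imaginary parts of that frame's DFT (i.e., the STFT).
    kernels = []
    for k in range(n_fft // 2 + 1):  # non-negative frequency bins only
        cos_k = [window[n] * math.cos(2 * math.pi * k * n / n_fft)
                 for n in range(n_fft)]
        sin_k = [-window[n] * math.sin(2 * math.pi * k * n / n_fft)
                 for n in range(n_fft)]
        kernels.append((cos_k, sin_k))
    return kernels

def conv_stft(signal, n_fft=8, hop=4):
    # Emulate the forward STFT as a strided 1-D convolution:
    # stride = hop length, kernel length = FFT size, Hann window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft)
              for n in range(n_fft)]
    kernels = cstft_kernels(n_fft, window)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = signal[start:start + n_fft]
        frame = [(sum(c * x for c, x in zip(cos_k, chunk)),
                  sum(s * x for s, x in zip(sin_k, chunk)))
                 for cos_k, sin_k in kernels]
        frames.append(frame)
    return frames  # frames[t][k] = (real, imag) of bin k at frame t
```

In a learned model these kernels would simply be the (frozen or trainable) weights of a 1-D convolutional layer, and the CISTFT layer would apply the transposed convolution with the corresponding synthesis kernels to map the T-F representation back to the waveform.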
