Abstract
Speech enhancement systems are deployed in many devices, such as hearing aids. To improve the quality of speech retrieved from noisy observations, this paper proposes a two-stage network built on the frequency-time dilated dense network (FTDDN). The improvement lies in three aspects. First, both frequency modeling and temporal modeling are considered in optimizing a time-frequency mask. Second, to acquire a large receptive field, dilated convolution is incorporated into three basic processing units: the frequency-dilated convolutional unit (FDCU), the time-dilated convolutional unit (TDCU), and the frequency-time dilated convolutional unit (FTDCU). Third, for each of these unit types, 12 units are densely connected to assemble a frequency-dilated dense block (FDDB), a time-dilated dense block (TDDB), or a frequency-time dilated dense block (FTDDB), all of which are combined with feature-mapping operators to build up an FTDDN. With these components, high-quality speech can be retrieved by applying information-reuse and feature-fusion operations across the two FTDDNs of the two-stage model. Using the LibriSpeech and VCTK data sets, we conducted several experimental comparisons between our method and state-of-the-art speech enhancement methods, showing that our proposed model outperforms these baseline models.
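The abstract motivates dilated convolution by the large receptive field it yields when units are stacked. The abstract does not state the kernel sizes or dilation rates used in the paper, so the values below (kernel size 3, dilations doubling across units) are purely illustrative assumptions; the sketch only shows the standard receptive-field arithmetic for a chain of dilated convolutions, such as the 12 stacked units in a dense block.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field (in frames or frequency bins) of stacked 1-D
    dilated convolutions: each layer widens it by (kernel_size - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Hypothetical configuration (not from the paper): 12 units, kernel 3,
# dilation doubling for 6 units and then repeating the cycle.
dilations = [2 ** (i % 6) for i in range(12)]
print(receptive_field(3, dilations))  # much wider than 12 non-dilated layers
```

With undilated convolutions (dilation 1 everywhere), the same 12 layers would cover only `1 + 2 * 12 = 25` positions, which is why the paper leans on dilation to widen the context seen by each block.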