Abstract

In recent years, Residual Networks (ResNets) have significantly increased the modeling power of convolutional neural networks (CNNs) by introducing residual connections. In this paper, we explore incorporating spatial and channel attention into the ResNet architecture for noisy speech recognition. In our experiments, spatial attention is implemented as a bottom-up top-down structure in which the input features are first downsampled and then upsampled to generate attention maps. At each ResNet block, the CNN features are combined with spatial attention maps over the time-frequency plane, learning to attend to salient acoustic features and suppress noise. Our model also includes channel attention, which attends to the different channels of the feature maps. ResNet blocks with spatial and channel attention modules can be easily stacked to construct deeper networks. We show that the proposed network structure can suppress noisy signals in speech audio without requiring parallel clean speech for training, and it achieves promising WER reductions on CHiME2 and CHiME3.
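To make the described block structure concrete, the following is a minimal PyTorch sketch of a residual block with bottom-up top-down spatial attention and channel attention. The layer sizes, the number of down/upsampling stages, the squeeze-and-excitation-style channel attention, and the way the attention maps modulate the features are all assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of a ResNet block with spatial and channel attention.
# All architectural details below are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel attention via global pooling and a small gating MLP (assumed design)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, T, F)
        w = self.fc(x.mean(dim=(2, 3)))        # pool over the time-frequency plane
        return x * w[:, :, None, None]         # reweight feature channels


class SpatialAttention(nn.Module):
    """Bottom-up top-down attention: downsample, then upsample to an attention map."""
    def __init__(self, channels):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.up_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                      # x: (B, C, T, F)
        y = F.relu(self.down(x))               # bottom-up: reduce time-frequency resolution
        y = F.interpolate(y, size=x.shape[2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(self.up_conv(y))  # top-down: attention map over (T, F)


class AttentionResBlock(nn.Module):
    """Residual block whose features are modulated by spatial and channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.spatial_att = SpatialAttention(channels)
        self.channel_att = ChannelAttention(channels)

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = self.conv2(h)
        h = h * self.spatial_att(h)            # attend over the time-frequency plane
        h = self.channel_att(h)                # attend over feature channels
        return F.relu(x + h)                   # residual connection


# Blocks stack directly to build a deeper network, e.g.:
# net = nn.Sequential(*[AttentionResBlock(64) for _ in range(4)])
```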
