Abstract

In recent years, Residual Networks (ResNets) have significantly increased the modeling power of convolutional neural networks (CNNs) by introducing residual connections. In this paper, we explore the incorporation of spatial and channel attention into the structure of ResNets for noisy speech recognition tasks. In our experiments, we implement spatial attention as a bottom-up top-down structure in which the input features are first downsampled and then upsampled to generate attention maps. At each block of the ResNet, the generated CNN features are combined with spatial attention maps over the time-frequency space, learning to attend to salient acoustic features and suppress noise. Our model also includes channel attention that attends to different channels of the feature maps. ResNet blocks with spatial and channel attention modules can be easily stacked to construct deeper networks. We show that the proposed network structure is able to suppress noisy signals in speech audio without requiring parallel clean speech for training, and achieves promising WER reductions on CHiME2 and CHiME3.
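
The following is a minimal sketch of how such an attentive ResNet block could be assembled, written in PyTorch style. The module names (`ChannelAttention`, `SpatialAttention`, `AttentiveResBlock`), the squeeze-and-excitation form of the channel gate, and all hyperparameters (kernel sizes, reduction ratio, single pooling stage) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Gate over feature-map channels (squeeze-and-excitation style; illustrative)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (batch, channels, time, freq)
        w = self.fc(x.mean(dim=(2, 3)))          # global pooling over time-frequency
        return x * w.unsqueeze(-1).unsqueeze(-1) # rescale each channel


class SpatialAttention(nn.Module):
    """Bottom-up top-down attention: downsample, then upsample to an attention map."""
    def __init__(self, channels):
        super().__init__()
        self.down = nn.Sequential(               # bottom-up: coarser time-frequency grid
            nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                        # x: (batch, channels, time, freq)
        h = self.down(x)
        h = F.interpolate(h, size=x.shape[2:],   # top-down: back to input resolution
                          mode="bilinear", align_corners=False)
        return torch.sigmoid(self.out_conv(h))   # soft mask over the time-frequency space


class AttentiveResBlock(nn.Module):
    """Residual block whose features are gated by spatial and channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention(channels)

    def forward(self, x):
        f = self.body(x)
        f = f * self.spatial(f)                  # attend over time-frequency positions
        f = self.channel(f)                      # attend over channels
        return F.relu(x + f)                     # residual connection; blocks stack easily
```

Because each block keeps the residual form, several `AttentiveResBlock` modules can be stacked to build a deeper network; the spatial maps act as learned soft masks over the spectrogram, which is consistent with suppressing noise without parallel clean-speech targets.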
